Text Mining for Big Data Analysis in Financial Sector: A Literature Review

Big data technologies have had a strong impact on different industries since the last decade, an impact that continues today and tends toward becoming omnipresent. The financial sector, like most other sectors, has concentrated its operating activities mostly on the investigation of structured data. However, with the support of big data technologies, information stored in diverse sources of semi-structured and unstructured data can be harvested. Recent research and practice indicate that such information can be of interest for the decision-making process. Questions about how and to what extent research on data mining in the financial sector has developed, and which tools are used for these purposes, remain largely unexplored. This study aims to answer three research questions: (i) What is the intellectual core of the field? (ii) Which techniques are used in the financial sector for text mining, especially in the era of the Internet, big data, and social media? (iii) Which data sources are most often used for text mining in the financial sector, and for which purposes? In order to answer these questions, a qualitative analysis of the literature is carried out using a systematic literature review and citation and co-citation analysis.


Introduction
The financial sector generates a vast amount of data, such as customer data, logs from financial products, and transaction data, which can be used to support decision making together with external data, such as social media data and data from websites. Finacle Connect (2018) [1] indicates the top 10 technologies for financial industries, including the rise of the API economy, cloud business enablement, blockchain for banking, and the usage of artificial intelligence. Turner et al. (2012) [2], in the Executive Report prepared for the IBM Institute for Business Value, indicate that 71% of banking and financial institutions use big data analytics to generate a competitive advantage for their organizations. The same authors state that in 2010 the share of such banking and financial institutions was 36%, indicating an increase of 97% in two years. This increase points out the relevance of big data technologies in today's business for long-standing business challenges in the banking and financial sector. Applications of big data in the financial sector are various, including social media analysis, web analytics, risk management, fraud detection, and security intelligence. One of the possible routes to extracting information from the vast amount of big data is text mining or text analytics (Pejic-Bach et al., 2019) [3].
The aim of text mining (also referred to as text data mining and text analytics) is to analyze textual documents (including emails, reviews, plain texts, web pages, reports, and official documents) in order to extract data, transform it into information, and make it useful for various types of decision making. Text mining encompasses linguistic, statistical, and machine learning techniques, whose results can be used, in the final stage, for analysis, visualization (via maps, charts, mind maps), integration with structured data in databases or warehouses, machine learning, etc. Although text mining is conducted on unstructured text, one of the first tasks is to organize and structure the text in a way suitable for further qualitative and quantitative analysis. Text mining extracts relevant words (N-grams) and relationships between them in order to categorize them and draw conclusions relevant to a business problem or scientific inquiry. In other words, the goal of text mining is the extraction of knowledge and patterns from various text documents (Zhai, Velivelli, and Yu, 2004) [4]. Some of the usual text mining undertakings include classification and clustering of phrases or topics, named entity recognition, information extraction (Yehia, Ibrahim, and Abulkhair, 2016) [5], sentiment analysis (Schumaker, Zhang, Huang, and Chen, 2012 [6]; Nakayama et al., 2018 [7]), keyword extraction, and natural language processing (NLP) (Ong, Chen, Sung, and Zhu, 2005 [8]; Klopotan, Zoroja and Meško, 2018 [9]), including tagging, parsing, topic detection, etc. Herráez, Bustamante and Saura (2017) [10] used text mining to extract topics using content analysis of e-commerce organizations, while Reyes-Menendez et al. (2018) [11] used text mining to extract topics from social media and to classify them according to sentiments.
Increased interest has been paid to multilingual text mining in order to get insight into information across languages.
Text mining has gained popularity with the availability of big data resources in the financial sector. In this way, financial organizations can identify valuable information from customer opinions, corporate documents, posts on social networks, e-mails, and call logs, and detect customer churn, fraud, or risks, etc. In order to highlight the various aspects of the use of text mining in banking and finance, this study aims to answer three research questions: (i) What is the intellectual core of the field? (ii) Which text mining techniques are used in the financial sector, especially in the era of the Internet, big data, and social media? (iii) Which data sources are the most often used for text mining in the financial sector, and for which purposes?
In order to answer these questions, a qualitative literature analysis is conducted using a systematic literature review and citation and co-citation analysis. These methodologies allow mapping and analysis of the evolution of the scientific field (Batistič, Černe, and Vogel, 2017) [12]. Besides, in order to address the second and third research questions, the paper provides an overview of typical text mining techniques used in the financial sector and analyses them according to the type of data sources used, as well as according to their typical business applications.
This paper aims to contribute to both theory and practice. Through the citation and co-citation analysis and the answers to the first and second research questions, the primary theoretical contributions lie in summing up the conclusions and research trends of the field. In addition, the second and third research questions offer practical contributions through a summarized overview of relevant text mining techniques according to the data sources used and typical applications.
The paper is structured as follows. The introductory part examines the impact of big data analysis on the financial sector with an emphasis on text data analysis. In the second chapter, a survey of similar literature reviews focusing on data mining and text mining applications in finance is presented, which is used for developing the research questions. The third chapter presents the methodology used for conducting the research and the steps of the research process. The third part presents the results of citation and co-citation analysis. The fourth chapter presents various text mining techniques used in finance. The fifth chapter provides the analysis of data sources used and typical applications for text mining in finance. Finally, conclusions are given, and further research is proposed.

Research Questions
Since the emergence of data mining as an advanced data manipulation, processing, and modeling approach, interest in its usage in finance has grown exponentially, which generated the need for literature reviews that provide researchers and practitioners with a focused outline of the advantages and disadvantages of data mining utilization in finance (Zhang and Zhou, 2004) [13]. Numerous literature reviews have been conducted focusing on different aspects of data mining applications in various fields of finance, such as stock market prediction (Hajizadeh, Ardakani and Shahrabi, 2010) [14], financial fraud detection (Ngai et al., 2011) [15], and financial risk analysis (Jin, Wang and Zeng, 2018) [16].
Text documents, which are an abundant and dominant source of relevant information in the business domain, are unstructured, which is an obstacle to fast processing of the information stored in them. Therefore, text mining, as an automated approach to the analysis of various text documents, emerged as an attempt to make use of text-based unstructured information. In spite of the importance of text mining for finance, literature surveys have only recently investigated the utilization of text mining for financial applications. Some of the literature reviews included text mining applications only sporadically. Ngai et al. (2011) [15] developed a classification framework for analyzing data mining applications in financial fraud detection, such as classification, clustering, visualization and outlier detection, regression, and prediction. They identified one text mining application in their analysis, which used Naïve Bayes text mining in order to identify employees likely to commit fraud (Holton, 2009) [17]. Gray and Debreceny (2014) [18] designed a research taxonomy for fraud detection in financial statement analysis, identifying the following usage of text mining in this area: deception analysis, as a relevant tool in fraud detection.
Review papers that focus solely on text mining rarely investigate financial applications. Sun, Luo and Chen (2017) [19] reviewed natural language processing applications for text mining aimed at extracting opinions. In their review, they focused on various approaches, such as comparative opinion mining and deep learning. Nassirtoussi et al. (2014) [20] developed a theoretical and practical review of applications of text mining for market prediction, which focuses mainly on online sentiment analysis using social media and news texts, and their utilization for the prediction of the FOREX market and stock exchange markets.
Kumar and Ravi (2016) [21] conducted a literature review of text mining applications in finance. They surveyed papers published from 2000 to 2016 and developed the following groups of text mining applications: FOREX and stock market forecasts, customer relationship management applications, and security applications, focusing on cybersecurity. In their analysis, they focused on text mining algorithms, such as decision trees, neural networks, linear regression, and logistic regression. Their work provides neither a citation and co-citation analysis nor a systematic analysis of the text sources utilized for text mining.
Based on this review of similar research, the following gaps are identified. First, due to the exponential growth of text mining utilization in the field of finance, there is a growing need to provide an up-to-date review that tracks state-of-the-art research. Therefore, the first research question has been outlined as: What is the intellectual core of the field? It aims to provide the answer for a longer research period (from 2000 to 2019), using citation and co-citation analysis. Second, the aim of the paper is to detect the most used text mining techniques, taking into account the most recent advances in the field, such as big data. Therefore, the second research question is posed as: Which text mining techniques are used in the financial sector for text mining, especially in the era of the Internet, big data and social media? Finally, in order to capture avenues for future research, as well as to provide practitioners with an outlook on how to use the text sources available to them online and in their organizations, the third question is posed as: Which data sources are the most often used for text mining in the financial sector, and for which purposes?

Methodology
In order to answer the research questions of this study, multiple research methodologies are used: the bibliometric techniques of citation and co-citation analysis, and a systematic literature review (SLR). SLR outperforms an informal literature review "with respect to the planning for literature review, the design of search string, sources to be searched, publication inclusion and exclusion criteria, publication quality assessment and the data extraction process" (Niazi, 2015, p. 845) [22], since SLR refers to "identifying, assessing, and interpreting available research studies with the purpose to provide answers to the research question" (Wahono, 2015, p. 1) [23]. Following the recommendations of Wahono (2015) [23] and Wang et al. (2016) [24], the research steps were created and used in our SLR analysis (Figure 1).
First, based on the analysis of previous literature reviews of text mining and data mining in finance (as presented above), the need for a systematic review in that area has been identified (Phase 1). The review protocol (Phase 2) and the evaluation of the review protocol (Phase 3) are related to the search strategy and study selection process (Wahono, 2015) [23]. The review protocol is evaluated in relation to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) standard for writing SLRs (Moher et al., 2006) [25]. Following the established bibliometric protocol, the data for the research is acquired by searching publications using the search string ("text mining" AND finan*) to be found in all fields, for the period from 2004 (Phase 4). The search was conducted in October 2018 in the Social Science Citation Index (SSCI), Science Citation Index Expanded (SCI-EXPANDED), and Emerging Sources Citation Index (ESCI) databases, and it generated 345 items. After a manual analysis of the literature, the final list of 123 publications was used for this analysis (Phase 5). This phase is presented in detail in Figure 2. Data were extracted from primary studies by manual analysis and extraction of the most used text-mining techniques (Phase 6 and Phase 7). Bibexcel and Pajek software are employed to identify important papers and conduct co-citation analysis of papers indexed in Web of Science (Batistič, Černe and Vogel, 2017) [12], for the purpose of conducting Phase 8 and Phase 9. Citation and co-citation analysis were used in order to detect the most relevant work in the field (Phase 10 and Phase 11). The most cited papers were discussed in relation to the identified text mining techniques in the financial sector (Phase 12).
Figure 2 presents the PRISMA flow diagram, outlining the process of developing the final list of papers included in the analysis, following the practice used in numerous systematic literature reviews, such as Saura et al. (2017) [26]. Initially, 345 papers were identified by searching Web of Knowledge with the search term "text mining" AND financ*. In order to select papers that focus on the utilization of text mining in finance, the titles and abstracts of the 345 initially extracted papers were examined. Among them, 186 papers were not related to the topic, e.g., they used the phrase text mining in a different context or mentioned the term finance only sporadically. This step resulted in 159 potentially adequate papers. The authors read the full text of these papers and excluded an additional 59 papers that are not related or are only vaguely related to the research topic, or which provide only a shallow description of the text mining approach. An additional 23 papers were tracked by a snowballing approach, using the references of the papers. Therefore, the final list of 123 papers was developed, which is the focus of our analysis. The rest of the paper focuses on these 123 papers, which address the use of text mining in various applications. Appendix A provides the list of papers included in the literature review.
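The selection counts reported above can be checked with a few lines of arithmetic. The following is only a sketch: the variable names are illustrative and not part of the original protocol, while the numbers are those stated in the text.

```python
# Counts reported in the text for each step of the selection process.
identified = 345             # records returned by the database search
excluded_on_abstract = 186   # off-topic after title/abstract screening
excluded_on_full_text = 59   # unrelated or too shallow after full-text reading
added_by_snowballing = 23    # papers found through reference lists

screened_in = identified - excluded_on_abstract
retained = screened_in - excluded_on_full_text
final_sample = retained + added_by_snowballing

print(screened_in)   # 159
print(retained)      # 100
print(final_sample)  # 123
```

The intermediate value of 100 papers retained after full-text reading is not stated explicitly in the text but follows from the reported figures.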

Citation and Co-Citation Analysis
RQ1: What is the intellectual core of the field?
Throughout the whole period, an exploratory focus is evident, i.e., financial forecasting or prediction of financial markets. The number of papers was initially low. In the period from 2001 to 2009 fewer than 5 papers were published per year, while in the period from 2011 to 2014 fewer than 10 papers were published per year. However, the number of papers increased to 15 in 2015, followed by continued growth in the following years (14 papers in 2016, 18 papers in 2017, 27 papers in 2018). It can be presumed that this increase is the result of the overall interest of the financial community in various data analysis approaches, which generated the creation of diverse solutions covered by the FinTech umbrella (Arner, Barberis and Buckley, 2015) [27]. Figure 3 presents the number of citations on this topic over the examined period, with a continuous increase in the number of published studies. A citation network of all papers has been created in order to identify the most important documents in the network, and therefore the trends of research in the field. First, a citation analysis of the identified papers has been conducted. A citation occurs when one study refers to another paper as a source, and citation analysis provides information on papers and their sources, as well as insight into the overall number of citations (Wang et al., 2016) [24]. A co-citation occurs when two papers are cited together in a third paper. Co-citation analysis identifies the sources of research in the observed academic area by detecting the most relevant papers, and it generates indicators of paper impact, since the number of citations is correlated with the relevance of the work in the particular academic community (Batistič, Černe, and Vogel, 2017) [12]. The observed number of citations of the papers is 1107, while the average number of citations is 9.
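To make the distinction between citation and co-citation counts concrete, the following minimal sketch counts both from reference lists. The paper and reference identifiers are hypothetical toy data, and the function names are illustrative; the study itself used Bibexcel and Pajek rather than code of this kind.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing paper maps to the set of works it cites.
reference_lists = {
    "paper_A": {"ref_1", "ref_2", "ref_3"},
    "paper_B": {"ref_1", "ref_2"},
    "paper_C": {"ref_2", "ref_3"},
}

def citation_counts(reference_lists):
    """Number of citing papers in which each reference appears."""
    counts = Counter()
    for cited in reference_lists.values():
        counts.update(cited)
    return counts

def cocitation_counts(reference_lists):
    """Number of citing papers in which each pair of references appears together."""
    pairs = Counter()
    for cited in reference_lists.values():
        # Sort so that each unordered pair always gets the same key.
        for a, b in combinations(sorted(cited), 2):
            pairs[(a, b)] += 1
    return pairs

citations = citation_counts(reference_lists)
cocitations = cocitation_counts(reference_lists)
print(citations["ref_2"])               # 3
print(cocitations[("ref_1", "ref_2")])  # 2
print(cocitations[("ref_1", "ref_3")])  # 1
```

Pairs with high co-citation counts become the strongly linked nodes of a co-citation network.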
The research focus of these studies is presented in Table 1.
Table 1. Citation analysis of the most cited papers in the Web of Science.

Paper | Total Citations | Objectives
Text mining for market prediction: A systematic review. Nassirtoussi, Wah, Aghabozorgi, and Ngo (2014) [20] | 97 | The study presents an overview of studies related to the market forecast based on online text mining and creates an outlook to the main elements. In addition, the paper presents a comparison of all systems with the identification of the main differentiating factors.
Identification of fraudulent financial statements using linguistic credibility analysis. Humpherys, Moffitt, Burns, Burgoon, and Felix (2011) [28] | 73 | The study analyses corporate fraud detection "through a unique application of existing text-mining methods on the Management's Discussion and Analysis and tests for linguistic differences between fraudulent and non-fraudulent MD&As" (
[30] | 51 | The study analyzes the effect of adding text information to the churn prediction system that uses only traditional marketing information.

On the other hand, "the importance of a paper can be determined by its influence in the citation network, which can be measured by two indexes, degree centrality, and betweenness centrality. Degree centrality is measured as the number of direct ties that a node in the network has, while betweenness centrality implies the extent to which one node exists on the shortest path between other nodes" (Wang et al., 2016, p. 35) [24]. In this regard, the co-citation analysis has been conducted, and the co-citation network plotted in Figure 4 was created. Following the usual practice, the density of the network is calculated, which outlines the connections between the network nodes. If the density is above 0.5, it is considered high (Abrahamson and Rosenkopf, 1997) [31]. In this research, the network density is 0.754, which indicates that the links between the papers are quite abundant. According to the degree centrality and betweenness centrality measures, the most influential papers are those published by Schumaker and Chen (2009) [32], Schumaker et al. (2012) [6], Tetlock (2007) [33], Loughran and McDonald (2010) [34], Bollen, Mao, and Zeng (2011) [35], Pang and Lee (2008) [36], and Hagenau, Liebmann, and Neumann (2013) [37]. They will be briefly discussed in order to detect the research trends in the field.
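The three network measures mentioned above can be illustrated on a small undirected graph. This is a sketch under stated assumptions: the edge list is a hypothetical toy network (not the network of Figure 4), the function names are illustrative, density is taken as the share of possible ties that are present, and betweenness is computed with Brandes' algorithm without normalization.

```python
from collections import deque

# Hypothetical toy co-citation network: undirected ties between papers.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

def build_adjacency(edges):
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency

def density(adjacency):
    """Share of possible ties that are present: 2m / (n(n-1))."""
    n = len(adjacency)
    m = sum(len(nbrs) for nbrs in adjacency.values()) / 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def degree_centrality(adjacency):
    """Number of direct ties of each node."""
    return {node: len(nbrs) for node, nbrs in adjacency.items()}

def betweenness_centrality(adjacency):
    """Brandes' algorithm for unweighted, undirected graphs (unnormalized)."""
    score = {node: 0.0 for node in adjacency}
    for source in adjacency:
        order = []                                   # nodes in BFS visiting order
        predecessors = {node: [] for node in adjacency}
        n_paths = {node: 0 for node in adjacency}    # number of shortest paths
        distance = {node: -1 for node in adjacency}
        n_paths[source], distance[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if distance[w] < 0:
                    distance[w] = distance[v] + 1
                    queue.append(w)
                if distance[w] == distance[v] + 1:
                    n_paths[w] += n_paths[v]
                    predecessors[w].append(v)
        dependency = {node: 0.0 for node in adjacency}
        for w in reversed(order):
            for v in predecessors[w]:
                dependency[v] += n_paths[v] / n_paths[w] * (1 + dependency[w])
            if w != source:
                score[w] += dependency[w]
    # Each unordered pair of endpoints was counted twice.
    return {node: value / 2 for node, value in score.items()}

network = build_adjacency(edges)
print(density(network))                      # 0.5
print(degree_centrality(network)["C"])       # 3
print(betweenness_centrality(network)["C"])  # 4.0
```

In the toy network, node C has the most direct ties and also lies on the most shortest paths, so both measures single it out.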
In addition to the most cited studies, the studies with the highest degree centrality deal with similar topics. Thus, Schumaker and Chen (2009, p. 1) [32] "examine a predictive machine learning approach for financial news articles analysis using words and phrases representations." Tetlock (2007, p. 1139) [33] analyses the relationship "between the media and the stock market using daily content from the Wall Street Journal column", while Loughran and McDonald (2010, p. 36) [34] develop a negative word list that "reflects tone in financial text and link the lists to returns, trading volume, return volatility, fraud, material weakness, and unexpected earnings". Furthermore, Bollen et al. (2011) [35] analyze the text content of daily Twitter feeds and assess its utilization for stock-market predictions. Their work is similar to that of Hagenau et al. (2013) [37], who examine whether financial news can be used for the prediction of stock prices. These studies and authors represent the field's intellectual core and can be considered the most influential in the field. Besides, the bibliographic co-citation analysis points to two studies as a source of the field, dealing with the portfolio analysis model for a stable Paretian market and with the theoretical and empirical literature on the efficient market models. In other words, the intellectual core papers are related to financial markets rather than to text mining. The oldest cited paper dealing with text mining of web data was published in 1998 with the aim of presenting techniques for the exploitation of textual financial news and analysis results (Wuthrich et al., 1998) [38]. Then, Tumarkin and Whitelaw (2013, p. 41) [39] analyzed the "relationship between Internet message board activity and abnormal stock returns and trading volume." After 2003, there is a growing interest in this research topic.

Text Mining Techniques in Finance
RQ2: Which text mining techniques are used in the financial sector for textual mining, especially in the era of the Internet, big data and social media?
To address the second question, a comprehensive search has been conducted using the same string, but in other databases. In order to narrow the research, and given the importance of big data for the information era, this research is focused on big data environments rather than simple text mining technologies. Big data denotes the vast and complex amount of data in structured documents, semi-structured documents (documents that have structure but differ among themselves, e.g., XML documents, HTML files), and unstructured documents (in terms of record layout, embedded metadata). Mathew (2012) [40] points out several issues in big data analytics: diversification of data types along with the vast amount of data, more changes and uncertainty, more unanticipated questions, and real-time needs and decision-making.
Many companies and institutions have a large amount of data, which can be exploited to extract data and create information linked to new knowledge. In this process, text mining can help in customer analytics, in identifying marketing opportunities, in fraud prevention, in improving operational activities, and in developing new business models. When it comes to financial institutions, two primary data sources can be used for text mining: internal and external data sources. Internal data can be transaction data, log data, and application data. External data can come from any social media and website. In the digital era, big data is generated from an increasing number of sources, including online clicks, mobile transactions, social media, and data generated through sensor networks (Yehia et al., 2016) [5]. While financial institutions already have some input from their customers, a benefit of big data technologies is the ability to compare an institution with its competitors using the same objective metric (usually a metric generated by machine learning). In this sense, text mining can be integrated into the business intelligence process and business applications, which represents a promising target. Fan et al. (2006) [41] suggest future development in the integration of data and text mining used to discover hidden information in various types of documents collected from various resources, with the purpose of improved decision-making solutions.
In order to classify the text mining techniques that are the most relevant for big data analytics in the financial sector, an expert panel approach was used, so that the classification of techniques reflects the usage of text mining in practice. Four experts were selected from tech companies specialized in big data that deliver services to various financial organizations (e.g., banks, insurance companies, and stock-market exchanges). This approach follows the recent practice that proposes the inclusion of experts from the field in order to gain results that are tightly related to the practical usage of techniques on a day-to-day basis (Best et al., 2009 [42]; Bannan-Ritland, 2003 [43]). The experts selected the following text-mining techniques used in big data analytics as the most relevant for the financial sector: keyword extraction, named entity recognition, gender prediction, sentiment analysis, topic extraction, and social network analysis. In the rest of the chapter, the above-mentioned techniques are presented, and for each technique, selected studies dealing with it are presented.

Keyword Extraction
With new technologies and analyses, and especially with big data analytics in which vast volumes of new data arrive from different sources, there is a need for keyword extraction. Table 2 presents the selected papers dealing with the "keyword extraction" technique in the financial sector.

Table 2. Selected papers dealing with the "keyword extraction" technique.

Authors | Research
Hasan and Ng (2014) [44] | Automatic keyword extraction
Roh, Jeong, and Yoon (2017) [45] | Multilayered keyword extraction methodology for structuring technological information through natural language processing (NLP), with the purpose of discovering trends in patent analysis, technology classification or knowledge flow among technologies
Eler et al. (2018) [46] | Pre-processing steps with impact on text mining techniques: lowercasing, deletions, stemming/lemmatization, PoS (Part-of-Speech) tagging, parsing

Keyword extraction plays a key role in text mining financial applications. The simple form is when a list of keywords is needed in order to extract related comments and articles from an external source. A more complex and more sophisticated usage is automatic keyword extraction (Hasan and Ng, 2014) [44]. This field has gained vast interest in the past several years, since volumes of data are growing and not every document or comment can be read sequentially. The goal is to extract a "sequence of words," called N-grams, through a semi-automated process. However, this process does require manual validation and comparison with a reference model, i.e., a "gold standard," in order to assess the quality of the tool. Quality of terminology has gained importance with regard to costs, user perceptions, and customer satisfaction.
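The notion of an N-gram as a sequence of adjacent words can be sketched in a few lines. The tokenization rule, the sample sentence, and the function names below are illustrative assumptions, not the procedure of any of the cited studies.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lowercase word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical comment text used only for illustration.
comment = ("The central bank raised the interest rate, "
           "and a higher interest rate hit bank stocks")
tokens = tokenize(comment)
bigram_counts = Counter(ngrams(tokens, 2))

print(len(tokens))                          # 15
print(bigram_counts[("interest", "rate")])  # 2
```

Frequent N-grams such as ("interest", "rate") are candidate keywords, which would then be validated manually against a reference list.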
To summarize, keyword extraction has four approaches (Bharti et al., 2017) [47]: statistical (term frequency, inverse document frequency), linguistic (WordNet, n-gram, PoS (Part-of-Speech) patterns), machine learning (Naïve Bayes), and hybrid (some combination of the previous three approaches). In most cases, pre-processing techniques are needed, starting from corpus collection, where documents can be collected from one or more sources, depending on various criteria, including corpus size and domain. Possible further steps are deduplication of documents or articles and preformatting, including conversion of scanned documents by OCR (optical character recognition). Once the text is in the appropriate format, it is tokenized, i.e., represented as a list of words, numbers, signs, and punctuation, and treated as a "bag of words." Pre-processing steps have a substantial impact on text mining techniques (Eler et al., 2018) [46]. Through lowercasing, the whole text (all tokens) is converted into lowercase, where some mistakes can happen (e.g., converting the abbreviation US into the pronoun us). To reduce noise in the text, there are various techniques, such as deletion of double spaces, numbers, names (if needed), punctuation, rare words, and stop-words. The next step in reducing dimensionality is stemming or lemmatization of keywords in order to gather all variations of a specific keyword (example: bank, banking, banks -> bank). Lemmatization uses PoS (Part-of-Speech) tagging to identify grammatical categories. This feature can be useful in parsing algorithms to detect the correct PoS of a word or to extract sequences of words (N-grams). Many text mining tools use stemming, which cuts off affixes (banking -> bank+ing). This feature is useful later on, especially for mention counting or an online presence metric. This metric counts how many mentions of a specific keyword occur for a particular page name or username. The online presence metric can be used for comparing institutions in the financial sector (e.g., bank1 vs. bank2).
Text mining can also be used in discovering trends in patent analysis, technology classification, or knowledge flow among technologies, as in Roh, Jeong, and Yoon (2017) [45]. They proposed a multilayered keyword extraction methodology for structuring technological information through NLP. As pure keyword extraction has deficiencies, such as omitting meaningful keywords, they suggested extracting "meaningful keyword sets related to technological information. Firstly, they analyzed the characteristics of technological information" (Roh, Jeong, and Yoon, 2017, p. 1) [45], structured it by information type, and then performed keyword extraction in each type through NLP.
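The pre-processing steps and the statistical approach described above can be sketched as follows. Everything here is an illustrative assumption: the stop-word list is tiny, the suffix stripping is a crude stand-in for a real stemmer (such as Porter's), the documents are invented, and the TF-IDF variant is the plain form with tf = count / document length and idf = ln(N / df).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "its", "for", "on"}
SUFFIXES = ("ing", "s")  # crude suffix stripping, NOT a real stemmer

def stem(token):
    """Cut a known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Lowercase, drop numbers and punctuation, remove stop-words, stem."""
    # Note: lowercasing would also turn the abbreviation "US" into "us".
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def tf_idf(documents):
    """Return one {term: score} dictionary per document."""
    processed = [preprocess(d) for d in documents]
    n_docs = len(processed)
    doc_freq = Counter(term for doc in processed for term in set(doc))
    scores = []
    for doc in processed:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical documents used only for illustration.
documents = [
    "The bank is raising fees on banking services.",
    "Banks and insurers report fraud.",
    "Customers praise the bank for its mobile app.",
]
scores = tf_idf(documents)
print(preprocess("Banks, banking and the bank"))  # ['bank', 'bank', 'bank']
print(scores[0]["bank"])                          # 0.0
```

The stem "bank" occurs in every document, so its inverse document frequency, and hence its score, is zero, while rarer terms such as "fraud" receive positive scores. This is why purely statistical scoring is often combined with the linguistic or machine learning approaches listed above.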

Named Entity Recognition
Named entity recognition represents one of the key phases in text mining (Saju and Shaja, 2017) [48] used on large corpora of data, which can be used in information retrieval and extraction and further in NLP, machine translation, question-answering systems, speech recognition, natural language generation, chatbot conversation, machine learning, document indexing, image recognition, etc. Many industries use named entity recognition on big data sets. Most named entity recognition techniques use machine learning methods, which require large amounts of data in order to train a good classifier. Table 3 presents selected papers dealing with the "named entity recognition" technique.
Table 3. Selected papers dealing with the "named entity recognition" technique.

Authors | Research
Alvarado, Verspoor, and Baldwin (2015) [49] | Named entity recognition analysis on financial documentation and publicly available non-financial data set to extract information of risk assessment
Ritter et al. (2011) [50] | Supervised approach for named entity recognition

Named entity recognition is a process that labels a name, i.e., a sequence of words in documents denoting an e-mail address, amount and currency, company/bank/institution name, brand name, city or state name, time, or others (Grishman and Sundheim, 1996) [51]. The three universally accepted named entities are person, location, and organization. Named entity recognition consists mainly of two steps, detection of names in the text and classification by the type of entity, but it can also include discovering relationships among entities. In the detection process, problems of segmentation can appear (e.g., National Bank of Croatia is a single name, rather than Croatia being a separate location), followed by classification, which depends on annotated corpora. Named entity recognition can have business value in industrial applications, such as bank transaction details, detection of contracts and e-mails, machine translation, question answering, spell checking, etc. Alvarado, Verspoor, and Baldwin (2015) [49] conducted named entity recognition analysis on financial documentation and a publicly available non-financial data set to extract information for risk assessment.
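The two steps (detection and classification) and the segmentation problem can be illustrated with a dictionary lookup that always tries the longest candidate span first. This is a minimal sketch, not the machine learning approach used in the cited papers; the `GAZETTEER` entries are hypothetical.

```python
# Hypothetical gazetteer: lowercased entity name -> entity type.
GAZETTEER = {
    "national bank of croatia": "ORGANIZATION",
    "croatia": "LOCATION",
    "zagreb": "LOCATION",
}
MAX_LEN = max(len(name.split()) for name in GAZETTEER)


def find_entities(text):
    """Detect gazetteer names in text and return (name, type) pairs."""
    tokens = text.replace(",", " ").replace(".", " ").split()
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, so a multi-word name wins over
        # a shorter name nested inside it (the segmentation problem).
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            label = GAZETTEER.get(candidate.lower())
            if label:
                entities.append((candidate, label))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return entities


print(find_entities("The National Bank of Croatia is based in Zagreb."))
```

With longest-match lookup, "Croatia" inside "National Bank of Croatia" is not reported as a separate location. A dictionary of this kind only recognizes names it already contains, which is one reason trained classifiers are preferred on noisy text.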

Gender Prediction
Information about gender is often useful, especially when the emphasis of analysis is marketing planning and/or better understanding of customers. Table 4 presents selected papers dealing with the "gender prediction" technique in the financial sector.
Table 4. Selected papers dealing with the "gender prediction" technique.

Authors | Research
Phuong and Phuong (2014) [52] | Users' gender based on browsing history, important for marketing and personalization
Kucukyilmaz et al. (2006) [53] | Gender prediction in computer-mediated communication/chatbots
Lotto (2018) [54] | Gender prediction to predict financial inclusion, compared with traditional banking services

A simple approach to solving this problem is to build a dictionary of female and male names and then match that dictionary against usernames. This can be the right approach if fast results are needed. Still, when social media and websites are analyzed, a problem arises from the number of accounts belonging to organizations, bots, and fake accounts with random names. In that case, the presented approach will only recognize what is in the dictionary, thus lowering the probability of recognizing the gender of the customer.
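The dictionary approach, including its limitation on organizational or bot accounts, can be sketched as follows. The name lists and usernames are hypothetical placeholders; a real dictionary would be far larger.

```python
import re

# Hypothetical, very small name dictionaries (assumption).
FEMALE_NAMES = {"ana", "maria", "ivana"}
MALE_NAMES = {"ivan", "marko", "john"}


def guess_gender(username):
    """Match alphabetic parts of a username against the name dictionaries."""
    parts = [p for p in re.split(r"[^a-z]+", username.lower()) if p]
    for part in parts:
        if part in FEMALE_NAMES:
            return "female"
        if part in MALE_NAMES:
            return "male"
    return "unknown"  # organizations, bots, and random names end up here


for name in ("Maria_K88", "john.smith", "bank_bot_42"):
    print(name, guess_gender(name))
```

Every username without a dictionary hit falls into the "unknown" class, which shows why coverage drops on social media data.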
To overcome this limitation, the next step would be to use natural language processing models such as "bag of words" and n-grams, or a combination of both. This approach analyses word usage and differences in writing style. A disadvantage again occurs in the case of data extracted from social media. Features used for this classification task are (Zhang and Zhang, 2010) [55]: words (the authors suggest that a binary representation, i.e., whether the word exists in the document or not, is more effective), average word or sentence length, POS tags (noun, verb, adjective, and adverb), and word factor analysis, i.e., finding groups of similar words (there are 20 lists; an example of a conversation list is known, care, friend, saying). Information gain can be used for feature selection with SVM as a classifier, with accuracy above 72%.
Friedmann and Lowengart (2016) [56] conducted an analysis to explain gender differences when choosing banking services. Galli and Rossi (2014) [57] performed research on gender in the credit market for 7 European countries in the period of financial crisis. Other authors performed gender prediction using various text mining sources, such as browsing history (Phuong and Phuong, 2014) [52] and chatbots (Kucukyilmaz et al., 2006) [53].

Sentiment Analysis
Sentiment analysis or opinion analysis is used in the financial sector to identify the "voice of customers." Table 5 presents selected papers dealing with the "sentiment analysis" technique in the financial sector.
Table 5. Selected papers dealing with the "sentiment analysis" technique in the financial sector.

Authors | Research
Pang and Lee (2008) [36] | Sentiment analysis for determination of writer's attitude towards the specific topic
Nopp and Hanbury (2015) [58] | Sentiment analysis to detect risks in the banking system
Narayanan et al. (2013) [59] | Algorithms with correct feature selection and noise removal process

Sentiment analysis (Pang and Lee, 2008) [36] refers to text analysis or natural language processing techniques that help determine a writer's attitude towards a specific topic. Sentiment analysis is frequently used in the financial domain. Nopp and Hanbury (2015) [58] used sentiment analysis to detect risks in the banking system. Srivastava and Gopalkrishnan (2015) [60] analyzed sentiments for the banking sector in order to assess the functioning of the bank. These narratives are created and disseminated in social interaction.
There are several approaches to building an accurate sentiment model. Some approaches address this problem from a natural language processing view, others from a machine learning view or, in recent years, more specifically as a deep learning problem. The first approach, based on natural language processing, is to build a dictionary of known negative and positive words. For this task, only extreme polarities and words that can be correctly associated with a polarity are needed. Based on the developed dictionary, the sentiment is calculated by a simple count of the dictionary words found in a specific document. The polarity with more discovered words "wins," and the text is classified accordingly. The next approach, based on machine learning, is about creating a large data set containing documents that are first classified manually (by a human). Based on this classification, a machine-learning model can be developed that provides the rules for automated classification. The problem can be addressed as a classification into two classes (positive or negative) or more (e.g., a range from 1 to 5 for sentiment intensity). Features can be unigrams, bigrams, or a combination of both (Go et al., 2009) [61]. A document-term matrix is built based on these features, and the values in this matrix can be either frequencies like "TF (Term Frequency), TF-IDF (Term Frequency-Inverse Document frequency), or binary representation" (Hussin, 2004, p. 158) [62]. In big data architectures, the machine-learning model can be used on batch data but also on real-time data in order to perform real-time classification. Accuracy can be greater than 80% even with simple algorithms, provided that features are selected correctly and noise is removed from the data (Narayanan et al., 2013) [59].
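The dictionary-based approach, in which the polarity with more matched words wins, can be sketched as below. The word lists are hypothetical and deliberately short; the neutral class for ties is an added assumption.

```python
import re

# Hypothetical polarity dictionaries with clearly polarized words only.
POSITIVE = {"good", "great", "excellent", "helpful"}
NEGATIVE = {"bad", "terrible", "poor", "slow"}


def classify_sentiment(text):
    """Count dictionary hits per polarity; the larger count wins."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie or no dictionary word found (assumption)


print(classify_sentiment("Great app and helpful staff, but slow transfers."))
```

This counting scheme ignores negation and context, which is one motivation for the machine learning and deep learning approaches.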
In the last approach, based on deep learning, sentiment analysis is performed using word embeddings, such as word2vec and GloVe (Zhang et al., 2018) [63]. Word embeddings are used to represent words as vectors. With this technique, similar words can be mapped to nearby points in a continuous vector space. Deep learning is an improvement over the other approaches, especially in sentiment classification of relatively short documents (tweets, comments).
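The idea that similar words lie close together in the vector space is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real word2vec or GloVe vectors are learned from large corpora and have many more dimensions.

```python
import math

# Made-up toy vectors (assumption); not taken from any trained model.
EMBEDDINGS = {
    "loan": [0.9, 0.1, 0.0],
    "credit": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


related = cosine_similarity(EMBEDDINGS["loan"], EMBEDDINGS["credit"])
unrelated = cosine_similarity(EMBEDDINGS["loan"], EMBEDDINGS["weather"])
print(related > unrelated)
```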

Topic Extraction
Topic modeling or topic prediction/extraction is based on the number and distribution of terms across documents by counting the probability of belonging to a certain topic. Table 6 presents selected papers dealing with the "topic extraction" technique in the financial sector.
Table 6. Selected papers dealing with the "topic extraction" technique in the financial sector.

Authors | Research
Moro et al. (2015) [64] | Topic detection of a large number of manuscripts using text mining techniques when detecting terms belonging to business intelligence and banking domains (latent Dirichlet allocation model), topics: credit banking, risk, fraud detection, credit approval and bankruptcy
Zhao et al. (2011) [65] | Social media as a source of entity-oriented topics, unsupervised machine learning approach
Lee and So Young (2017) [66] | Framework to identify the rise and fall of emerging topics in the financial industry

Moro et al. (2015, p. 1314) [64] performed topic detection of "a large number of manuscripts using text-mining techniques when detecting terms belonging to business intelligence and banking domains". They used the latent Dirichlet allocation model, together with a dictionary of terms, in order to detect topics and research directions. They grouped articles into several relevant topics, followed by dictionary analysis to identify relations between terms and the topics of the grouped articles. This research showed that credit banking was the main trend, with topics of risk, fraud detection, credit approval, and bankruptcy. With this approach, the probability of each document belonging to a certain topic can be estimated. In this way, it is possible to identify topics capturing more attention. Data from social media can be used to find the topics discussed at a certain time. Previous research indicates that these data "can be a good source of entity-oriented topics that have low coverage in traditional media news" (Zhao et al., 2011, p. 46) [65]. The input to the model should be a document-term matrix with TF-IDF frequencies or a binary representation (0 or 1) as values. A common approach for topic extraction is unsupervised machine learning.
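The document-term matrix with TF-IDF values that serves as model input can be built as in the sketch below. It assumes whitespace tokenization, non-empty documents, relative term frequency, and the unsmoothed weighting tf(t, d) * log(N / df(t)); libraries often apply smoothed variants, so the exact numbers may differ from theirs.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, matrix) with one row per non-empty document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = [
            (counts[t] / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ]
        matrix.append(row)
    return vocabulary, matrix


vocabulary, matrix = tfidf_matrix(["credit risk credit", "fraud risk"])
print(vocabulary)
print(matrix)
```

A term that occurs in every document (here "risk") receives weight zero, so it cannot help separate topics. The binary representation mentioned above would simply replace each value with 1 if the term occurs and 0 otherwise.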
Newer approaches also apply deep learning techniques to topic extraction. A popular word embedding in this case is lda2vec, which is a modification of the word2vec model mentioned in the sentiment analysis section (Moody, 2016) [67]. Lda2vec uses word2vec principles and expands them to word, document, and topic vectors. Topic extraction helps answer the question of "WHAT" is being talked about, for example regarding the institution or its competitors. Usually, topics are represented as word clouds, but they can also be visualized with more complex graphical representations (in LDAvis, the intertopic distance map is visualized with PCA). Lee and So Young (2017) [66] proposed a framework to identify the rise and fall of emerging topics in the financial industry using abstracts of financial business model patents, in order to discover topics from documents, aiming to enable understanding of the changing trends of financial business models over time.

Social Network Analysis
Social network analysis (SNA) is a process based on graph theory and used for a better understanding of social structures. In SNA, structure refers to nodes and edges. For example, in the case of Twitter, each node would be one Twitter user, and each edge a relationship between two users (one user is connected to another through a retweet). Table 7 presents selected papers dealing with the "social network analysis" technique.
Table 7. Selected papers dealing with the "social network analysis" technique.

Authors | Research
Ediger et al. (2010) [68] | Metrics for social network analysis: centrality measures, node degrees (used to find users who are highly connected), closeness (goal is to find users who can spread information to others), clustering coefficient, PageRank
L'Huillier et al. (2011) [69] | Integration of social network analysis and topic detection
Mao, Jin, and Zhu (2015) [70] | Social network analysis to explore the way that bank customers impact each other

The usual metrics calculated with SNA techniques are (Ediger et al., 2010) [68]: centrality measures, node degrees (used to find users who are highly connected), closeness (the goal is to find users who can spread information to others), clustering coefficient, and PageRank. Social network analysis is a different type of analysis in comparison to text analysis, but it is included here to show how text analysis and its results can be integrated with it (L'Huillier et al., 2011) [69]. For example, when identifying users who can easily spread a message to a network of interest (with SNA techniques), textual information from the followers of those users can be used to discover common interests. This information can be used in marketing campaigns to generate the best keywords. Mao, Jin, and Zhu (2015) [70] used SNA to explore the way that bank customers impact each other in order to detect the most influential customers.
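Two of the listed metrics, node degree and closeness, can be computed on a small retweet graph with a breadth-first search. The sketch treats retweets as undirected edges and uses invented usernames; both are simplifying assumptions.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (user, user) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def degree(graph, node):
    """Number of direct connections of a user."""
    return len(graph[node])


def closeness(graph, node):
    """Reachable nodes divided by the sum of shortest-path distances."""
    distances = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    total = sum(distances.values())
    return (len(distances) - 1) / total if total else 0.0


# Hypothetical retweet relationships.
retweets = [("ana", "bank"), ("ivan", "bank"), ("marko", "bank"), ("marko", "ivan")]
graph = build_graph(retweets)
print(degree(graph, "bank"), closeness(graph, "bank"), closeness(graph, "ana"))
```

A user with high closeness reaches the rest of the network in few steps, which matches the stated goal of finding users who can spread information to others.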

Data Sources Used and Typical Applications of Text Mining in Finance
RQ3: Which data sources are the most often used for text mining in the financial sector, and for which purposes?
Financial and banking institutions, being in a competitive environment, seek new ways to reach customers. The presented papers indicate that text mining represents a hidden door for discovering information in a pile of unstructured data collected from various sources. Text analytics aims to discover key points, such as "who," "where," "when," "why," and "how," which could lead to new decisions. Some examples of use include customer analytics, possibly derived from data acquired from social media and informal conversations, aiming to detect customers, enhance their engagement, or offer specific services, or to develop new business models based on the detection of a preferred way of communication. Annual reports, e-mails, external data coming from sensors, transactions, or free-form text can be used to enhance services or for risk and fraud detection. The findings are summarized in relation to the type of source used for text mining. As discussed, financial institutions used two primary data sources for text mining: external and internal data sources.
Table 8 presents the most important text mining techniques according to the type of data source: internal or external. In addition, example sources are outlined together with example applications. It can be noted that in most of the studies examined, the authors used external data. The reasons for this can be twofold. First, authors would prefer to use external data since they are public and free to use, while internal data are owned by the company, and numerous restrictions can apply to using them as a data source for text mining. For example, financial data used for fraud detection are very sensitive, and financial organizations are reluctant to share that information. Papers dealing with fraud detection are usually based on small datasets of just a few hundred samples, which is not enough for text mining to extract information that is useful in real-world situations. Still, these small datasets provide a glimpse into the world of financial fraud and can help derive ways of using text mining. Second, the financial sector may be more prone to use external data, and the use of internal data for various purposes is still rare in practice.
In most cases, authors use news, social media feeds, patents, and financial statements as the external sources for text mining analysis. Internal sources, specifically legal documents, are used for only one of the text mining techniques.
Fraud detection has become a significant concern for financial organizations. Several text mining approaches have been developed, mostly for large amounts of financial statements. Fraudulent activity can take many forms, such as money laundering, insurance fraud, piracy (software), identity theft, embezzlement, and so on. Usually, fraud detection is conducted in the financial sector using quantitative data. For example, different important features can be found in those text files that can help fraud detection (Chye Koh and Kee Low, 2004, p. 463) [71] like "quick assets to current liabilities, market value of equity to total assets, total liabilities to total assets, interest payments to earnings before interest and tax, net income to total assets, and retained earnings to total assets". Challenges of fraud detection in the financial industry are: typical classification problems like feature selection, model optimization, and problem domain; the imbalance between fraud types and the detection methods studied; privacy issues; computational issues like the computational performance of models in real-time systems; and fraudsters' new and innovative ways of committing fraud that are yet to be studied. Loughran and McDonald (2010) [34] focused on fraud and unexpected earnings, and Humpherys et al. (2011, p. 585) [28] conducted the "identification of fraudulent financial statements using linguistic credibility analysis". Glancy et al. (2011) [72] present the process of detecting fraud using management statements. Financial statements and their text are, like any textual data, unstructured, and the goal of text mining is to give structure to that data set in order to extract information and knowledge. When it comes to fraud detection, the first and one of the most important tasks is to create a larger dataset for training, if possible. This dataset needs to contain both fraudulent and non-fraudulent statements from various organizations of different sizes. The quality of this initial step affects every other step in the text mining process. The next step is to clean the textual data and perform pre-processing steps similar to those in the sentiment analysis case. Standardization of text and creation of structure are key in this step. After that, text mining can be used to extract characteristics that can help detect fraudulent behavior. Tree algorithms and SVM are popular for this detection. After model evaluation and model selection, the model needs to be implemented in the system, or "into the wild," to detect new fraudulent statements. Fraud detection is among the most difficult text mining tasks (some fraud types, such as credit card transaction fraud, have a higher success rate). Systemic Functional Linguistics theory can also be used for fraud detection (Dong et al., 2016) [73]. This approach is about feature creation and text classification. The authors proposed a feature set that, as they stated, can be used to achieve above-baseline accuracy. Their examples of use include presenting new information to investors so they can make better decisions, to auditors so they can recognize fraud risk, and to regulators so they can investigate only suspicious behavior and firms. Some of the features generated under this approach are the ratio of positive and negative words to the total number of words, LDA topics, the total number of first-person singular pronouns, the ratio of the number of words to the number of sentences, TF-IDF weights, and others. All features can be divided into the following categories: Ideational (topics, opinions, emotions), Interpersonal (modality, personal pronoun), and Textual (writing style, genre).
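A few of the listed features (polarity word ratios, the count of first-person singular pronouns, and words per sentence) can be extracted as sketched below. The word lists are hypothetical, and the exact feature definitions are an assumption; they are not taken from Dong et al. (2016).

```python
import re

# Hypothetical word lists for illustration only.
POSITIVE = {"growth", "profit", "strong"}
NEGATIVE = {"loss", "decline", "weak"}
FIRST_PERSON_SINGULAR = {"i", "me", "my", "mine", "myself"}


def linguistic_features(text):
    """Compute simple text features for a non-empty statement."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        raise ValueError("text contains no words")
    n_words = len(words)
    return {
        "positive_ratio": sum(w in POSITIVE for w in words) / n_words,
        "negative_ratio": sum(w in NEGATIVE for w in words) / n_words,
        "first_person_singular": sum(w in FIRST_PERSON_SINGULAR for w in words),
        "words_per_sentence": n_words / len(sentences),
    }


features = linguistic_features("I expect strong growth. My team reported a loss.")
print(features)
```

A feature dictionary of this kind, computed per statement, could then be passed to a classifier such as a tree algorithm or SVM, as described above.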
Customer relationship management has traditionally been based on internal databases of customers (Zekić Sušac et al., 2015) [74], and various data mining approaches have been used in order to improve it (Furner et al., 2012) [75]. Named entity recognition has recently become a rich source of information relevant for customer relationship management, since it allows financial institutions to extract client names, bank account numbers, and IBANs from their internal databases and link them to external sources, such as social media. There are dictionaries with predefined named entities that every organization can use for a quick start and quick results. For better results, more complex solutions are needed. Since tweets and comments from social media and websites usually lack context and are noisy, there are more complex solutions, like the supervised approach for named entity recognition (Ritter et al., 2011) [50]. Gender recognition is also relevant to customer relationship management. While Charness and Gneezy (2012) [76] investigated gender differences in risk-taking in different countries, Lotto (2018) [54] used various determinants, among them gender prediction, to predict financial inclusion, and compared it with traditional banking services. Therefore, gender recognition of customers using text mining can be of high significance. For example, Phuong and Phuong (2014) [52] performed research on predicting users' gender based on browsing history, which is important for marketing and personalization. Kucukyilmaz et al. (2006) [53] performed an analysis of gender prediction in computer-mediated communication/chatbots.
Stock price prediction aims to determine the future value of an organization and its stocks. This information can bring more profit to the information owner, hence the great interest in these analyses. The hypothesis is that company stock prices can be predicted, and this is the part where it gets complex. Similar research has been demonstrated using news and macroeconomic indicators (Elshendy et al., 2017) [77]. It is not just about currently available data, like the history of stock price changes, but also about textual documents, which can bring new insights into these hybrid approaches. Earlier methods in this field did not yield impressive results, but changes have since been made and model accuracies have risen. Schumaker and Chen (2009) [6] focused on text mining of financial news articles using word and phrase representations for the purpose of stock market prediction, while Tetlock (2007) [33], Bollen et al. (2011) [35], and Hagenau et al. (2013) [37] focused on the relationship between the news, blogs, and social media and the stock market. Combining time series data for stocks and their prices with information gathered from text mining is a key part. This is one of the popular examples of text mining in the financial industry. Time series datasets contain data about a stock event over time, and they lack context, a gap that text mining techniques try to fill. Textual information enriches the base time series dataset by extracting news articles related to the stocks of interest, and thanks to big data technologies, this complex task can be done in real time. Textual data carry rich information, and the hypothesis is that a company's report or breaking news can affect the stock price. Forums that cover special topics from the financial world and where financial experts meet can be a good source of textual data. Sentiment or topic models from previous tasks can be integrated together with time series in a hybrid model. The combination of time series and textual data shows improvements in net profit in comparison with using just one of those parts alone (Zhai et al., 2007) [78].
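The enrichment of a price time series with text-derived information can be sketched as a join on the date. The prices and sentiment scores below are invented, and using 0.0 for days without news is an assumption; the sketch only builds the combined feature rows and does not fit any prediction model.

```python
def build_hybrid_rows(prices, daily_sentiment):
    """Combine daily returns with a text-derived sentiment score.

    prices: list of (date, close) tuples sorted by date.
    daily_sentiment: dict mapping date -> aggregated sentiment score.
    """
    rows = []
    for (_, prev_close), (date, close) in zip(prices, prices[1:]):
        rows.append({
            "date": date,
            "return": (close - prev_close) / prev_close,
            # Days without any news get a neutral score (assumption).
            "sentiment": daily_sentiment.get(date, 0.0),
        })
    return rows


# Invented example data.
prices = [("2020-01-01", 100.0), ("2020-01-02", 110.0), ("2020-01-03", 99.0)]
sentiment = {"2020-01-02": 0.5}
rows = build_hybrid_rows(prices, sentiment)
print(rows)
```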
Summarization of textual documents is of great importance for the business of any financial institution. Understanding of textual documents, as well as easy searching of those documents, can be achieved with summarization. Text summarization techniques summarize legal documents in four structures: intro, context, juridical analysis, and conclusion (Farzindar and Lapalme, 2004) [79]. To achieve this, pre-processing steps are used to split the text into chunks (sentences or tokens), which are then annotated in a POS tagging step (structures such as the intro, conclusion, and the others previously mentioned are detected here). The final steps are filtering (removal of unnecessary steps) and selection (high-score units are found), where text mining is done in the last step. Text summarization has two methods, extractive and abstractive: the extractive method uses tokens from the original documents to create the summary, while abstractive methods generate completely new tokens to better capture the meaning of the original document. For the purpose of text summarization, clustering (Wagh, 2013) [80] can be used for a better search. Pre-processing steps, such as stemming, are also needed, and clustering algorithms are used for grouping keywords, phrases, or documents into homogeneous groups.
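The extractive method, in which high-scoring units are selected from the original text, can be sketched with a simple word-frequency score per sentence. This is a generic illustration, not the method of Farzindar and Lapalme (2004); the scoring rule and the example text are assumptions, and no stop-word removal is applied.

```python
import re
from collections import Counter


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        # Average corpus frequency of the words in the sentence.
        words = re.findall(r"[a-z]+", sentence.lower())
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]


text = (
    "The court reviewed the loan contract. "
    "The weather was mild. "
    "The loan contract was void."
)
print(summarize(text, n_sentences=2))
```

An abstractive method would instead generate new wording, which requires a trained language generation model and is outside the scope of this sketch.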

Conclusions
By reviewing 123 papers, this paper aims to provide answers to the three research questions, and for that purpose, a qualitative analysis of literature has been conducted using a systematic literature review, citation, and co-citation investigation.
The first research question was answered using the bibliometric analysis. The most important studies with the highest number of citations in the field have been identified, and a brief overview of the themes is given. In addition, papers that are the source of the field have been presented prior to the critical connection with recent studies identified. Based on this, the paper contributed to the existing literature through an overview of the most significant studies published in the Web of Science databases. Research trends have been identified as well. After reviewing the papers, it is possible to conclude that the research focus is on stock price prediction, financial fraud detection, and market forecast utilizing online text mining. The research results reveal that the current research trends of text mining are related to the need to analyze large amounts of data on websites and pages on social media, and to identify and test various text-mining techniques.
The second research question was answered by providing the analysis of techniques for text mining in the financial sector. Analysis of big amounts of data represents the transition to analytic-driven business, conducted by big companies, small enterprises, or research teams, in order to identify significant information and transform it into new knowledge. Text analytics or text mining of big data, conducted by various techniques (keyword extraction, named entity recognition, gender prediction, sentiment analysis, topic extraction, and social network analysis), has moved from research centers to real-world institutions, such as financial and banking institutions.
The third research question was answered by the analysis of data sources used for text mining techniques. Results revealed that most of the research focuses on external data sources, such as news and online media posts, for the purpose of stock market prediction and fraud detection. The number of research studies using internal data sources is low. Therefore, the utilization of internal data sources will be a rich source of future research with both theoretical and practical contributions. Research using internal text sources, such as emails, corporate wikis, financial statements, and project reports, could be useful for various purposes, such as human resource management, internal audit, and customer relationship management. In addition, various multimedia files could also be a high-value additional component of text mining analysis (Pouli et al., 2015 [81]; Stai et al., 2018 [82]; Ma et al., 2011 [83]).
The main limitation of our work is the usage of bibliometric approaches to the literature analysis, which has certain limitations. By selecting the database for studies search (Web of Knowledge), specific studies remain invisible to this analysis (Batistič et al., 2017 [12]).
Research results also generate several paths for future research directions. First, a more up-to-date outlook on the usage of text mining in finance could be attained with the use of so-called "grey" literature sources, such as case studies, corporate reports, and text-mining software projects (Adams et al., 2017 [84]). Second, the usage of text mining in finance should be reviewed according to the different decisions that are made based on its results (e.g., tactical, operational, and strategic decisions). A taxonomy of various decisions based on text mining in finance could be developed in order to support decision making in a more effective manner, following the work of Gray et al. (2014) [18]. Third, characteristics of organizations that have implemented text mining in their business processes should be investigated, with the goal of identifying best-practice approaches, but also obstacles that stand in the way of the successful implementation of text mining in finance. Finally, a more in-depth analysis of data sources used for text mining in finance should be conducted, focusing more on internal documents as the domain of the analysis.

Figure 1. Research steps used in this study.


Data were extracted from primary studies by manual analysis and extraction of the most used text-mining techniques (Phase 6 and Phase 7). Bibexcel and Pajek software are employed to identify important papers and conduct co-citation analysis of papers indexed in Web of Science (Batistič, Černe and Vogel, 2017) [12], for the purpose of conducting Phase 8 and Phase 9. Citation and co-citation analysis were used in order to detect the most relevant work in the field (Phase 10 and Phase 11). The most cited papers were discussed in relation to the identified text mining techniques in the financial sector (Phase 12).

Figure 2 presents the PRISMA flow diagram, outlining the process of development of the final list of papers included in the analysis, following the practice used in numerous systematic literature research, such as Saura et al. (

Figure 2. Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram.

Figure 3. Total number of citations in the field on the Web of Science (Social Science Citation Index (SSCI), Science Citation Index Expanded (SCI-EXPANDED), and Emerging Sources Citation Index (ESCI)).

Figure 4. Co-citation network presenting historical evolution of the field.

Table 2. Selected papers dealing with the "keyword extraction" technique in the financial sector.

Table 8. The most often used data sources for text mining in the financial sector.