Detecting Weak Signals of the Future: A System Implementation Based on Text Mining and Natural Language Processing

Abstract: Organizations, companies and start-ups need to cope with constant changes on the market which are difficult to predict. Therefore, the development of new systems to detect significant future changes is vital to make correct decisions in an organization and to discover new opportunities. A system based on business intelligence techniques is proposed to detect weak signals, which are related to far-reaching future changes. While most known solutions are based on the use of structured data, the proposed system quantitatively detects these signals using heterogeneous and unstructured information from scientific, journalistic and social sources, applying text mining to analyze the documents and natural language processing to extract accurate results. The main contributions are that the system has been designed for any field, using different input datasets of documents, and with an automatic classification of categories for the detected keywords. In this research paper, results on the future of remote sensors are presented. Remote sensing services are providing new applications for the remote observation and analysis of information. This market is projected to witness significant growth due to the increasing demand for services in the commercial and defense industries. The system has obtained promising results, evaluated with two different methodologies, to help experts in the decision-making process and to discover new trends and opportunities.


Introduction
One of the biggest threats to academia, governments, entrepreneurs and companies is the continuous change in expanding markets. In fact, companies very often have difficulties predicting these variations and acting on time [1].
Markets in general have proven to be unpredictable environments where making the right decisions at the right time is very hard, but undoubtedly, doing so translates into better results for an organization.
In addition, social entrepreneurs contribute to creating social wealth by addressing social problems and enriching communities and societies. For this reason, an early discovery of a new problem or opportunity in the market could become vital for the success of the entrepreneurial venture [2].

Background and Related Work
The available studies about weak signals use specific types of sources which are scanned regularly. Most of them are qualitative or unsystematic analyses that are based on a specific model for specific environments. Examples include the identification of future signals related to terrorism or mass transport attacks [22], master planning London's Olympic legacy [23], sensor-based human activity recognition [24], mechanical fault prediction by sensing vibration [25], and a deep learning analysis to predict parking space availability [26]. Other examples are the influence of maximization in social networks [27], a model to examine what schools would be like if they were a new invention [28], or a deep analysis about charge prediction for criminal cases [29]. In general, these models cannot be used to scan other environments.
Some studies use the opinion of experts and stakeholders as an input for the detection of weak signals. Among those inputs are the feelings of known critics or the behavior of customers [30].
Other works emphasize the use of structured data that is either owned by an organization or accessible via the web, such as, for example, a methodology to detect weak signals for strategic forecasting [31]. In this case, weak signals are extracted from the internal repositories of institutions or from available online sources of text. Another example identifies signals for a better understanding of the environment by using a range of storylines with negative and positive valences [32].
New studies propose the use of text mining techniques on online sources, such as web news to predict the future of solar panels [33] or documents from the Journal of the Korean Ceramic Society to scan the field of nanotechnology [34]. These are also some of the first research proposals that carry out a quantitative analysis, but both use a single type of data. This quantification is focused on measuring the importance of the visibility of events.
In addition, one available study uses keyword network analysis, betweenness centrality for convergence measurement and minimum spanning trees (MST), providing a new way to detect future trends by analyzing the changes in a network of words [35]. Another study applies text-mining techniques to design multidisciplinary group discussions and generate energy policies and technologies (EP + Ts) for the future society in South Korea, from a multidisciplinary perspective [36]. Regarding the triad of the future sign in the Hiltunen model, these studies focus on only one data source, ignoring the interpretation dimension.
Another available study creates a model to evaluate if free text expressed in natural language could serve for the prediction of action selection, in an economic context, modeled as a game [37]. In this study, a classifier is implemented in a model that considers personality attributes, but can only be used in a specific topic, and only uses pre-determined keywords.
The theoretical foundations and methodologies to detect weak signals are developing research areas, and there is considerable room for improvement. In applications beyond the identification of future signals, text mining already includes natural language processing tools [8]. As the extraction of multiword expressions currently provides good performance in other fields, better results could also be expected in the field of future signals.
The process for detecting weak signals is complex, because there is a multitude of sources where they could be masked. Any system implemented for this task must therefore be highly optimized, in both hardware and software.
This study describes a system to help experts in the detection of weak signals, by a quantitative analysis of multiple sources applied to the field of remote sensing, considering every word as a keyword, depending only on different types of text documents (unstructured data) used as a dataset. The main contributions are that the system has been designed for any field, using different input datasets of documents, and with an automatic classification of categories for the detected keywords. Table 1 shows a comparison between the main features of the available systems from previous studies and the implemented system proposed in this paper.

Table 1. Main features of the available systems compared with the implemented system.

Available Systems | Implemented System
Mainly qualitative analysis | Quantified analysis
Specific model for a specific topic | Model only dependent on the input dataset
Pre-determined keywords | All words and multi-word expressions are keywords
One single data source and/or expert opinion | Three different types of data sources
Mainly structured data sources | Unstructured data sources (documents and NLP)

Sustainability 2020, 12, 7848

This article is structured in the following sections: Section 2 explains, in detail, the design and implementation of the proposed weak signal detection system. In Section 3, the experimental setup to test the system in the field of remote sensing is defined. In Section 4, the results obtained in the experiment for the remote sensing sector are presented. In Section 5, the main findings and limitations of the system are analyzed. Finally, Section 6 synthesizes the conclusions and the lines of future work.

Description of the Proposed System
In this section, a description of every step of the implemented system for the detection of weak signals is carried out. This system has been designed to detect the three dimensions described by the Hiltunen model.
In order to monitor a specific sector, organizations use data repositories that have been created internally and are already available [38,39]. However, for this study, the first step has been the creation of a dataset of information from several online sources. One of the main suppositions of the implemented system is that, in order to adapt it to a different sector, it is only necessary to generate a new dataset with information from that field. The data sources used to create the dataset have been scientific articles, newspaper articles and social media posts.
The collected documents are analyzed, and information is extracted from each one of the words from them. Next, categories are assigned to each one of those words, and text mining techniques are used to detect candidates for weak signals. In the last stage, natural language processing techniques are used to discard false positives and to obtain better results.
The seven stages of the implemented system can be seen in Figure 1 and are explained in the following subsections: (i) Definition of the data input, (ii) Creation of an input dataset, (iii) Extract-Transform-Load (ETL), (iv) Category assignment, (v) Text mining and clustering, (vi) Natural language processing (NLP), and (vii) Interpretation, evaluation and decision making.
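As a structural sketch only, the seven stages above can be expressed as a chain of functions. All function names, data shapes and the toy rules inside each stage are hypothetical stand-ins, not taken from the paper:

```python
# Illustrative skeleton of the seven-stage pipeline; every function body
# is a toy stand-in for the real stage described in the text.

def define_data_input():
    # Stage (i): choose the types of sources to monitor.
    return ["scientific", "newspaper", "social"]

def create_input_dataset(sources):
    # Stage (ii): collect documents from each source (toy data here).
    return [{"text": "remote sensing advances", "source": s, "year": 2017}
            for s in sources]

def etl(documents):
    # Stage (iii): extract every word with its source and year.
    rows = []
    for doc in documents:
        for word in doc["text"].split():
            rows.append({"word": word, "source": doc["source"], "year": doc["year"]})
    return rows

def assign_categories(rows):
    # Stage (iv): attach a (placeholder) category to each keyword.
    for row in rows:
        row["category"] = "uncategorized"
    return rows

def text_mining(rows):
    # Stage (v): count occurrences per keyword (stand-in for the DoV/DoD maps).
    counts = {}
    for row in rows:
        counts[row["word"]] = counts.get(row["word"], 0) + 1
    return counts

def nlp_filter(counts):
    # Stage (vi): discard false positives (here: words seen only once).
    return {w: c for w, c in counts.items() if c > 1}

def run_pipeline():
    # Stage (vii), interpretation and decision making, is left to the expert.
    sources = define_data_input()
    docs = create_input_dataset(sources)
    rows = assign_categories(etl(docs))
    return nlp_filter(text_mining(rows))
```

Each stage consumes the previous stage's output, so a new sector only requires swapping the dataset built in stages (i)–(ii), mirroring the adaptability claim above.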


Stage 1: Definition of the Input Data Sources
The signal and issue dimensions depend on the visibility of an event, but the dimension of interpretation depends on the original source of that event. Therefore, to consider the dimension of interpretation, input data should come from documents from different types of sources.
To determine the sources that will be considered in the study, a better understanding of the process of diffusion of a novelty and its potential applications for society is required. Every issue is coded through primary signals called exosignals, which are later interpreted by one or several actors, placing it within a frame of reference to transform it into an interpreter's endosignal, which will be an input signal that will be interpreted again by other actors [14].
A novelty normally originates as a discovery by a team of engineers and scientists conducting research. They then publish their results in a scientific journal. This source is read by a journalist, who interprets the information and publishes an article in a newspaper or a web blog. Finally, other users read this post and share the information on social networks, along with their comments, impressions and sentiments. Figure 2 shows this process.

In order to carry out a detailed monitoring of a sector, three different sources of data have been selected: scientific articles extracted from Science Direct, newspaper articles extracted from the New York Times and posts from the social network Twitter.
The starting point is to create a dataset of scientific documents related to the sector under study. To do this, a search of the selected field must be performed in an academic repository. The source selected is Science Direct, a prestigious and widely used website, which provides access to a large database of scientific and medical research. It hosts over 16 million peer-reviewed articles from 4279 academic journals, and covers all study areas: Physical Sciences and Engineering, Life Sciences, Health Sciences, and Social Sciences and Humanities. The next step is to detect newspaper articles related to the sector. For this task, The New York Times has been selected to retrieve all articles that mention the sector. The New York Times is an American newspaper with worldwide influence and readership. The newspaper includes an important Science section, in which journalists report about science to the public.
Finally, Twitter has been used as the source for social network posts, because of its flexibility and the numerous recent studies that have incorporated it as a source of knowledge [40][41][42]. In this study, all Twitter posts sharing one of the detected scientific or newspaper articles have been considered, as well as all the responses and comments to these tweets.
This procedure is thus independent of the field under study, as different datasets of documents can be created for different sectors. Therefore, the system is designed to be applicable to any sector by simply generating new input datasets.

Stage 2: Creation of an Input Dataset
Once the data input is selected, relevant information from Science Direct and The New York Times is collected using an algorithm programmed in Python (v3.2.2, Python Software Foundation, Beaverton, OR, USA). This algorithm downloads data from the respective websites using Beautiful Soup (v3.2.2, open source), a library for parsing HTML (HyperText Markup Language) data.
The algorithm performs a search of a selected field or sector in Science Direct, and extracts the following information for every peer-reviewed paper identified: title, author, summary, keywords, content, conclusions, year of publication and name of the journal. In addition to thousands of journals, Science Direct also contains more than 30,000 books, but the algorithm focuses only on peer-reviewed articles from indexed journals.
The algorithm also performs a search of the same selected field or sector in the New York Times and extracts the following information for every newspaper article: headline, author, section, lead paragraph, content, and year of publication.
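The paper uses Beautiful Soup for this extraction step; as a dependency-free illustration of the same idea, the standard library's HTMLParser can pull equivalent fields out of an article page. The tag names and page structure below are invented for the example, not the real Science Direct or New York Times markup:

```python
from html.parser import HTMLParser

class ArticleFieldParser(HTMLParser):
    """Collects the text of a few tags of interest, e.g. <h1> for the
    title and <p class="author"> for the author (hypothetical markup)."""
    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._current = "title"
        elif tag == "p" and attrs.get("class") == "author":
            self._current = "author"

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None

page = '<html><h1>Advances in Remote Sensing</h1><p class="author">J. Doe</p></html>'
parser = ArticleFieldParser()
parser.feed(page)
```

After `feed`, `parser.fields` holds the extracted title and author; the real system would store these fields, per article, in the input dataset.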
After that, the Twitter official API (Twitter Application Programming Interface, v1.1, Twitter, San Francisco, CA, USA) is used to extract every tweet that mentions any of the detected scientific and newspaper articles, together with all comments and retweets. The content of the tweet and the year of publication are stored in the database.

There are multiple options for the implementation of the database for the input dataset. NoSQL (Not Only SQL) technologies have emerged to overcome the limitations of the available relational databases [43]. These databases are more oriented to texts and are not modeled in terms of the tabular relations used in relational databases. In addition, as the dataset is composed of documents of different kinds, document-based databases are generally used [44]. The main operations performed in the database are the insertion and selection of items. As these operations are carried out thousands of times, their execution time needs to be as short as possible [45]. For this reason, MongoDB (v4.0.8, open source) [46] has been chosen, as it is a very efficient database technology oriented to storing documents and performing text mining operations. Appendix A includes some computational execution times of the system.

Stage 3: Extract, Transform and Load (ETL)
The next step is the design of a data warehouse to store a large volume of information [47]. Each of the words that make up the documents collected in the previous stage is extracted and treated as a keyword, storing the following information items in the warehouse: source document, number of occurrences, year of publication and source. The only words that are not taken into account are stopwords, words without a specific meaning such as articles, pronouns and prepositions. They are filtered before the natural language processing, using the Natural Language Toolkit (NLTK, v3.4.5, open source), a leading NLP platform in Python.
Along with the elimination of stopwords, a stemming phase is performed to remove the suffix of each word and obtain its root. Snowball, a small string processing language designed for creating stemming algorithms, is used to implement Porter's algorithm [48]. In this way, the number of insertions in the data warehouse is reduced. The word that represents each word group is the one that appears most frequently in all the documents of the dataset.

Stage 4: Category Assignment
In this stage, a group of categories is assigned to every keyword. Two category assignment processes have been carried out. The first one is the assignment of representative layers. The second one is the automatic assignment of categories based on the topics of the documents where the keyword appears.
In the first process, keywords were classified into different layers: "environmental and sustainability factors", "business needs" and "technological components". For example, keywords such as "war" or "oil" were classified as environmental factors, "portable" or "cheap" as business needs, and "batteries" or "hybrid" as product/technological components [33].
The second process is to automatically designate several categories for every word. These categories are automatically assigned, considering keywords, topics, and Special Issues of the documents of the scientific journals, where the word detected as a weak signal is present.
Although some studies use standard category lists [49], one of the main advantages of assigning categories, automatically and dynamically, is that they only depend on the input dataset, therefore being the most relevant ones for the field of study.

Stage 5: Text Mining
Weak signals generally carry "information on potential change of a system towards an unknown direction" [50].
However, they are very difficult to detect, because they can be mistaken for noise: the trends of these terms are often imperceptible to experts, since they do not seem to follow any pattern. If these signals exceed a threshold, they become strong signals, known to a vast majority.
In conclusion, weak signals currently exist as small and seemingly insignificant issues that can tell experts and organizations about changes in the future. A wild card is defined as a surprising event that will have significant consequences in the future [14]. When this event is produced, the weak signal becomes a strong one. Figure 3 shows the process of weak signals becoming strong ones. The goal of this study is to detect weak signals as early as possible, to extend the time to react.
The main difference between text mining and data mining is that the former extracts information from textual documents instead of structured sources [51]. The detection of these wild cards is similar to the use of data mining techniques to extract patterns in time series [52]. The main problem is the proper detection of a change point, that is, the identification of the time points where the behavior change is produced. Although all methods lose precision as the signal-to-noise ratio decreases, the Batch algorithm offers better results when the dataset precedes the analysis, as in the case under study.
As previously described, the three components of the semiotic model need to be identified: signal, issue and interpretation.
The signal dimension is related to the absolute number of appearances of every word [14]. To measure this dimension, the degree of visibility (DoV) is established. First, a value is set by the ratio of the number of appearances and the total number of documents. Then, a factor is introduced to give more importance to the most recent appearances, giving a different weight for every period of a year. In this case study, a dataset of more than fifty thousand documents has been divided into eleven periods of one year (from 2007 to 2017). To carefully set the multiplying factor, a group of business experts has been consulted, defining tw as a time weight of 0.05. The description of the DoV of a word i in a period j is shown in Equation (1).
TFij is the number of appearances of the word i in period j, NNj is the total number of documents in the period j, while n is the number of periods and tw is a time weight.
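Equation (1) itself did not survive extraction. A reconstruction consistent with the definitions above, an occurrence ratio scaled by a time weight that favors recent periods, is given below; the exact form of the weighting factor is an assumption, following the DoV formulation common in the weak-signal literature:

```latex
\mathrm{DoV}_{ij} \;=\; \left(\frac{TF_{ij}}{NN_{j}}\right)\,\bigl(1 - tw \times (n - j)\bigr) \qquad (1)
```

With tw = 0.05, a word's appearances in the most recent period (j = n) keep their full weight, while each earlier period is discounted by 5% per year.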
The issue dimension is related to the total number of documents where the keyword appears [14]. To measure this dimension, the degree of diffusion (DoD) is established. As in the previous equation, the first step is to set the ratio of the number of documents which contain the keyword and the total number of documents. Then, a factor is introduced to give more importance to the most recent appearances, giving a different weight to every period of a year. This multiplying factor is set following the same considerations as in the previous case. The description of the DoD of a word i in a period j is shown in Equation (2).
DF ij is the number of texts where the word can be found. Future signals that are candidates for weak signals have a low absolute number of occurrences, but a high fluctuation (a high geometric mean of DoD/DoV but a low number of occurrences).
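A minimal sketch of these computations, under the same caveat that the exact time-weighting factor is an assumed reconstruction (the text only states that tw = 0.05 weights recent periods more heavily). The same function serves for DoV (fed with term frequencies TF) and DoD (fed with document frequencies DF):

```python
import math

# Sketch of the DoV/DoD computation. The weighting factor 1 - tw * (n - j)
# is an assumed reconstruction; tw = 0.05 as stated in the text.

def degree(counts, totals, tw=0.05):
    """counts[j]: TF (for DoV) or DF (for DoD) of a word in period j
    (0-based); totals[j]: total number of documents NN in period j."""
    n = len(counts)
    return [
        (counts[j] / totals[j]) * (1 - tw * (n - (j + 1)))
        for j in range(n)
    ]

def geometric_mean(values):
    # Used to rank candidates by the fluctuation of their DoD/DoV series.
    return math.exp(sum(math.log(v) for v in values) / len(values))
```

Candidates for weak signals would then be the words whose geometric mean of the DoD/DoV increase rate is high while their absolute occurrence count stays low.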
With the calculation of the increase ratios for every word in every document from the input dataset, two graph maps can be generated: the "Keyword Issue Map (KIM)", a map of DoD with the absolute number of appearances of every word; and the "Keyword Emergence Map (KEM)", a map of DoV with the number of texts, where every word can be found.
The structure of these two maps is shown in Figure 4. Above a threshold on the time-weighted increasing rate axis, two clusters can be identified: the "Strong Signals" area is above an average frequency threshold, and the "Weak Signals" area is below. Below a line on the Y axis, words are identified as noise. This last cluster consists of terms that should be discarded because their appearances do not increase through time or, when studied, do not follow any pattern at all.

The third dimension is interpretation, and this component is related to how the type of document influences the transmission of the signals. The importance of scientific journals has been selected to measure this dimension, because the first exosignals generated are usually the result of research published in this type of document.
All available bibliometric indexes for scientific journals have their advantages and limitations, but in general, studies show a high correlation between them, especially at the top of their rankings [53]. There are several journal indexes, such as the Journal Impact Factor (JIF), SCImago Journal Rank or SJR indicator, Eigenfactor Metrics, Scopus h-index or Google h-index. Documents published in journals with a high factor are more influential and, therefore, could accelerate the transformation of weak signals into strong ones.
Although the Impact Factor is widely used, it does not consider the source of the citations, and journals that publish a large number of articles per year have less potential for a high Impact Factor. Clarivate Analytics has added the Emerging Sources Citation Index to extend the scope of publications in the Web of Science, but this index only classifies scientific papers from 2015 onwards, and this study uses a dataset of documents from 2005. The SCImago Journal Rank indicator is based on the SCOPUS database, which is larger than that of the Journal Impact Factor, and places more emphasis on the value of publishing in top-rated journals than on the number of citations of a publication. In addition, SCImago has taken its database and calculated the h-index of every journal, providing a relevant value within each field of research.
The degree of transmission (DoT) is measured considering all h-index values of the journals of all texts where the word i can be found, as shown in Equation (3). This index has been selected for several reasons: (i) it is based on a larger database than the Journal Impact Factor; (ii) it is freely available, which makes its content easier to access, helping the transmission of new changes; and (iii) unlike citations-per-article measures such as the Impact Factor, it is not skewed by a small number of individual, highly cited articles.
The interpretation values are graphically expressed with different dot sizes (each dot represents a keyword) in both the KEM and KIM maps. For the final consideration of terms related to weak signals, every DoD and DoV value is multiplied by its DoT. This way, scientific journals have a higher weight in the detection of weak signals than the other sources.
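A minimal sketch of this weighting step is shown below; aggregating the journal h-index values with a plain mean is a simplifying assumption standing in for the exact form of Equation (3):

```python
def degree_of_transmission(h_indexes):
    """DoT of a word: aggregate the h-index values of all journals
    in which the word appears (a plain mean is assumed here)."""
    return sum(h_indexes) / len(h_indexes)

def weighted_signal(dod, dov, h_indexes):
    """Scale DoD and DoV by DoT so that words appearing in
    high-impact journals weigh more in the final ranking."""
    dot = degree_of_transmission(h_indexes)
    return dod * dot, dov * dot

# A word found in two journals with h-indexes 100 and 50:
print(weighted_signal(2.0, 3.0, [100, 50]))  # (150.0, 225.0)
```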

Stage 6: Natural Language Processing (NLP): Multi-Word Expressions
The words in both maps are possible terms related to weak signals; consequently, all keywords not detected in both maps are discarded. However, it is hardly ever possible to extract valuable information from just a single word, considering that a single word can have several meanings or, at least, be connected to several sub-issues.
Natural language processing is widely used in controlled environments [54], but in this study, it is used as an additional stage to improve the quality of the selected information. As the system only depends on the input dataset, NLP techniques are applied in uncontrolled environments, considering multi-word expressions.
Therefore, the next step is a multi-word expression analysis, a natural language processing technique that helps obtain more accurate results, as shown in previous studies [55,56]. The analysis is performed on the list of terms detected by the text mining analysis of the previous step. The study considers the words immediately preceding and following the identified term in every appearance, discarding all stopwords.
The result of the process is a network of expressions related to a keyword, ranked by their overall popularity.
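A minimal sketch of this analysis, assuming whitespace tokenization and a toy stopword list:

```python
from collections import Counter

STOPWORDS = {"the", "of", "in", "a", "and", "to", "is"}  # illustrative subset

def multiword_expressions(keyword, documents):
    """Count the non-stopword words immediately preceding and
    following every appearance of `keyword`, ranked by popularity."""
    neighbours = Counter()
    for doc in documents:
        tokens = [t for t in doc.lower().split() if t not in STOPWORDS]
        for i, token in enumerate(tokens):
            if token == keyword:
                if i > 0:
                    neighbours[(tokens[i - 1], keyword)] += 1
                if i + 1 < len(tokens):
                    neighbours[(keyword, tokens[i + 1])] += 1
    return neighbours.most_common()

docs = ["global desertification is a growing problem",
        "reversion of desertification in drylands"]
print(multiword_expressions("desertification", docs))
```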

Stage 7: Interpretation, Evaluation and Decision-Making
As a result of the whole system, experts and other stakeholders will have access to four outputs that will help them in the decision-making process:

1. A list of potential weak signals represented in the Keyword Issue Map, depending on their Degree of Diffusion and Degree of Transmission.
2. A list of potential weak signals represented in the Keyword Emergence Map, depending on their Degree of Visibility and Degree of Transmission.
3. A ranking of all the keywords present in both graphs, which are more likely to be connected to weak signals.
4. The results of the multi-word analysis, providing more accurate results to discard false signs.

Definition of the Experiment for Remote Sensing Sector
Once the parts of the proposed system have been described in the previous section, an experiment has been defined to test it. As previously stated, the system can be applied to any sector, because it only depends on the input dataset of documents. For this study, the chosen sector is remote sensing. The main reason is that the global market for services related to remote sensing is expanding rapidly and will reach US $7 billion by 2024, due to applications that require the exploitation of satellite image data by companies and governments, such as disaster prevention, weather forecasting or agriculture [57].
Remote sensing services are providing new applications in observation and analysis of information at remote locations by means of airborne vehicles, satellites, and on-ground equipment [58]. Remote sensing solutions have multifaceted social applications, such as the mapping of open water surfaces, soil moisture mapping, surface movements, data mosaic and satellite maps, geology and mineral resources, urban and rural development, disaster management support, and climate change studies, among others.
The remote sensing market is projected to witness significant growth due to the increasing demand for remote sensing services in our society. This growth is attributed to effective and flexible data-gathering from remote locations without being physically present there [59].
In conclusion, as there are currently many possibilities for social entrepreneurs and companies to work on remote sensing applications, the system has been applied to facilitate the prediction of the future impact of new technologies by the detection of weak signals.
To obtain the input data for the system, a search for the term "remote sensing" was performed in the Science Direct and New York Times sources between 2007 and 2017. As recent appearances have more relevance in the weak signal analysis, documents prior to 2007 were not considered because their contribution to the results is negligible.
Although Science Direct also contains electronic books, only peer-reviewed scientific articles were considered in the study. The Python algorithm downloaded the required information from both websites and stored it in databases, as described in the previous section. After that, all tweets sharing the detected documents from Science Direct and the New York Times were considered, together with all comments and responses to those tweets from other users of the social network.
As a result, more than 43,000 Science Direct scientific articles, 1800 New York Times newspaper articles and 59,000 tweets between 2007 and 2017 were extracted and divided into 11 yearly groups of documents for the analysis. Document distribution by type and year is shown in Figure 5.
Once the input dataset was created, the remaining stages of the system were performed. First, a data warehouse was created to store information about every word of every document: source document, number of occurrences and year of publication of the source. As previously stated, all stopwords were discarded and Porter stemming was carried out. After this, the steps of category assignation and text mining were performed. As a result, a list of keywords related to weak signals was obtained, together with both the Keyword Issue and Keyword Emergence Maps. The next step was to carry out the multi-word expression analysis with the words detected in the previous steps.
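The preprocessing at this stage can be sketched as follows; the suffix-stripping function is a heavily simplified stand-in for the full Porter stemmer, and the stopword set passed in is illustrative:

```python
def preprocess(text, stopwords):
    """Tokenize, drop stopwords, and apply a very simplified
    suffix-stripping step standing in for Porter stemming
    (the real system uses the full Porter algorithm)."""
    suffixes = ("ing", "ed", "es", "s")  # illustrative subset of rules
    tokens = [t for t in text.lower().split() if t not in stopwords]
    stems = []
    for t in tokens:
        for suf in suffixes:
            if t.endswith(suf) and len(t) - len(suf) >= 3:
                t = t[: -len(suf)]
                break
        stems.append(t)
    return stems

print(preprocess("remote sensing detects the changes", {"the"}))
# ['remote', 'sens', 'detect', 'chang']
```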

Definition of the Evaluation Methods
In a quantitative analysis, words related to weak signals are expected to show a low absolute number of occurrences but a high range of fluctuation [14]. The graphs of DoV and DoD of a weak signal show this behavior, and a first analysis has been carried out to check that all detected weak signals follow it.
In addition, two different methodologies have been used to evaluate the consistency of the algorithm. The first one consists of generating an additional dataset with documents from the years 2018 and 2019. Then, it is possible to check which weak signals have become strong signals, knowing that no documents from these years are in the dataset of the experiment that detected them.
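This first evaluation method amounts to a set comparison between the two experiments; the keyword sets below are hypothetical examples, not the actual results:

```python
def confirmed_weak_signals(weak_earlier, strong_later):
    """A detected weak signal is considered confirmed when it appears
    as a strong signal in the later, unseen dataset."""
    return sorted(set(weak_earlier) & set(strong_later))

# Hypothetical outputs of the 2007-2017 and 2018-2019 experiments:
weak = {"uvsq-sat", "west africa", "insar", "photometry"}
strong = {"uvsq-sat", "west africa", "insar", "noaa"}
print(confirmed_weak_signals(weak, strong))
```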
The second method consists of consulting a group of experts to know if there is a match between their opinion and the obtained results [32]. A group of five experts in remote sensing from different institutions were interviewed to compare the results of the test with their predictions.

Results
In this section, the output results of the implemented system applied to remote sensing will be described. Computing information about this experiment is provided in Appendix A.

Keyword Issue Map (KIM) for Remote Sensing
The two main factors that have been considered to configure the Keyword Issue Map are the number of documents where the term appears, and the geometric mean of an average time-weighted increasing rate of this frequency.
These potential weak signals have in common a low frequency of documents where the word can be found, but a high increasing rate. If a word has a high frequency of documents, the word is connected to a strong signal.
These two factors are used to measure the issue component of the sign from the semiotic model of the future sign.
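A possible sketch of the time-weighted measure used in both maps is given below; the exponential emphasis on recent years is an assumption about the weighting scheme, whose exact form is given by Equations (1) and (2):

```python
from math import prod

def time_weighted_geometric_mean(yearly_rates, weight=1.2):
    """Weighted geometric mean of year-over-year increase rates,
    where more recent rates receive a larger (assumed) weight."""
    weights = [weight ** i for i in range(len(yearly_rates))]
    total = sum(weights)
    # Weighted geometric mean: prod(r_i ** (w_i / total))
    return prod(r ** (w / total) for r, w in zip(yearly_rates, weights))

# A term whose yearly growth accelerates scores above 1:
print(round(time_weighted_geometric_mean([1.0, 1.5, 2.0]), 3))
```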
A group of 248 words that belong to the cluster of potential weak signals, according to the criterion of their degree of diffusion (DoD), has been detected.
In addition, the interpretation dimension of the sign is measured using the degree of transmission (DoT) of every word, based on the h-index of the journals where it appears, giving a different size to every dot according to this value.
The clustering algorithm has determined an average document frequency threshold of 104.54. Above this threshold, no words are considered weak signals. Figure 6 shows the Keyword Issue Map generated in the test of the system.

Keyword Emergence Map (KEM) for Remote Sensing
The two main factors that have been considered to configure the Keyword Emergence Map are the number of appearances of a term, and the geometric mean of an average time-weighted increasing rate of this frequency.
These potential weak signals have in common a low frequency of occurrences of the word, but a high increasing rate. If a word has a high frequency of occurrences, the word is connected to a strong signal. These two factors are used to measure the signal component of the sign from the semiotic model of the future sign.
A group of 233 words that belong to the cluster of potential weak signals, according to the criterion of their degree of visibility (DoV), has been detected.
In addition, the interpretation dimension of the sign is also measured using the degree of transmission (DoT) of every word, based on the h-index of the journals where it appears, giving a different size to every dot according to this value.
The clustering algorithm has determined an average occurrence frequency threshold of 616.01. Above this threshold, no words are considered weak signals. Figure A1 in Appendix B shows the Keyword Emergence Map generated in the test of the system.

Detected Terms as Potential Weak Signals
A list of 87 words from these two maps fulfills both requirements of low frequency of occurrences and low frequency of documents where the word can be found.
Some of the terms identified in both the Keyword Emergence Map and the Keyword Issue Map are shown in Table 2. This table shows a list of some of the detected keywords, divided into three different layers (business needs, environmental or geographical factors, and product/technological components), their frequency-of-appearance values (DoD and DoV) measured using Equations (1) and (2) applied to the whole 11-year period, their normalized time-weighted increase ratios over the whole period, and their degree of transmission (DoT) measured using Equation (3). The table also shows the categories that were assigned automatically in the third stage of the system. Some detected categories for remote sensing are climate change, meteorology, geography, water research, space, and agriculture. The pattern of the degrees of visibility and diffusion of the word "desertification", one of the terms identified with a weak signal in the field of remote sensing, is shown in Figure 7. The graphs of DoV and DoD of every detected word show an abnormal pattern with a low frequency.

Results of the Multi-Word Analysis
Some of the words detected by the text mining analysis were selected for a multi-word analysis, with the objective of obtaining more accurate information.
A few expressions identified as potential weak signals are "climate engineering", "biosatellite engineering", "spectral splitting", "dioxide splitting", "adaptative encoding", "spectroscopic voxel", "cooperative coevolution", "desertification reversion", "global desertification", "ground photometry", "urban sprawl", "West Africa" and "UVSQ-Sat". Figure 8 shows the words that have the highest correlation with the terms "desertification" and "photometry", two of the detected terms related to weak signals. The figure below every term is the percentage of times that the word appears beside "desertification" or "photometry" in all the documents of the dataset.

Evaluation of the Results
In this experiment, some strong signals were detected in the 2018-2019 period. Scientific papers from 2018 and 2019 in the journal Remote Sensing showed the importance of West Africa [60][61][62][63][64][65], one of the expressions detected as a weak signal in the experiment on the remote sensing sector. The expression "UVSQ-Sat", which stands for "UltraViolet and infrared Sensors at high Quantum efficiency onboard a small SATellite", detected as a weak signal candidate and now a strong signal, appears for the first time in the title of a scientific paper from 2019 [66]. Finally, other detected keywords, such as "NOAA" (National Oceanic and Atmospheric Administration), "InSAR" (Interferometric Synthetic Aperture Radar), "Rosetta" and "SRTM" (Shuttle Radar Topography Mission), are also becoming strong signals in the last two years.

Regarding the group of experts, they indicated the following statements:

1. The growth of remote sensing services is attributed to effective and flexible data-gathering, thanks to higher metric resolutions, cloud computing software and machine learning techniques. Several terms, such as "adaptative encoding" or "voxel", were detected as related to weak signals.
2. Among the outstanding applications, agriculture, and especially desertification, are areas in which remote sensors will become more relevant. Desertification and other terms related to agriculture are keywords that the algorithm identified as related to weak signals.
3. Interferometric synthetic aperture radar, abbreviated "InSAR", a radar technique used in geodesy and remote sensing, is becoming more and more important. InSAR is a keyword that the algorithm identified as related to weak signals.
4. West Africa is becoming one of the most interesting areas in the world for remote sensing applications. "West Africa" is an expression that the algorithm identified as related to weak signals [60][61][62][63][64][65].

Discussion
The main findings of the obtained results and the main limitations of the methodologies will be described in this section.

Main Findings
The generated input dataset consists of multiple documents about the remote sensing sector extracted from three different types of sources. There is a small number of newspaper articles compared with the number of scientific papers, which suggests that newspapers filter the content of scientific journals, publishing only relevant results and their applications [67]. However, the large number of tweets found shows that society has a special interest in the applications of this sector.
Previous studies such as Koivisto et al. [22], Yoon [33] and Griol-Barres et al. [49] have only detected weak signals related to terms within standard lists of keywords. The results show that several terms that can rarely be found in standard databases can be detected by this system, providing more reliable results. For example, the experiment shows that the system can detect words such as "ENSO" (El Niño-Southern Oscillation), "NOAA" (National Oceanic and Atmospheric Administration) or the Chinese region of Wuhan, because every word in every document is processed as a potential keyword.
The keywords related to weak signals are isolated terms that do not provide enough information for a rich analysis. The automatic assignation of categories to the weak signals detected provides more useful information. For instance, the word "Africa" was detected as one related to weak signals. The Classification stage provided the categories of "climate change" and "water research" for the word "Africa", giving additional information about applications and subsectors where this word is becoming more relevant. As previously stated, these categories are obtained automatically, considering the names, keywords, topics, and Special Issues of the scientific articles and journals where the word "Africa" is present.
A multi-word analysis was performed to detect expressions that provide more accurate information, which can be useful for experts and other stakeholders in their decision-making processes. For this process, only the 87 detected terms were considered. For instance, "West Africa" was a popular expression detected, providing more information about the weak signal. Another example is the term "desertification", which refers to a problem that can be tackled by remote sensing applications, while the multi-word expression "desertification reversion" refers to new opportunities to solve that problem.
After obtaining the results, the next step is to evaluate and validate them. In standard applications of neural networks, the dataset is usually divided into subsets to train and test the network. However, this is not recommended in applications that detect signals of the future, due to the low frequency of occurrences of words related to weak signals. In a previous study related to weak signals [33], results were validated with a deeper evaluation of the same documents of the dataset used for the test. This is also not recommended, because results should be confirmed with external factors and not by using the same input dataset.
In another study [37], a machine learning technique known as transductive learning was used to train the system, exposing the algorithm to all the examples (i.e., all written texts), but using specific labels (i.e., selected actions) for a subset of these examples. The main difference with this study is that, in this proposal, labels or categories are generated automatically, so the system can be used in any sector.
Two evaluation methods have been applied. The first method, which has been used in previous works [33,49], is based on the detection of strong signals in a new input dataset containing more recent documents than the ones considered in the experiment. In this method, the presence, in recent documents, of strong signals that were detected as weak signals in earlier references is considered a confirmation of the success of the weak signal detection. To confirm the experimental results, a new test was conducted with a dataset of documents from 2018 and 2019 related to remote sensing; a group of the detected weak signals have become strong signals. The second method of evaluation was to gather relevant indicators from a group of experts in the field. The results have confirmed that the weak signals detected in our experiments matched the indicators identified by a panel of experts in the field of remote sensing. In conclusion, the obtained results and the methodologies applied in the evaluation show interesting and promising points that could be useful to detect new opportunities for entrepreneurs and other stakeholders, but the study also shows several limitations, which will be addressed in future work.

Limitations
Although the study results are interesting, the experiment has confirmed that single word expressions do not generally provide relevant information for a better understanding of future trends. Furthermore, although systems to detect weak signals have been evaluated in various sectors [68,69], new studies in other fields should be carried out, to continue testing whether the proposed system is portable to other domains.
The lists of detected words in both the Keyword Issue and Emergence Maps present promising results, but they can also contain false positives that are not really connected to weak signals. A cost-effective method to avoid this situation has been adopted, by considering the words that appear in both lists, but this approach has the limitation of discarding some weak signals.
The automatic assignation of categories and the application of natural language processing techniques, such as the multi-word expression analysis, provide more useful information to entrepreneurs and other stakeholders. However, to make the system provide clearer and more definitive guidance for decision-making, more natural language processing techniques could be applied, such as bag-of-words recognition, regular expressions, or sentiment analysis [70], which will be tested in our future work.
The evaluation techniques applied to the system also present several limitations. The main problem of using a more recent input dataset to validate the results is that this can only be done when a new dataset of documents is available. This limitation must be accepted, as the study is concerned with the prediction of future trends and, consequently, the only fully reliable check is to wait until that future becomes the present. On the other hand, using a group of experts to evaluate the results allows an immediate assessment, but has the limitation that they could discard some weak signals that are outside their field of knowledge.
Despite these limitations, the performed experiment and its evaluation show that the implemented system achieves good reliability, according to the opinion of the group of experts in remote sensing and the dataset of documents from 2018 and 2019.

Conclusions
This study describes the design, implementation, and evaluation of a system to identify weak signals of the future. This system is designed to help experts, startups and other stakeholders in their decision-making processes, due to the difficulty of scanning complex environments full of textual information.
The system has been designed under the constraints of high efficiency, the use of multiple types of input sources containing unstructured textual documents, applicability to every field of study, and the extraction of conclusions based on quantitative results that do not depend on the opinion of experts [71]. The experimental setup of the system includes a dataset of documents from 2007 to 2017 related to the field of remote sensing. Two methodologies to evaluate the experiment (a new dataset with documents from 2018 and 2019, and a group of experts) were applied, with promising results that can be used by entrepreneurs and other organizations in their decision-making processes.
The system depends only on its input dataset of documents. If the system is to be tested in a different sector, it is only necessary to obtain a different input dataset with documents related to that sector, following the instructions explained in Section 2. In addition, although some studies use standard category lists [49], one of the main advantages of assigning categories automatically and dynamically is that they depend only on the input dataset and are therefore the most relevant ones for the field of study. In conclusion, the system can be used to detect weak signals of the future in multiple sectors. The outputs are obtained by a quantitative scoring methodology that does not depend on the opinion of experts.
Finally, in future research, the system will be tested in other sectors, and new natural language processing techniques will be applied.
The execution times show that the parallel GPU implementation is up to nine times faster than the standard sequential alternative using CPUs. The reason for this optimization is that GPUs have a large number of simple cores that allow parallel computing through thousands of threads at a time, and the system requires multiple operations related to every word in every document that can easily be executed in parallel. Table A1 shows the execution times of the experiment. These results show that, despite the large number of documents processed, the system has been designed efficiently, so that the calculation can be performed on medium-level hardware and, therefore, on equipment affordable for entrepreneurs and SMEs.
Finally, Table A2 shows a benchmark of operation performance for different database technologies [45]. Based on this benchmark, MongoDB [46] was selected to store the data extracted from documents, as shown in the definition and implementation of the system.