A Natural Language Processing Approach to Social License Management

Abstract: Dealing with the social and political impacts of large complex projects requires monitoring and responding to concerns from an ever-evolving network of stakeholders. This paper describes the use of text analysis algorithms to identify stakeholders' concerns across the project life cycle. The social license (SL) concept has been used to monitor the level of social acceptance of a project. That acceptance can be assessed from the texts produced by stakeholders on sources ranging from social media to personal interviews. The same texts also contain information on the substance of stakeholders' concerns. Until recently, extracting that information necessitated manual coding by humans, a method that takes too long to be useful in time-sensitive projects. Using natural language processing algorithms, we designed a program that assesses the SL level and identifies stakeholders' concerns in a few hours. To validate the program, we compared it to human coding of interview texts from a Bolivian mining project from 2009 to 2018. The program's estimation of the annual average SL was significantly correlated with rating scale measures. The topics of concern identified by the program matched the most mentioned categories defined by human coders and identified the same temporal trends.


Introduction
Large complex projects generate diverse social and political impacts on a diverse network of stakeholders that itself is constantly evolving across the life cycle of the project [1]. Managing projects in a sustainable manner requires an awareness of the multiple linkages the project creates and extinguishes in the social, economic, and political ecosystem in which it is embedded [2]. Indeed, the continuance of the project itself can be called into question by stakeholder coalitions whose concerns have been overlooked or dismissed [3]. Early detection of, and response to, stakeholder discontent can make the difference between costly delays and smooth progress [4]. This paper describes text analysis technology that can unlock the access to stakeholder concerns at a pace quick enough to be useful in a dynamic project management environment.

Time-Sensitive Insights Buried in Texts
The extractive industries have conceptualized stakeholder acceptance of their projects as the social license (SL) [5]. Quantitative SL measures were developed using rating scales in personal interviews [6]. During the same interviews, open-ended questions solicited information on the concerns and priorities of each stakeholder [7]. However, the analysis of the texts of the answers has traditionally relied on content analysis methods, which can take months, depending on the number of interviews and the depth of the analysis [8,9]. The delay can make results outdated by the time they are delivered.
The advent of internet communications has opened new vistas for the discovery of stakeholders' issues and discontents, especially for projects affecting populations where internet usage is high. To the traditional text sources of interview transcripts, transcripts of broadcasts, and written letters and reports, project managers can now also get texts of stakeholder opinions from social media, transcriptions of podcasts, and webpages. As a result, large complex projects can now obtain enough texts to use sentiment analysis for the ongoing estimation of their social license level. Moreover, natural language processing (NLP) algorithms can be applied to the same texts to identify the stakeholders' issues and concerns, producing results similar to content analysis, but in much less time. This paper presents data relevant to the empirical validation of a program designed to do both these things.

Social License Measurement
The term "social license to operate" first gained currency in the mining industry [10] as a metaphor comparing community acceptance or rejection with legal permits and licenses. Joyce and Thomson [11,12] defined the SL as the level of acceptance or approval granted to a project by its stakeholders. The organization being granted the social license is known as the focal organization. Morrison [13] expanded the SL concept, noting that all types of organizations need a social license for their activities, be they non-government organizations, private companies, or governments. Thus, the concept applies equally to temporary project-focused consortia. The stakeholders who grant the SL are defined in Freeman's stakeholder theory of strategic management [14] as individuals, groups, or organizations that either (a) are, or could be, affected by the focal organization, or that (b) could choose to have an effect upon it. Thus, those who choose to publicly praise or denounce a project on social media meet the definition of stakeholders. It is possible to include the general public as a stakeholder to the extent that their collective opinions would affect public policy regarding a project [15].
Boutilier [6] developed a quantitative measure of the SL based on agree/disagree ratings of statements. Using data from personal interviews with stakeholders of a mine, Boutilier found that the manual coding of texts from open-ended answers for praise versus criticism correlated highly with the quantitative measure. This article takes the next logical step by hypothesizing that the results of sentiment analyses performed by machine learning algorithms would correlate with Boutilier's rating scale-based SL measure. If so, it may then become possible to estimate the level of SL from any adequate sample of texts, including social media, mainstream media articles, and blogs. At the same time, we analyze those texts again with the aim of extracting the topics of discussion that they contain. This application of topic analysis methods promises to reduce the time required to produce results that would take too long to obtain using manual content analysis.
In the Materials and Methods section, we briefly introduce the literatures on the NLP techniques known as sentiment analysis and topic analysis. Then, we describe the data used to validate the program. It consists of rating scale social license scores and manually coded texts from interviews with the stakeholders of a mining project in Bolivia at eight points over a ten-year period. Next, we perform simple statistical analyses to compare the manual method findings with the program's output. Finally, we discuss the limitations of the study and the questions raised for future research.

Materials and Methods
NLP is a computational methodology for processing textual data in a way that can be interpreted and analyzed by a computer [16,17]. Since a computer cannot understand the inherent meaning behind individual words or phrases, NLP works by looking for patterns in the words that are frequently used together, and the user-defined meaning ascribed to a given combination of words. As in other areas of machine learning (ML), there are two main types of techniques that can be employed for computationally finding meaning in patterns: supervised learning, in which the user provides examples that are labeled according to their meaning categories, and unsupervised learning, in which the computer establishes its own boundaries between clusters of data, based on proximity of data points, relative frequency of values, and other quantitative criteria [18]. The program presented here, called the Social License and Controversy Detector and Analyzer (SLaCDA), uses both types of learning to accomplish the goals of sentiment classification and topic identification, respectively.

Sentiment Analysis
Sentiment analysis is a type of supervised machine learning NLP technique in which a set of pre-classified training data is used by the machine to learn rules that it can use to correctly replicate the classification of a given piece of text [19][20][21]. In the context of sentiment analysis, many samples of data can be classified by human coders as either positive or negative (or praise or criticism). These data are given to a set of algorithms that are then "trained" by guessing the category assigned to a given piece of text. The algorithms are able to revise future guesses based on whether previous guesses were correct or incorrect. Over time, the algorithms learn which combinations of words are typical of a negative or positive sentiment, and they can be used to automatically report the sentiment of a text that has not been classified by humans.
Since the idea of "positive" and "negative" sentiment can be subjective and depend on the speaker, the target, the subject matter, and several other factors, the data used to train the algorithms are domain specific [22]. This means that a general corpus of positive and negative sentiments about, say, movies or products from an online store, will not perform well when asked to produce sentiments for text from a corpus of comments about a resource development or infrastructure project. For that reason, SLaCDA's training data come from the comments of stakeholders in previous surveys designed to measure the SL of a specific focal organization or industry. It should also be noted that the dependence of sentiment analysis algorithms on frequencies and associations with specific words out of context means that it is notoriously difficult to detect sarcasm or irony using these algorithms [23]. This is less of a problem for texts gathered from in-person interviews than, say, social media posts, because casual online opinions are more prone to the use of sarcasm [24] than in-person, topic-specific interviews.
There are several useful machine learning algorithms that have been developed by computer scientists over the years, and the choice of which to use depends largely on the goals of the learning and the type and amount of data available for training [25]. These constraints led the authors to use linear, multinomial, and Bernoulli naïve Bayes algorithms, stochastic gradient descent, and support vector machines, which were implemented using the sklearn package in the Python language [26]. The details of each of these algorithms are beyond the scope of this paper and will not be given here, but in general, the Bayes algorithms use Bayesian statistics to classify texts under the assumption that the distribution or probability of features (words) in a given category follow a Gaussian, multinomial, or binomial (Bernoulli) distribution [27,28]. Stochastic gradient descent is an optimization algorithm that seeks to minimize error by iterating over randomized samples of the training data until it converges [29]. Support vector machines map data points onto spatial values and look for gaps between physical clusters of data [30]. In addition to these algorithms, a voted classifier was used to settle disagreements between the classification algorithms by allowing each algorithm a single vote (positive or negative) and reporting the classification of the majority.
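As a rough illustration of this ensemble approach, the sketch below trains several scikit-learn classifiers on a handful of invented stakeholder-style comments and wraps them in a hard-voting classifier. It is not the authors' production pipeline, and the texts and labels are hypothetical.

```python
# Minimal sketch (not the study's exact code) of an ensemble sentiment
# classifier with majority voting, as described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

# Toy training data: short stakeholder-style comments labeled by sentiment.
texts = [
    "the company supports our community projects",
    "we appreciate the jobs and training provided",
    "the mine polluted our water supply",
    "they never respond to our complaints",
    "good communication and real benefits for families",
    "broken promises and environmental damage",
]
labels = ["pos", "pos", "neg", "neg", "pos", "neg"]

clf = make_pipeline(
    CountVectorizer(),
    VotingClassifier(
        estimators=[
            ("mnb", MultinomialNB()),
            ("bnb", BernoulliNB()),
            ("sgd", SGDClassifier(max_iter=1000, tol=1e-3, random_state=0)),
            ("svm", LinearSVC()),
        ],
        voting="hard",  # each algorithm gets one vote; the majority wins
    ),
)
clf.fit(texts, labels)
print(clf.predict(["we are grateful for the community benefits"]))
```

Hard voting requires only that each member algorithm produce a class label, which is what allows probabilistic (naïve Bayes) and non-probabilistic (SVM) classifiers to be combined in one committee.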

Topic Analysis
Automatically extracting the main topics and issues mentioned by stakeholders can be done using an unsupervised machine learning technique called latent Dirichlet allocation (LDA) [31]. To understand this methodology, it is helpful to think of a given piece of text as a collection of words, each with an associated probability of being attached to a given topic. Each topic can be thought of as a probability distribution comprised of the words one would expect to use when speaking about that topic [32]. For example, one would not expect to use words such as "algorithm", "Bayesian", or "natural language" when speaking about basic agriculture or mining practices, but one might expect to use them when speaking about machine learning techniques. Of course, there is some overlap in the words used between topics (e.g., "machine" or "distribution" might be used to talk about both agriculture and machine learning), but there is a different probability associated with each word in each topic, and the difference is even more starkly contrasted when looking at the probabilities of co-occurrences of words within a given topic.
LDA works by looking at the frequency of words within a corpus of text and grouping them into "bags" of words based on their probability of co-occurrence. By looking at the words in a given bag, analysts can deduce the general and specific meaning of the bag and the topics that make up the corpus. By searching individual texts for the words in these bags, it is possible to infer the topics being addressed by any given speaker, facilitating understanding of specific stakeholder concerns, which can then be addressed in order to maintain or improve a level of SL for a given project.
In SLaCDA, the LDA method was implemented using a Python wrapper around a java-based package called Mallet, which was developed by Andrew McCallum at the University of Massachusetts and released under a common public license [33].

The San Cristóbal Case
The texts used in this study came from the stakeholders of a mining project in Bolivia named Minera San Cristóbal. The mine has been the subject of several books [10,34,35] and chapters [36] owing to both its colorful history and its importance to the Bolivian economy. Personal interviews were conducted with leaders of stakeholder organizations every 15 months over a period of ten years. In total, 825 stakeholder interviews were conducted. The interviews followed the format described by Boutilier [4]. The level of SL granted to the company by the stakeholder organization representatives was solicited using rating scale measures [10].
The interviews included open-ended questions that generated texts about stakeholders' concerns and priorities.

Two Methods for Estimating the SL from Short Texts
The short texts from open-ended responses obtained in 825 interviews conducted in eight field periods (i.e., tracking 'waves') over ten years were manually classified as indicating a high or low SL. Eighty percent of these texts were used as training data to train the sentiment analysis algorithms mentioned above. Twenty percent were reserved for testing the accuracy of the algorithm.
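The 80/20 protocol can be sketched as follows with scikit-learn's train_test_split; the labeled texts here are invented stand-ins for the interview responses:

```python
# Sketch of the 80/20 train/test protocol described above, with a simple
# accuracy check on the held-out 20%.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# Hypothetical labeled short texts (high vs. low SL); the study's real
# training data came from the 825 interviews described above.
texts = [
    "the company keeps its promises to us",
    "great support for local schools",
    "we benefit from the new jobs",
    "good relations with the communities",
    "we trust the company's engineers",
    "they ignore our complaints",
    "dust and noise ruin our town",
    "no real benefits reach the families",
    "the water is worse every year",
    "the royalties never arrive here",
]
labels = ["high", "high", "high", "high", "high",
          "low", "low", "low", "low", "low"]

# Reserve 20% of the labeled texts for testing, as in the study.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```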

Identifying Issues and Preoccupations Using Human Coders
The manual coding categories were developed using a grounded theory approach [37], which involved inductively developing tentative categories and then modifying them through a reflective process of testing their adequacy to capture the meaning in the open-ended text responses. A total of 89 categories, known as a "coding frame", were created to capture distinct ideas and topics of concern or interest to interviewees. Each category was assigned a numeric label as its code number. The creation of the coding frame required expert judgment and a knowledge of all the themes mentioned by all the stakeholders. Then, coders were trained each year on how to use the coding frame. The texts of the responses for each stakeholder were read by a trained coder who assigned to the text a code number for each idea as it was encountered.
By way of quantifying the findings in each measurement wave, a per-capita rate of mention was calculated for each idea that had a code. Then, the mention rates could be compared across sub-categories of stakeholders and across years. The process typically took two weeks per interviewing wave.
The 89 codes were submitted to a hierarchical cluster analysis using Johnson's average linkage method to create the clusters. Codes that were mentioned together by stakeholders were merged.
The decision about what level of collapsing to accept was made using criteria that emphasized the distinctiveness of each idea. For example, the code for "water scarcity" and "mitigations or infrastructure for water" were merged into a collapsed code called "water supply". The result was 52 merged codes for stakeholders' concerns.
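A minimal sketch of this merging step, assuming codes are represented by which stakeholders mentioned them (the profiles below are invented), uses SciPy's average-linkage clustering:

```python
# Illustrative merging of related codes via hierarchical cluster analysis
# with average linkage, as described above.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows: codes; columns: stakeholders; values: 1 if the stakeholder
# mentioned the code. Codes mentioned together get similar rows.
codes = ["water scarcity", "water infrastructure", "jobs", "training"]
profiles = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],   # same stakeholders as "water scarcity"
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
])

Z = linkage(profiles, method="average")          # average-linkage merging
merged = fcluster(Z, t=2, criterion="maxclust")  # collapse to 2 clusters
for code, cluster in zip(codes, merged):
    print(code, "->", cluster)
```

In this toy example the two water codes fall into one cluster and the two employment codes into another, mirroring the collapse of "water scarcity" and "mitigations or infrastructure for water" into "water supply".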

SL Estimation: SLaCDA vs. Manual Coding
The algorithm agreed 83% of the time with human coders on the appropriate sentiment (positive (pos) or negative (neg)) of each piece of text in the testing set. This is typical performance for sentiment analysis algorithms [38], which range from the high 50s to the low 90s in terms of accuracy percentage. The human coders themselves only agreed 85% of the time and had to negotiate "correct" classification for the remaining 15% of cases. A chi square was calculated on a contingency table with the observed frequencies in the "pos" and "neg" categories for both the human coders and sentiment analyzer. The chi square was 273 at 1 degree of freedom, which was significant at p < 0.01. Therefore, the SLaCDA algorithm performs very well on the task of classifying the SL based on sentiment analysis of short texts.
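The agreement test can be reproduced in outline with scipy.stats.chi2_contingency on a 2x2 table of human versus machine labels; the counts below are invented for illustration and are not the study's data:

```python
# Sketch of the chi-square test of agreement between human coders and
# the sentiment analyzer, using an invented 2x2 contingency table.
from scipy.stats import chi2_contingency

#              machine: pos   machine: neg
table = [[70, 10],    # human: pos
         [12, 73]]    # human: neg

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p:.4g}")
```

A large chi-square with 1 degree of freedom indicates that the two labelings are strongly associated, i.e., far from independent.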

Identifying Issues and Preoccupations: SLaCDA vs. Manual Coding
Across all eight waves of interviewing, the five most mentioned collapsed codes produced by the manual coding were labeled as:
• "Jobs and development" (5.5 mentions per capita on average),
• "Requests for assistance and educational infrastructure" (4.6 per capita),
• "Desire for more communication" (3.3 per capita),
• "Health care access and infrastructure" (3.1 per capita), and
• "Water supply" (2.7 per capita).
As explained above, the SLaCDA program identified stakeholders' concerns by creating filtered lists or bags of relevant words. The particular algorithm used required the specification of the number of words per bag and the number of bags. After trial and error experimentation, it was found that the most interpretable results were produced when 25 bags were requested with ten words per bag. It required subjective expert judgment to label the bags. The five most mentioned bags were labeled as:
• "Community administration and projects" (2547 mentions),
• "Regional satisfaction with relations with the mining company" (2316 mentions),
• "Community benefits" (2274 mentions),
• "Regional economic challenges" (2081 mentions), and
• "Royalties and their recipients" (1995 mentions).
Thematically, the SLaCDA list overlaps with the manual top five list on most of the points, but both lists also captured some distinct topics. The manual list's mention of economic factors was captured by SLaCDA's bags related to regional satisfaction with the mining company, community benefits, and regional economic challenges. The manual method detected concerns about infrastructure in education, health, and water supply. SLaCDA detected these issues in the bags dealing with community projects and community benefits. The manual code for a desire for more communication overlapped with SLaCDA's satisfaction with relations.
The manual method highlighted preoccupations with health care access, while SLaCDA did not find it to be mentioned very often. Conversely, SLaCDA noted a frequent theme related to royalties and their recipients, while the manual method did not rate it as mentioned so often. In both these cases, both systems detected both issues but differed regarding their frequency of mention. The differences can be characterized as a result of contextualized coding by humans versus word-pair-oriented coding by machine. The machine was a little better at detecting issues that were described by stakeholders using the same words repeatedly. The human coders were more sensitive to issues described in many different words and phrases connected to the same idea.

Temporal: Detection of the Same Trends
Some of the manually coded issues that were condensed through a cluster analysis showed distinct trends across the years. SLaCDA would survive a disconfirmation test if it also showed the same trends. Pairs of codes from the two methods were selected for comparison if two conditions were met. First, they had to have the same or overlapping themes. Second, the manual data had to clearly show one of the four trend patterns in their polynomial trendline in Microsoft Excel. Sixteen pairs met both criteria. The four trend patterns were: decline (n = 5), increase (n = 7), fall and rise (i.e., U-shaped, n = 0), rise and fall (i.e., inverted-U, n = 4). The null hypothesis was that SLaCDA would not be able to match these trends at a rate better than chance.
The per-capita mention frequencies were converted into z-scores (standard scores) within years. Of the 16 comparison pairs, SLaCDA showed the same pattern as the manual method in 13 (81%) cases. A chi square was calculated on a contingency table with the observed frequencies in the decline, increase, and inverted-U categories for both methods. The chi square was 14.4 at 4 degrees of freedom, which was significant at p < 0.01. Twelve of the 16 were based on linear trends (i.e., increase or decline across the years) and could therefore be compared using Pearson's r correlation coefficient. The average correlation across the 12 was 0.66 and ranged from 0.40 to 0.88. Eleven of the 12 showed positive correlations of at least 0.50. Seven of the 12 correlations were significant at p < 0.10 at 6 degrees of freedom. Panels A to E of Figure 1 provide visual examples of the correlations for the five linear-trend topic comparisons that were not significantly correlated at p < 0.05 (df = 6). Panel A shows the trends for both methods on topics dealing with improved family incomes. Panel B shows the trends on the topic of government income. The trends in Panel C are on the topic of restricted access to health care. Trends on compliance with environmental standards appear in Panel D. Panel E shows topics from both methods related to the establishment of collaborative relations. The second-order polynomial trendlines are included to give a picture of the extent to which the relations were and were not correlated. The Pearson's r correlation coefficients for the measures in each panel were A: 0.50, B: 0.56, C: 0.60, D: 0.57, and E: 0.40. In all cases, even the pairs that failed to reach statistical significance trended in the same direction.
Sustainability 2020, 12, x FOR PEER REVIEW
The non-significance of these relationships probably resulted from more dissimilarity in the definitions of the manual codes and the word bags. Human coders were able to make finer distinctions than SLaCDA, partially because they had access to the identities of the stakeholders making the comments. The additional information, plus the ability to take account of additional semantic information, likely accounts for the gap that still remains between machine coding algorithms and human coders.
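The trend comparison described above can be sketched as follows; the yearly mention rates are invented, and note that Pearson's r is unaffected by the standardization, which mainly serves to put both methods' trendlines on a common scale:

```python
# Sketch of the trend comparison: per-capita mention rates from the two
# methods are standardized (z-scores) and compared with Pearson's r.
import numpy as np
from scipy.stats import pearsonr

# Invented eight-wave mention rates showing a declining trend.
manual = np.array([5.1, 4.8, 4.0, 3.5, 3.1, 2.6, 2.2, 1.9])
slacda = np.array([2200, 2100, 1900, 1750, 1600, 1400, 1350, 1200])

def zscores(x):
    return (x - x.mean()) / x.std()

r, p = pearsonr(zscores(manual), zscores(slacda))
print(f"Pearson r = {r:.2f} (df = {len(manual) - 2})")
```

With eight waves, each correlation has 6 degrees of freedom, matching the tests reported above.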


Inter-Group: Detection of the Same Inter-Group Differences
Differences in rates of mentions of topics were compared across subgroups of stakeholders. Statistical analyses were not possible, because the topic categories were not strictly comparable between the manual method and the SLaCDA method. However, in general, the subgroups of stakeholders with the most distinctive interests showed higher mention rates for the topic categories most aligned with those interests regardless of the method used. Table 1 compares the topics of interest that were most mentioned under each method for the subgroups with the most distinctive interests. There were other interest groups whose interests were not as focused. For example, the news media and politicians tended to mention a broad range of topics without much focus on any one of them. The two methods did not produce such clear matches for them, because fewer of the interests were distinctive to them as subgroups. Although these are not quantitative findings, the impression is that the more distinctive the interest of the group, the more both methods created a category for their interests, and the more both methods identified members of the group as having those interests. This is tautological for the manual method but not for SLaCDA, because the SLaCDA topic module knew nothing of the group identities of the stakeholders mentioning word pairs.

Discussion
The now commonplace algorithms for sentiment analysis can be used as proxies for the SL. Calibration via weighting may be needed when data do not come from the most affected or motivated stakeholders.
The Mallet algorithm uses Bayesian probabilities and, by itself, produces unstable word lists. The Mallet model already uses a Markov chain Monte Carlo algorithm for sampling the data, but we found a lack of stability in the bags across runs of the Mallet model on the same piece of text. In order to stabilize these topic bags and adequately interpret them, we used an additional Monte Carlo procedure and interpreted the bags based on the most frequent occurrences across 100 runs of the Mallet model. Our superimposition of the Monte Carlo method created the stability and interpretability desired.
Beyond the contribution made by this paper in terms of the specific algorithms used, the study illustrates the potential of the entire strategy of using computational linguistics and natural language processing to promote positive stakeholder relations and manage socio-political risk. To that end, a distinction needs to be drawn between organizations with hundreds of thousands of involved stakeholders and those with only a few thousand or less. Different methods for gathering text data must be used in each case.
Organizations with hundreds of thousands of stakeholders can apply these techniques to text obtained from the internet because the volume of text is likely to be sufficient to give a valid assessment of the SL level and the issues behind it. Those who comment on social media are not a random sample of the public, but that is a strength in some ways, rather than a drawback. Those motivated enough and informed enough to comment on social media are more likely to influence the future opinions of the vast majority of the public who have less interest. Indeed, this logic has spawned a burgeoning industry of algorithm suppliers with ever more graphical interfaces and analytic capabilities integrated into their commercial offerings.
The case of organizations with only a few thousand stakeholders falls closer to that of many projects in infrastructure, natural resources, energy, and organizational restructuring in government or large corporations. Here, the number of commentators on social media, blogs, and websites is usually too small to even represent the views of opinion leaders. Part of the reason is that smaller populations are more likely to be linked in face-to-face networks. Face-to-face networks compete for influence with online networks. Research has shown that face-to-face networks are far more effective [39,40] in determining opinions. Therefore, with projects of this type, a different kind of input data is needed for SL and issue detection algorithms. As with all models, these algorithms are only as good as the data used to train and validate them. Therefore, data collection and validation for SLaCDA are ongoing. In the future, we hope to expand the validation to other projects with other data sources in order to ensure that the program performs well within and across sectors/projects. A database of SL measurements in diverse countries, languages, and sectors is currently being compiled for that purpose.
Open-ended responses to interview questions not only meet the need to understand the concerns of stakeholders but also provide the context in which further data about face-to-face network alliances can be solicited. From that, the stakeholder network can be constructed [41], which makes it possible to use centrality scores to weight stakeholders by their influence over the opinions of others [42]. Such weightings are strategically important because they reveal which stakeholder engagement initiatives, with precisely whom, would do the most to reduce opposition while strengthening support. In this scenario, the algorithms presented here not only reduce the time and money required for such analyses but also raise the effectiveness of the stakeholder engagement initiatives that are devised.
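As a toy illustration of such weighting (all names, ties, and sentiment scores are invented), a simple degree-centrality score can serve as the influence weight:

```python
# Sketch of influence-weighted sentiment: build a stakeholder network,
# score each actor's centrality, and weight sentiments accordingly.
# edges: pairs of stakeholders that report an alliance
edges = [("mayor", "union"), ("mayor", "ngo"), ("mayor", "cooperative"),
         ("union", "ngo"), ("farmer", "cooperative")]

# sentiment of each stakeholder toward the project (+1 pos, -1 neg)
sentiment = {"mayor": -1, "union": 1, "ngo": -1,
             "cooperative": 1, "farmer": 1}

# degree centrality: share of the other actors each actor is tied to
actors = sorted(sentiment)
degree = {a: 0 for a in actors}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
centrality = {a: degree[a] / (len(actors) - 1) for a in actors}

weighted_sl = (sum(centrality[a] * sentiment[a] for a in actors)
               / sum(centrality.values()))
print(f"influence-weighted sentiment: {weighted_sl:+.2f}")  # -> +0.00
```

Here the unweighted average sentiment is +0.20, but the influence-weighted score is 0.00 because the most central actor is a critic. Centrality measures from the network literature (e.g., eigenvector centrality) would refine these weights.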

Conclusions
Social acceptance of resource and infrastructure projects is highly complex and dynamic. In order to mitigate the risks inherent to a lack of SL, it is necessary not only to know the general sentiment of stakeholders toward the focal organization, but also to know what the specific concerns of any given stakeholder are, as movement on these issues has a high potential to affect the individual sentiment, and consequently the SL. Previous manual-coding methods of identifying stakeholder issues are time consuming, to the point that by the time an issue is identified, it may already be either resolved or escalated. Using advanced machine learning techniques such as sentiment analysis and topic modeling, it is possible to reduce the time necessary for analysis from weeks or months to minutes or hours, allowing project managers to understand and address stakeholder concerns in real time. This paper shows the effectiveness of both sentiment analysis and topic modeling on texts gleaned from stakeholder interviews, with an 83% agreement on the sentiment of tested data as well as agreement on both the topics mentioned and the temporal trends in the frequency of those mentions. In the future, the authors plan to expand the sentiment analysis training dataset and further validate the model with data from other countries, projects, and sectors. Additionally, efforts to link topics together based on the frequency of their co-mentions and analyze these topic networks in relation to their associated stakeholder networks are also planned for future publication. This ongoing work represents a powerful potential tool for project managers interested in better understanding their stakeholders and overcoming potential SL risks.