An Application of Natural Language Processing to Classify What Terrorists Say They Want

Abstract: Knowing what perpetrators want can inform strategies to achieve safe, secure, and sustainable societies. To help advance the body of knowledge in counterterrorism, this research applied natural language processing and machine learning techniques to a comprehensive database of terrorism events. A specially designed empirical topic modeling technique provided a machine-aided human decision process to glean six categories of perpetrator aims from the motive text narrative. Subsequently, six different machine learning models validated the aim categories based on the accuracy of their association with a different narrative field, the event summary. The ROC-AUC scores of the classification ranged from 86% to 93%. The Extreme Gradient Boosting model provided the best predictive performance. The intelligence community can use the identified aim categories to help understand the incentive structure of terrorist groups and customize strategies for dealing with them.


Introduction
Knowing the aims of terrorist attacks can help the intelligence community devise different strategies to deal with the perpetrators. Hence, the main goal of this research is to identify terrorist aims and classify them into a finite set of categories. This research defines the aims of terrorists as distinctly different from the root causes of terrorism. Aims are the agenda or desired outcome of an attack (e.g., provoking a US attack upon Muslims, total war upon "non-believers", and the creation of a global Caliphate promoted by Al-Qaeda), whereas root causes explain the birth of terrorists (e.g., the desire to establish Islamic rule worldwide) (Burke 2021).
There is no internationally agreed-upon definition of terrorism (Cassese 2006). Reich (1990) submitted that no single perspective can explain the complex and diverse phenomenon of terrorism. The dominant paradigm of research in counterterrorism assumes that terrorists are rational actors who attack civilians for political ends (Géron 2017). Therefore, as rational actors, terrorists must have an agenda or aim. However, systemic models that attempt to explain aims involve a difficult tradeoff between generalizability and accuracy (Hastie et al. 2016). Abrahms (2008) suggests that no question is more fundamental for devising effective counterterrorism strategies than what terrorists really want. Different aims may require different strategies to deal with the perpetrators. Hence, classifying attack aims instead of root causes brings another perspective to help advance the counterterrorism agenda. The recent work of Wong et al. (2021) strongly suggested that such knowledge is integral to governments' policy responses in countering the threat of terrorism and violent extremism. Consequently, the objectives of this research are to:

1. Apply natural language processing (NLP) techniques to extract features from the stated motive narratives of terrorist attacks to identify perpetrator aim categories (PACs).

2. Validate the effectiveness of the PAC classification by evaluating the predictive performance of 11 different machine learning (ML) models applied to a different narrative field, the event summary. The significance of evaluating multiple types of ML models is that no single model type can best represent all datasets, an accepted tenet in the practice of ML (Aggarwal 2015).
Aside from research on identifying root causes, there has been no counterterrorism research that classifies the general aims of perpetrators. Hence, this research contributes an artificial intelligence (AI) framework to help understand what terrorists want based on a "revealed preference" gleaned from narratives of what terrorists say they want. The framework combined NLP and ML techniques, both of which are subfields of AI (Aggarwal 2015), to aid human cognition in deriving the PACs. Intelligence communities can use the framework to classify motive narratives and guide decisions about the types of resources and mitigation efforts needed to best address the PAC identified for an event. Examples of resources are experts in psychology, sociology, policymaking, and law enforcement, as well as churches and educational institutions. Knowing the PAC can guide a more cost-effective selection of the appropriate resources much earlier in the counterterrorism effort.
The remainder of this article is organized as follows: Section 2 reviews related works that characterize terrorist motives. Section 3 applies NLP to extract PACs, and Section 4 validates the classification using ML techniques. Section 5 discusses the significance of the classification results and the effectiveness of the AI methods. Section 6 recaps the approach and the current findings, and points to future work to examine group activity patterns and their association with the identified PACs.

Literature Review
Krieger and Meierrieks (2011) cautioned that the phenomenon of terrorism is too complex to be reduced to a single root cause and panacea. Maszka (2018) found that although theoretical models in terrorism studies are useful, greater accuracy requires greater complexity. A recent analysis found that academic efforts to study the effectiveness of counterterrorism strategies are lacking (Balestrini 2021). Scholars in the terrorism studies literature tend to analyze motives from one of three perspectives: (1) root causes, (2) planned impacts on the public, and (3) aims of the perpetrators. The next three subsections summarize the literature on each analytical perspective. The fourth subsection discusses related work that applied ML to better understand various aspects of terrorism.

Root Causes
Treistman (2021) observed that research on the causes of terrorism tends to focus on macroscopic trends at the national level without considering contextual factors, such as social exclusion, that affect individuals. An analysis of 44 terrorist organizations listed by the European Union found that 45% of them were politically motivated, nearly the same proportion were social-revolutionary motivated, and religiously motivated groups accounted for about 10% of the total (Rothenberger and Müller 2015). An earlier study highlighted racism as a root cause (Björgo 1993). Araújo et al. (2020) found that the perception of immigrants as a threat affected social attitudes towards foreign groups. A more recent study discussed the role of shame and revenge as a root cause of terrorism (Cottee 2021). Rigterink (2021) also recently found evidence that revenge explained the increase in terrorist attacks after the assassination of a terrorist leader. Van Um (2011) characterized the root causes of terrorism based on need, such as the desire for unit-reinforcement and self-enrichment. Coccia (2018) presented statistical evidence that fatalities associated with terrorism concentrate in regions where high population growth rates may result in inequality, subsistence stress, and deprivation. Awareness of the need for self-enrichment increased with the proliferation of modern communications and social media (Höflinger 2021). On the other hand, user-generated data from social media can help analysts to profile users, but the noise generated and the proliferation of data size can pose significant challenges (Bilal et al. 2019). Höflinger (2021) posits that modern communications technologies have blurred the boundaries between the actions of terror organizations and individuals.

Planned Impacts on the Public
Monahan and Valeri (2018) posit that terrorists aim to maximize fear to achieve their strategic objectives. Terrorists seek attention and even claim responsibility for their acts (Morley and Leslie 2007). Masuku et al. (2021) found that terrorists use propaganda to expand their operations and seek sympathizers. News of terrorist attacks can threaten audiences' perceptions of societal safety (Tamborini et al. 2020). Hence, crowded urban areas are typically prime targets (Coaffee 2009). Canetti et al. (2021) found that hate-motivated attacks elicited stronger reactions from the public. One study found that among the tactics used, mass shootings and suicide bombings tend to result in the most casualties (Arce 2019).

Terrorist Aims
For studies in counterterrorism strategies, it is crucial to understand the decision processes of terrorists in order to plan countermeasures (Jaspersen and Montibeller 2020). Research found that deterrence requires knowledge about terrorist aims to inform proactive policies that can deter attacks (Enders and Su 2007). Kydd and Walter (2006) suggest that the aims of terrorists are to change minds through five strategies: attrition, intimidation, provocation, spoiling, and outbidding. Cottee and Hayward (2011) characterized terrorist motives from the perspective of three desires: the need for excitement, ultimate meaning, and glory. Kurtulus (2017) found evidence that terrorists aim to mobilize their constituency, avenge their fallen associates, and physically destroy their perceived enemies. Abrahms (2008) offered that the most common strategies to fight terrorism are to diminish the aims via strict no-concession policies or via appeasement. However, the aims of terrorist leaders and their operatives may be misaligned. In an analysis of terrorist propaganda videos, Abrahms et al. (2017) found that the operatives of terrorist leaders commit more indiscriminate violence than their leaders favor.

ML Applications
Luo and Qi (2021) used a random forest model to rank factors in terrorist attack risk and found that human loss ranked highest. Huamaní et al. (2020) applied ML to the global terrorism database (GTD) to predict terrorist attacks worldwide. In related work, Guo et al. (2007) introduced a visualization environment to identify patterns in the GTD. Huff and Kertzer (2018) used ML to predict how the public classified incidents as terrorism based on language used in media coverage. In similar work, Das and Das (2019) used NLP to extract paraphrases from a large untagged text corpus to discover labels of crime reports. Mashechkin et al. (2019) proposed language-independent algorithms to extract information from the Internet that contained terrorist or extremist patterns. Khalifa et al. (2019) applied association mining to the GTD to understand the nature of terrorist attacks in Egypt. Burnap and Williams (2014) found that statistical modeling and ML were effective in classifying "hate speech" on Twitter, as measured by an F1 score of 0.95. Bassetti et al. (2018) used a Generalized Mixed Effects Regression Tree analysis to analyze economic correlates of Islamist political violence. Uddin et al. (2020) compared the performance of ML models in predicting attack success, involvement of suicide, weapon type, attack type, and region. Canhoto (2021) leveraged ML for counterterrorism by detecting and preventing money laundering, which is a key tool in terrorist operations. Hao et al. (2019) applied random forest to the GTD for spatio-temporal pattern discovery of terrorism incidents on the Indochina Peninsula. Mishra et al. (2020) applied directed graphs to the GTD for network relationship discovery of terrorist activities. Feng et al. (2020) applied Extreme Gradient Boosting to the GTD to predict casualties from terrorist attacks.

Methodology
This section introduces the dataset used, the development of an empirical topic classification method to extract relevant features from motive narratives, and the predictive ML models used to validate the association of event summaries with the theorized PACs.

Data
The 2019 release of the Global Terrorism Database (GTD) contained 201,183 records of terrorist events from 1970 to 2019 (START 2020). Of the 135 fields containing information about each event, 42 remained after removing fields that were missing more than 35% of the data. Eighteen of the 42 fields contained textual data, but only two were relevant to the NLP procedures. The two relevant text fields were "summary", which contained a brief description of the terrorist event, and "motive", which briefly described the motive, if known, as stated in media reports. Therefore, the reported motive is akin to a "stated preference" of what the perpetrators say was the aim of their attack. Figure 1 summarizes the attack frequency aggregated by five-year blocks from 1970 to 2019, and by the portion with motives known. Only 10,747 or 5.3% of the records described reported motives, based on responsibility claims by the perpetrators. The trend was that the number of records with known motives increased to a stable level after 1998. Therefore, this analysis focused on records with claimed motives for the 20 years spanning 1999 through 2019. This filter yielded 9460 records, which represented only 12% fewer records than all records containing known motives.
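The filtering steps described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the field names `iyear` and `motive` and the flat dictionary rows are assumptions about the GTD layout, and the narrative fields are retained explicitly regardless of their missing-value rate.

```python
def filter_gtd(rows, missing_threshold=0.35, year_range=(1999, 2019)):
    """Drop fields missing more than 35% of values, then keep records
    with a non-empty motive within the selected year range."""
    n = len(rows)
    fields = [f for f in rows[0]
              if sum(1 for r in rows if not r.get(f)) / n <= missing_threshold]
    # Retain the two narrative fields regardless, since they drive the NLP steps.
    for narrative in ("summary", "motive"):
        if narrative not in fields:
            fields.append(narrative)
    kept = []
    for r in rows:
        year = int(r.get("iyear", 0))
        if r.get("motive") and year_range[0] <= year <= year_range[1]:
            kept.append({f: r.get(f, "") for f in fields})
    return fields, kept
```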


Topic Classification
NLP is a subset of the field of artificial intelligence (Aggarwal 2015). Topic modeling is an NLP method that requires both art and science to identify and quantify the mix of topics within a document (Lane et al. 2019). Latent Dirichlet Allocation (LDA) is the most prominent among several techniques currently available (Padmaja et al. 2018). LDA identifies topics based on a multinomial distribution of words and assumes that each document covers a distinct topic. The available topic modeling algorithms do not accommodate user inputs about application context or topic names. After applying LDA to the motive narrative field of the GTD, the algorithm identified topics based on groups of frequent words. Empirical analysis of the word groups by the author revealed that they tended to associate with weapon types, attack types, and perpetrator groups, each with some mix of probabilities. Hence, LDA did not help the author to identify PACs.

Given these deficiencies, the author invented an empirical topic classification (ETC) technique that combined NLP methods with human cognition to identify word features, resulting in a more meaningful identification of PACs. Figure 2 shows the workflow, including custom and existing NLP procedures. The data processing layer extracted motive narratives from the selected year range of records. The author then initialized a list of potential PACs based on expertise, data sources listed in the GTD, keywords gleaned from the literature, and keywords selected from the motive narratives. The text mining layer then applied NLP procedures to normalize the text and extract relevant word features. The lowercase transformation prevented word redundancy, and the tokenize procedure extracted individual words into a feature vector that represented the narrative of each record. The stemming procedure reduced feature dimensionality by transforming all inflected forms of a word to their lexical stem (Jones and Willett 1997). The stop words filter eliminated frequently used words such as "the", "on", "at", "which", "and", and "but" to reduce noise from words that lack meaning. The outlier word filter further reduced noise by allowing the downstream word sort procedure to highlight meaningful words that appear in many of the documents. The bag-of-words (BOW) model extracted the document frequency (DF) of words based on the number of documents in the corpus that contained those words.

With each iteration of the procedural flow, the BOW procedure of the ETC framework helped human cognition to assign a PAC to the unlabeled targets based on DF rank, which rapidly reduced the number of unassigned document targets with each iteration. This strategy also helped human cognition to more easily identify additional stop words and keywords associated with a PAC. The Corpus View procedure sorted the word features of the unlabeled documents by their DF and highlighted the context of a selected word across all documents. This strategy helped human cognition to focus the ETC on the most frequent words and to achieve an exponential convergence of label assignments. The iterations stopped when the most frequent remaining words appeared in less than 0.1% of the documents. The rationale for this termination condition was that infrequent words add noise to the ML process, which could lead to model over-fitting in the subsequent ML layer.

In summary, the ETC workflow was a machine-aided human decision process that leveraged the domain knowledge and experience of the author to help assign appropriate PACs to motive narratives. It would have been difficult to optimize the number of PACs without using human cognition to understand the meanings of words within the context used and to grasp similarities in meaning among words that can characterize a motive narrative. The feedback loop between the text mining and ETC layers indicates where the cognition process injected decisions about combining or expanding the selected PACs.
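The text-mining steps above (lowercasing, tokenizing, stemming, stop-word filtering, and document-frequency counting) can be sketched as follows. The stop-word list and the naive suffix-stripping stand-in for a real stemmer are illustrative simplifications, not the procedures used in the study.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "on", "at", "which", "and", "but", "a", "of", "to", "in"}

def stem(word):
    # Naive suffix stripping as a stand-in for a true lexical stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def document_frequency(narratives):
    """Count, for each stemmed token, the number of documents containing it."""
    df = Counter()
    for text in narratives:
        tokens = re.findall(r"[a-z]+", text.lower())   # lowercase + tokenize
        features = {stem(t) for t in tokens if t not in STOP_WORDS}
        df.update(features)                            # one count per document
    return df
```

Sorting the resulting counter by frequency mirrors the word sort procedure that highlights candidate keywords for the next labeling iteration.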

Machine Learning
The ETC workflow discussed above helped the author to assign class labels to each record based only on the motive narrative. The ML process reused the text mining procedures to extract word features from the event summary narrative, which is a different field in the dataset. Unlike the motive field, the event summary narrative described the actions taken and the outcomes of attacks. The ML methods used features extracted from the event summary narrative to predict the PACs, which were the labels derived from the motive narrative. Therefore, good predictive performance would validate the association of the assigned PACs with the event narrative. Figure 3 shows the workflow of the data, text mining, ML, and output layers. The bag-of-words (BOW) model simplified the representation of each narrative as an array of unordered words derived from the corpus (Aggarwal 2015). Hence, the input to the ML model is a data table where each row contains a word array (features) with an associated class label assigned by the ETC.
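The ML input table described above can be illustrated with a short sketch that pairs each summary narrative's word counts with its ETC-assigned PAC label. The whitespace tokenizer and fixed vocabulary here are simplified assumptions, not the study's pipeline.

```python
def bow_table(narratives, labels, vocabulary):
    """Build the ML input table: one row of word counts per narrative,
    paired with the PAC label assigned by the ETC."""
    table = []
    for text, label in zip(narratives, labels):
        tokens = text.lower().split()
        row = [tokens.count(word) for word in vocabulary]  # unordered word counts
        table.append((row, label))
    return table
```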
The cross-validation (K-fold) procedure cyclically partitioned the narratives into ten subsets. With each iteration, the model output was a prediction error rate for one of the partitions after training on the union of the remaining partitions. The advantage of the k-fold cross-validation approach over the traditional single split of the data into a training and a testing set is that all of the data served as test data, which minimized bias and maximized model generalization. The performance evaluation procedure reported the average value of each test metric across all train-test cycles. This cyclical method also helped the author to balance model complexity by adjusting the hyperparameters to trade off over-fitting and under-fitting to the training data. That is, cross-validation helps to balance model complexity for generalization on unseen data (Géron 2017).
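The train-on-the-rest, test-on-the-holdout cycle can be sketched generically. The strided split below is a simplified stand-in for the stratified sampling used in the study, and the trainer and scorer are passed in as hypothetical callables.

```python
def k_fold_scores(features, labels, train_fn, score_fn, k=10):
    """Cyclically hold out each of k partitions for testing and train on
    the union of the remaining partitions; return the per-fold scores."""
    folds = [list(range(i, len(features), k)) for i in range(k)]  # strided split
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(features)) if i not in held_out]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        scores.append(score_fn(model,
                               [features[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return scores
```

Averaging the returned scores gives the reported cross-validated metric.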
The hyperparameter tuning procedure iteratively adjusted the various model parameters within a defined range until the selected performance metric showed no further improvement. The training procedures used the one-versus-rest (OvR) method to transform the multiclass classification problem into a binary classification problem with respect to each class (Géron 2017). This strategy enabled the use of models that natively implement only binary classification. Hence, it became easier to use identical metrics to compare the performance of all models. Table 1 summarizes the fundamental theory of operation for each of the 11 ML models used to validate the classification of PACs derived by machine-aided human decision. The last column of Table 1 includes references that expand on the details of the mathematical formulation and practical implementation of each ML model.

Table 1. Machine learning models applied in the procedural framework.

Model | Description | Reference
Logistic Regression (LR) | Fits the data to a logistic function of the linear combination of attributes to estimate the probability of a binary class. | Aggarwal (2015), Géron (2017)
Stochastic Gradient Descent (SGD) | Fits a linear multivariate function to the data by randomly selecting data instances to calculate parameter updates that minimize a selected loss function. | Géron (2017)
Decision Tree (DT) | Grows a logic tree by recursively splitting nodes to maximize the purity of child or leaf nodes. | Géron (2017), Hastie et al. (2016)
Random Forest (RF) | Grows many shallow and partial decision trees by randomly selecting a subset of attributes and data to split nodes, and then uses a majority vote to predict the class. | Aggarwal (2015), Breiman (2001)
AdaBoost (ADB) | Sequentially builds shallow decision trees (stumps) that improve on the prediction errors of previous trees, and then uses a majority vote to predict the class. | Géron (2017)
Multi-layer Perceptron (MLP) | A feed-forward and fully connected artificial neural network that learns a function with one or more inner layers of neurons. | Veen and Leijnen (2016)
Naïve Bayes (NB) | Uses Bayes probability theory to predict a class given the observed set of features, assuming that the features are independent. | Aggarwal (2015), James et al. (2013)
k-Nearest Neighbors (kNN) | Predicts a class based on the majority vote of its k-nearest neighbors in feature space. |
Extreme Gradient Boosting (XGB) | Sequentially grows decision trees that correct the errors of previous trees by gradient descent on a regularized loss function. | Chen and Guestrin (2016), Feng et al. (2020)

The performance evaluation procedure used five scoring metrics: classification accuracy (CA), precision (PR), recall (RC), F1-score, and ROC-AUC score. Each metric used some combination of the true-positive (TP), true-negative (TN), false-positive (FP), and false-negative (FN) counts of the classification result. The CA was the proportion of correct predictions. The PR was the proportion of positive predictions that were correct: PR = TP/(TP + FP). The RC, a measure of sensitivity, was the proportion of true positives that the classifier recalled: RC = TP/(TP + FN). F1 was the harmonic mean of PR and RC: F1 = TP/(TP + α), where α = (FN + FP)/2. Each metric is useful for interpreting the performance of a classifier, and they complement each other. CA is simple to compute and provides an intuitive sense for comparing the accuracy of each method. However, CA does not effectively characterize the intelligence of a classifier because it can produce a high score for data with high class imbalance if the model predicts the dominant class each time. CA also does not provide any information about the types of errors that a classifier made to help the model designer trade off FP and FN errors. Therefore, the model designer uses the PR, RC, and F1 scores to evaluate the impact of classifier decision boundary thresholds. Although the CA, PR, RC, and F1 scores are simple and useful, they do not provide an unbiased result under high class imbalance. Therefore, the performance evaluation was based on the ROC-AUC score, which integrates the TP rate as a function of the FP rate across a range of sensitivity thresholds to avoid bias from class imbalance (Fawcett 2006).
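The metric definitions above can be checked directly from a confusion matrix; the counts in the usage example are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    ca = (tp + tn) / (tp + fp + fn + tn)   # proportion of correct predictions
    pr = tp / (tp + fp)                    # PR = TP / (TP + FP)
    rc = tp / (tp + fn)                    # RC = TP / (TP + FN)
    f1 = tp / (tp + (fn + fp) / 2)         # harmonic mean of PR and RC
    return ca, pr, rc, f1
```

For example, with tp=6, fp=2, fn=4, tn=8, the F1 value from this formula matches the usual harmonic-mean form 2·PR·RC/(PR + RC).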
The ROC-AUC stands for "area under the curve" of the receiver operating characteristic (ROC), where the "curve" is a two-dimensional plot of the TP rate against the FP rate, both as a function of the class membership probability threshold. For brevity, the remainder of this article refers to the ROC-AUC score as simply the AUC score. The AUC provides a relative comparison of classifier performance across a range of decision thresholds. Users can better interpret CA than AUC when evaluating the confidence of a class prediction. However, the ROC can guide threshold selection to maximize CA, PR, RC, and F1 performance. The analysis also included the scores of a null classifier to serve as a baseline for comparing relative performance. The null classifier is a no-skill model that predicts the dominant class every time.
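The AUC computation described above amounts to sweeping the class-membership threshold over every observed score and integrating the TP rate against the FP rate. A minimal sketch, ignoring tied scores and assuming binary 0/1 labels:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve: lower the decision threshold one scored
    point at a time and accumulate area under the current TP rate."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Descending score order: each step admits one more point as "positive".
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tpr, area = 0.0, 0.0
    for _, label in ranked:
        if label == 1:
            tpr += 1 / pos           # one more true positive recovered
        else:
            area += tpr * (1 / neg)  # rectangle under the current TP rate
    return area
```

A perfect ranking yields 1.0, while the no-skill null classifier described above corresponds to 0.5.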

Results
The next two subsections present the results of the ETC and the ML classification to validate the accuracy of associating the summary narratives with PACs.

Topic Classification
The author terminated the ETC procedure of Figure 2 after labeling 8400 of the 9460 records, at which point the most frequent remaining words appeared in less than 0.1% of the unassigned records. Table 2 summarizes the six PACs identified and provides a definition for each. Table 2 also lists the number of motive narratives or documents (D) classified into each PAC and the number of word features (F) identified for the collection of documents within each PAC. The last column of Table 2 shows some notable word features and their frequency of occurrence across documents within that PAC. The Corpus View helped the author to both identify and verify the context of each keyword and to reconcile them with the assigned PAC. Figure 4 shows the regional makeup of each PAC by the number of attacks assigned to that aim category. Note that this chart shows the aim categories for only those terrorist events with motives reported, not all terrorist events. Table 3 complements Figure 4 by providing details of the world region proportions that make up each PAC.

[Table 2 fragment: columns are PAC, Definition, and Corpus Motive Narrative Keywords and Their Document Frequency; the first row begins: Retaliate — reaction to an action or situation.]

ML Classification
Table 4 summarizes the performance metrics for each model, ordered by their average AUC score from the OvR binary classification. Table 4 also shows the hyperparameter settings as [a:b, x], where a:b represents the range of values evaluated using a combined grid-search and manual search method, and x is the tuned value of the final classifier. The training and tuning procedures used 10-fold cross-validation and stratified sampling. Common hyperparameters for several of the models were the learning rate (L), loss function (LF), regularization (R) setting, and optimizer algorithm (OA). The hyperparameters for tree-based methods were the number of decision trees (N) and the minimum number of samples to retain in the leaves. For kNN, N was the number of nearest neighbors. The XGB algorithm provided the best performance across all metrics. Table 5 summarizes the performance metrics of the XGB algorithm for each PAC in the OvR classification. Figure 5 provides a visual validation of the differences in AUC among the classifications. The gray background highlights the area that is equivalent to the AUC. The figure shows the ROC mean (green curve) and variations (error bars) for the 10 cross-validation results of the XGB algorithm when predicting the PAC presence for "Weaken" and "Despise," which represent the best and worst AUC scores, respectively. The TP rate of the ROC curve for "Weaken" increases more rapidly than that of the ROC curve for "Despise," which has a less sharp curve. For reference, Figure 5 includes the performance of the Null model, which is the straight orange line.
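A tuning setup of this kind can be sketched as follows, assuming scikit-learn; its GradientBoostingClassifier stands in for XGB here, and the dataset is synthetic rather than the event-summary features.

```python
# One-vs-rest classification with grid-searched hyperparameters under
# stratified 10-fold cross-validation (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=300, n_classes=3, n_informative=6,
                           random_state=0)

# Grid over the learning rate (L) and number of trees (N), scored by the
# average AUC across the OvR binary splits.
grid = GridSearchCV(
    OneVsRestClassifier(GradientBoostingClassifier(random_state=0)),
    param_grid={"estimator__learning_rate": [0.1, 0.3],
                "estimator__n_estimators": [50, 100]},
    scoring="roc_auc_ovr",
    cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

In practice the grid would span the ranges a:b reported in Table 4, with a manual search refining the final value x.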

Discussion
The AUC values ranged from 0.624 for the SVM model to 0.900 for the XGB model, with the Null model achieving only 0.499. The ROC uses the entire dataset to compute probability rates (TP and FP) between 0 and 1 as a function of decision thresholds applied to the output of each classifier. Therefore, the AUC is inherently a probabilistic score, which means that decision makers can interpret differences in AUC values on a probability scale when selecting models. These results indicate that the six PACs accurately classify the summary narratives of the events. The RF, LR, MLP, and GB models provided similarly satisfactory performance, and the Null model provided the worst AUC performance, as expected.
Although in a statistical sense the six models of XGB, RF, LR, MLP, NB, and GB provided similar performance, the consistent albeit slight performance edge of XGB across all scores is worth an interpretation. As explained in the introduction, an important tenet of ML is that no single model type can best represent all datasets (Aggarwal 2015). For example, SVM does best when there is a clear separation of classes across hyperplanes, whereas k-NN does well when classes form clusters in feature space (Géron 2017). XGB incorporates considerable randomness as it sequentially builds an ensemble of models to reduce the errors of previous models (Natekin and Knoll 2013). Therefore, XGB is more likely to find class associations in datasets that lack clear hyperplane separation or clustering because it can discover them through randomized search. The original reference described the algorithm as "especially appropriate for mining less than clean data" (Friedman 2001).
The CA performance of the XGB classification ranged from 0.860 to 0.943 across the PACs (Table 5). The table summarizes the mean, standard deviation (STD), and coefficient of variation (CV) for each score. This result indicated that even though XGB made more prediction errors for some classes than others, the performance spread as measured by CV was less than 5%, suggesting consistent performance across all classes. The variations in CA for each target class were partially due to imbalance in the number of training examples across the dataset (Table 2). For example, the PAC of "Retaliate" had 2956 documents, whereas "Weaken" had 76% fewer documents to serve as examples in the training. Another reason for the differences was the inherent variation due to randomness in the cross-validation processes.
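For reference, the CV is simply the standard deviation divided by the mean; the per-PAC accuracy values below reuse the reported endpoints (0.860 and 0.943) with invented intermediate values, not the actual Table 5 entries.

```python
# Coefficient of variation (CV) over illustrative per-PAC accuracy scores.
import numpy as np

ca_per_pac = np.array([0.860, 0.879, 0.901, 0.915, 0.928, 0.943])
mean = ca_per_pac.mean()
std = ca_per_pac.std(ddof=1)   # sample standard deviation
cv = std / mean                # spread relative to the mean
print(f"mean={mean:.3f} std={std:.3f} CV={cv:.1%}")
```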
The main advantage of the ETC over the available topic modeling techniques was the accommodation of human cognition into the process to yield a meaningful set of PACs. Hence, the workflow enabled a machine-aided human decision process. Consequently, the process was naturally slower than a fully automated statistical method; the tradeoff for more meaningful topic classification was speed. For example, using the same computing resources, LDA took a few minutes whereas the ETC took an entire day. As discussed previously, however, LDA did not provide meaningful results whereas the ETC did. In this application, the ETC proved effective because of the relatively small size of the dataset and the similarities of contextual keywords across many documents.
A potential limitation of the approach is that the motives and summary narratives of the GTD may have reporting biases. However, it is unlikely that any potential bias would have been systematic and consistent across all reports. Machine learning algorithms continue to evolve in their capabilities. The framework discussed is sufficiently general to include additional model types when they become available, but it is possible that some approaches may require a modification to the framework for compatibility. General users may be more familiar with CA as an evaluation metric. Hence, the framework may require advanced users to evaluate other performance metrics such as F1 and AUC to guide more informed decision-making and interpretation.
One interpretation of Figure 4 was that, among known motives, aims to "retaliate" were revealed by terrorist responsibility claims nearly four times more often than aims to "weaken" the target. This suggests that terrorists were more likely to claim credit for retaliations than to advertise aims to weaken or distract their enemies. Another interesting observation is that PACs were more generally known for attacks in the Middle East, North Africa, and South Asia than for other regions of the world.
A final observation is that the six PACs represent two general need categories. The first need is to instigate change by weakening the target, forcing a situation, or intimidating the opposition. The second need is to express feelings through contempt ("despise"), protest, or retaliation. These findings, with the help of the two AI methods (NLP and ML), add to the diversity of knowledge about terrorist behaviors.

Conclusions
A comprehensive review of the research on terrorism revealed a dominant assumption that terrorists attack civilians to maximize political ends. However, "revealed preference" by the reported motives of terrorist attacks points to a more diverse set of aims. This research applied two artificial intelligence (AI) techniques to help the author distill perpetrator aims from a comprehensive record of terrorist attacks across the world. The method helped the author to identify six aim categories, which were to despise, protest, retaliate, weaken, force, and intimidate.
The performance of the machine learning (ML) techniques validated that the six perpetrator aim categories (PACs), derived by a machine-aided human decision process, accurately classified the data. The method first applied natural language processing (NLP) techniques in an empirical topic classification workflow that extracted text features from the available motive narratives of terrorist events. The method then applied 11 distinct types of ML models to text features extracted from the summary narrative of each event, a field different from the motive narrative, which was not involved in the model training. The six best-performing ML models predicted PACs with accuracies ranging from 86% to 94% and corresponding AUC scores ranging from 86% to 93%. The Extreme Gradient Boosting model provided the best predictive performance with AUC, classification accuracy, and F1 scores of 0.90, 0.91, and 0.90, respectively. These values provide a probabilistic confidence level of the classification results relative to the outcome from other models evaluated. The level of confidence can inform resource allocation decisions to address an aim identified from terrorist event reports.
A common strategy in counterterrorism is to diminish incentives, which must be known beforehand. Hence, the intelligence community can view the PACs as an incentive structure to help identify and customize strategies for deterrence. To further the research agenda in counterterrorism, future work will examine if patterns of terrorist activities can be associated with one of the six aim categories. Future work will also assess possible reasons for the dominance of certain PACs in different regions of the world.