A Largely Unsupervised Domain-Independent Qualitative Data Extraction Approach for Empirical Agent-Based Model Development

Abstract: Agent-based model (ABM) development needs information on system components and interactions. Qualitative narratives contain contextually rich system information beneficial for ABM conceptualization. Traditional qualitative data extraction is manual, complex, and time- and resource-consuming. Moreover, manual data extraction is often biased and may produce questionable and unreliable models. A possible alternative is to employ automated approaches borrowed from Artificial Intelligence. This study presents a largely unsupervised qualitative data extraction framework for ABM development. Using semantic and syntactic Natural Language Processing tools, our methodology extracts information on system agents, their attributes, and actions and interactions. In addition to expediting information extraction for ABM, the largely unsupervised approach also minimizes biases arising from modelers' preconceptions about target systems. We also introduce automatic and manual noise-reduction stages to make the framework usable on large semi-structured datasets. We demonstrate the approach by developing a conceptual ABM of household food security in rural Mali. The data for the model contain a large set of semi-structured qualitative field interviews. The data extraction is swift, predominantly automatic, and devoid of human manipulation. We contextualize the model manually using the extracted information. We also put the conceptual model to stakeholder evaluation for added credibility and validity.


Introduction
Qualitative data provide thick contextual information [1][2][3][4] that can support reliable complex system model development. Qualitative data analysis explores system components, their complex relationships, and behavior [3][4][5] and provides a structured framework that can guide the formulation of quantitative models [6][7][8][9][10]. However, qualitative research is complex and time- and resource-consuming [1,4]. Data analysis usually involves keyword-based data extraction and evaluation that requires multiple coders to reduce biases. Moreover, model development using qualitative data requires multiple, lengthy, and expensive stakeholder interactions [11,12], which adds to its inconvenience. Consequently, quantitative modelers often avoid using qualitative data for their model development. Modelers often skip qualitative data analysis or use unorthodox approaches for framework development, which may fail to capture the complex dynamics of target systems and produce inaccurate and unreliable outputs [13].
The development in the information technology sector has substantially increased access to qualitative data over the past few decades. Harvesting extensive credible data is crucial for reliable model development. Increased access to voluminous data presents a

Background, Materials, and Methods
Software engineers have been exploring various supervised and unsupervised approaches for information extraction. In supervised approaches, syntactical patterns are defined [42], and text is manually scanned for such patterns. In unsupervised information extraction, the machine does the pattern matching.
Supervised approaches are reliable but slow. In contrast, the faster unsupervised approaches are difficult and prone to errors, mainly due to word sense ambiguity [43]. A purely syntactical analysis cannot capture the nuanced meaning of texts, which is often the root of the problem. More recently, pattern matching has also come to involve semantic analysis. External databases of hierarchically structured words such as WordNet or VerbNet [44] and machine learning tools are increasingly used for understanding semantics and reducing word sense ambiguity [45,46]. NLP toolkits are increasingly used for unsupervised pattern matching and information extraction [47][48][49][50][51][52]. Tools such as lemmatizing, tokenizing, stemming, and part-of-speech tagging [53] are helpful for syntactic information extraction. These tools can normalize texts and identify the subjects and main verbs of sentences.
The ability to convert highly unstructured texts to structured information through predominantly unsupervised approaches is one of the main advantages of NLP in qualitative data analysis. NLP efficiently analyzes intertextual relationships using syntactic and semantic algorithms. Approaches such as word co-occurrence statistics and sentiment analysis [54] are beneficial for domain modeling [42] and for exploring contextual and behavioral information from textual data. Similarly, the efficiency of NLP in pattern matching for information extraction is essential for model development.
Although present for decades in object-oriented programming and database development [55,56], NLP has a minimal footprint in ABM development. The introduction of NLP in ABM development is very recent. Refs. [14,57] used NLP to model human cognition through word embedding, which is a contextually analyzed vector representation of a text. The procedure places closely related texts next to each other; specifically, it places agents with similar worldviews together to support theorizing about agent decision-making. Although their approach helps develop agent decision-making, it is not well-equipped for developing a comprehensive agent architecture.
Another example is the study by [58], who applied NLP in conjunction with machine learning to create an ABM structure from unstructured textual data. In their framework, texts are translated to the agent-attribute-rule framework. They define agents as nouns (e.g., person and place) that perform some actions and attributes as words that represent some variables. Similarly, sentences containing agents or attributes and action verbs are considered rules. The primary goal of their approach is to create an ABM structure mainly for communicating the model to non-modelers.
As with machine learning approaches in general, Padilla et al.'s approach required a large amount of training data. They used ten highly concise, formally written ABM descriptions from published journals as training datasets. Another limitation, according to the researchers, was the lack of precise distinctions between agents and attributes: attributes could also be nouns, which might confuse the machine. Repeated training with extensive data effectively increased the accuracy of the agent-attribute-rule detection. However, the model frequently resulted in underpredictions and overpredictions.
Our work is of significant importance in the context of ABM development, particularly in relation to the utilization of machine learning and artificial intelligence. Recent research papers [30,59,60] also emphasized the role of these technologies in this domain. Ref. [60] goes as far as to argue that natural language processing (NLP) can potentially replace the conventional method of developing ABMs, which heavily relies on field interviews. This perspective highlights the relevance and timeliness of our work, as we effectively incorporated machine learning and artificial intelligence techniques into our ABM development process.
Additionally, we discussed the relevance of prior works such as [14,57,58], which exhibit similarities to our approach. However, these studies lack certain aspects of model development that our proposed methodology aims to address. For instance, ref. [58] relied on extensive training datasets and struggled to differentiate between agents and attributes effectively, whereas our methodology overcomes these limitations. Furthermore, unlike Runck's approach, which primarily focuses on developing agents' decision-making abilities, our approach strives to create a comprehensive agent architecture.

The Proposed Framework
In response to these limitations, our study proposes and tests a largely unsupervised domain-independent approach for developing ABM structures from informal semi-structured interviews using Python-based semantic and syntactic NLP tools (Figure 1). The method primarily uses syntactic NLP approaches for information extraction directly to the object-oriented programming (OOP) framework (i.e., agents, attributes, and actions/interactions) using widely accepted approaches in database design and OOP [61]. Database designers and OOP programmers generally exploit the syntactic structure of sentences for information extraction. Syntactic analysis usually treats the subject of a sentence as a class (an entity for a database) and the main verb as a method (a relationship for a database). Since the approach is not based on machine learning, it does not require large training data. The semantic analysis is limited to external static datasets such as WordNet (https://wordnet.princeton.edu/) (accessed on 8 July 2020) and VerbNet (https://verbs.colorado.edu/verbnet/) (accessed on 21 July 2020).
In the proposed approach, information extraction includes systems agents, their actions, and interactions from qualitative data for model development using syntactic and semantic NLP tools. As our information extraction approach is primarily unsupervised and does not require manual interventions, we argue that, in addition to being efficient, it reduces the potential for subjectivity and biases arising from modelers' preconceptions about target systems.
The extracted information is then represented using Unified Modeling Language (UML) for an object-oriented model development platform. UML is a standardized graphical representation of software development [62]. It has a set of well-defined class and activity diagrams that effectively represent the inner workings of ABMs [63]. UML diagrams represent system classes, their attributes, and actions. Identified candidate agents, attributes, and actions were manually arranged in the UML structure for supporting model software development. Although there are other forms of graphical ABM representations such as Petri Nets [64], Conceptual Model for Simulation [29], and sequence and activity
In our approach, model development is mainly unsupervised and involves the following steps (Figure 1):
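1. Data cleaning (removal of non-informative content);
2. Text normalization;
3. Data volume reduction;
4. Tagging and information extraction;
5. Supervised contextualization and evaluation;
6. UML/model conceptualization;
7. Model evaluation.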
Steps one and two are required since semi-structured interviews often contain redundant or inflected texts that can bog down NLP analysis. Hence, removing non-informative contents from large textual data is highly recommended at the start of the analysis. NLP is well-equipped with stop words removal tools that can effectively remove redundant texts. Similarly, tools such as stemming and lemmatizing help normalize texts to their base forms [67].
Step three is data volume reduction, which can tremendously speed up NLP analyses. Traditional volume reduction approaches usually contain highly supervised keyword-based methods. Data analysts use predefined keywords to select and extract sentences perceived to be relevant [68]. Keyword identification generally requires a priori knowledge of the system and is often bias-prone. Consequently, we recommend a domain-independent unsupervised Term Frequency Inverse Document Frequency (TFIDF) approach [69] that eliminates manual keyword identification requirements. The approach assigns weights to individual words based on their uniqueness and machine-perceived importance. The TFIDF differentiates between important and common words by comparing their frequency in individual documents and across entire texts. Sentences that have high cumulative TFIDF scores are perceived to have higher importance. Given a document collection D, a word w, and an individual document d ∈ D, TFIDF can be defined as follows:
$$w_d = f_{w,d} \cdot \log\left(\frac{|D|}{f_{w,D}}\right)$$
where f_{w,d} equals the number of times w appears in d, |D| is the size of the corpus, and f_{w,D} equals the number of documents in which w appears in D [69].
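The weighting and the cumulative sentence score can be sketched in a few lines of plain Python. This is only an illustration, assuming the common form of the weight, f_{w,d} × log(|D| / f_{w,D}), with each interview treated as one document; the toy corpus and the function names (tokenize, tfidf_weights, rank_sentences) are hypothetical and are not the code used in the study, which relied on scikit-learn.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_weights(documents):
    """Return one {word: weight} dict per document, with weight = f_wd * log(|D| / f_wD)."""
    token_lists = [tokenize(doc) for doc in documents]
    # f_{w,D}: number of documents in which each word appears
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(documents)
    weights = []
    for tokens in token_lists:
        term_freq = Counter(tokens)  # f_{w,d}: occurrences of each word in this document
        weights.append(
            {w: f * math.log(n_docs / doc_freq[w]) for w, f in term_freq.items()}
        )
    return weights


def rank_sentences(document, word_weights):
    """Score each sentence by the sum of its word weights, highest score first."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    scored = [
        (sum(word_weights.get(w, 0.0) for w in tokenize(s)), s) for s in sentences
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy corpus: each string stands in for one interview.
interviews = [
    "We grow cotton. We eat porridge.",
    "We buy food. We eat porridge.",
]
weights = tfidf_weights(interviews)
ranking = rank_sentences(interviews[0], weights[0])
```

In this toy corpus, words that occur in every interview receive a weight of zero, so the sentence built from words unique to one interview ranks first. Retaining only the top-ranked sentences is one way to reduce the data volume without predefined keywords.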
Step four involves tagging and information extraction. Once the preprocessed data are reduced, we move to tagging agents, attributes, and actions/interactions that can occur. We propose the following approaches for tagging agent architecture:
Candidate agents: Following the conventional approaches in database design and OOP [61], we propose identifying the subjects of sentences as candidate agents. For instance, the farmer in 'the farmer grows cotton' can be a candidate agent. NLP has well-developed tools such as part-of-speech taggers and named-entity taggers that can be used to detect subjects of sentences.
Candidate actions: The main verbs of sentences can become candidate actions. The main verbs need candidate agents as the subjects of the sentences. For example, in the sentence 'the farmer grows cotton,' the farmer is the subject of the sentence and a candidate agent; grows is the main verb and, hence, a candidate action.
Candidate attributes: Attributes are properties inherent to the agents. Sentences containing candidate agents as subjects and be or have as their primary (non-auxiliary) verbs provide attribute information, e.g., 'the farmer is a member of a cooperative,' and 'the farmer has 10 ha of land.' Additionally, the use of possessive words also indicates attributes, e.g., the cow in the sentence 'my cow is very small' is an attribute.
Candidate interactions: Main verbs indicating relationships between two candidate agents are identified as interactions. Hence, sentences containing two or more candidate agents provide information on interactions, e.g., 'The government trains the farmers.'
Since the data tagging is strictly unsupervised, false positives are likely to occur. The algorithm can over-predict agents, as the subjects of all the sentences are treated as candidate agents. In ABM, however, agents are defined as autonomous actors: they act and make decisions. Hence, we propose to use a hard-coded list of action verbs (e.g., eat, grow, and walk) and decision verbs (e.g., choose, decide, and think) to filter agents from the list of candidate agents. Only the candidate agents that use both types of verbs qualify as agents. Candidate agents not using both types of verbs are categorized as entities that may be subjected to manual evaluation. Similarly, people use different terminologies that are semantically similar. We recommend using external databases such as WordNet to group semantically similar terminologies.
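The tagging rules and the verb-based agent filter described above can be illustrated with a minimal sketch. It assumes that (subject, main-verb lemma, object) triples have already been produced by a dependency parser; the triples, the abbreviated verb sets, and the function names (tag_triples, filter_agents, find_interactions) are hypothetical and do not reproduce the study's code. As a simplification, interactions are detected here between filtered agents rather than between all candidate agents.

```python
from collections import defaultdict

# Hypothetical, abbreviated verb lists; a real list would be much longer.
ACTION_VERBS = {"eat", "grow", "walk", "train", "buy"}
DECISION_VERBS = {"choose", "decide", "think"}
ATTRIBUTE_VERBS = {"be", "have"}


def tag_triples(triples):
    """Sort (subject, verb lemma, object) triples into attributes and actions per subject."""
    attributes = defaultdict(list)
    actions = defaultdict(set)
    for subject, verb, obj in triples:
        subject_actions = actions[subject]  # registers every subject as a candidate agent
        if verb in ATTRIBUTE_VERBS:
            attributes[subject].append((verb, obj))
        else:
            subject_actions.add(verb)
    return attributes, actions


def filter_agents(actions):
    """Keep as agents only the subjects that use both an action verb and a decision verb."""
    agents, entities = set(), set()
    for subject, verbs in actions.items():
        if verbs & ACTION_VERBS and verbs & DECISION_VERBS:
            agents.add(subject)
        else:
            entities.add(subject)
    return agents, entities


def find_interactions(triples, agents):
    """Return the non-attribute triples whose subject and object are both agents."""
    return [
        t for t in triples
        if t[0] in agents and t[2] in agents and t[1] not in ATTRIBUTE_VERBS
    ]


# Hypothetical parser output for a handful of interview sentences.
triples = [
    ("farmer", "grow", "cotton"),
    ("farmer", "decide", "crop"),
    ("farmer", "have", "land"),
    ("government", "train", "farmer"),
    ("government", "decide", "price"),
    ("porridge", "be", "food"),
    ("cow", "eat", "grass"),
]
attributes, actions = tag_triples(triples)
agents, entities = filter_agents(actions)
interactions = find_interactions(triples, agents)
```

Subjects that never combine an action verb with a decision verb end up in the entity set, which mirrors the rule that such candidates are set aside for manual evaluation rather than discarded silently.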
Step five involves supervised contextualization and evaluation. While the unsupervised analysis reduces data volume and translates semi-structured interviews to the agent-action-attribute structure, noise can percolate to the outputs since the process is unsupervised. Additionally, the outputs need to be contextualized. Consequently, we suggest performing a series of supervised output filtration followed by manual contextualization and validation. The domain-independent unsupervised analysis extracts individual sentences that can sometimes be ambiguous or domain irrelevant. Hence, the output should be filtered based on ambiguity and domain relevancy. Once output filtration is performed, contextual structures can be developed and validated with domain experts and stakeholders.
The last two steps (UML/model conceptualization and model evaluation) are described in the following sections.
For this study, we used the Python 3.7 programming language (https://www.python.org/) (accessed on 10 May 2020) along with several NLP libraries (e.g., scikit-learn, NLTK, spaCy, and textacy) to perform data reduction, tagging, extraction, and structuration. Scikit-learn provides a wide range of machine learning algorithms for classification, regression, clustering, and dimensionality reduction tasks. Similarly, NLTK, spaCy, and textacy are useful for analyzing natural language data. We primarily used scikit-learn for dimensionality reduction and NLTK, spaCy, and textacy for tokenization and part-of-speech tagging.

Results and Discussion
We tested the above approach by developing a structural ABM of household food security using semi-structured field interviews, for example, the excerpt in Figure 2. Our qualitative data contain 42 semi-structured interviews from different members (young and old, male and female) of farming households in Koutiala, Southern Mali. The interviews were initially conducted to develop mental models of household food security in the region [70]. Verbal consent was obtained from the participants prior to the interviews. The interviews were originally conducted in the local Bambara dialect and then translated into English for model development. The mental model development followed the lengthy conventional qualitative data analysis approach that used multiple coders and keyword-based sentence extraction. That inspired the research team to develop a more efficient alternative data processing and extraction approach for ABM development, presented here.
First, we grouped the interviews by member type (i.e., elder male, younger male, elder female, and younger female) and analyzed the grouped narratives collectively. After preprocessing the interviews using NLTK tools, we used the scikit-learn TfidfVectorizer to reduce the volume of qualitative data. Textacy was primarily used for identifying candidate agents, actions, and attributes. Additionally, the textacy extraction function textacy.extract.semistructured_statements was used in converting sentences to structured outputs. Finally, we manually filtered the unsupervised outputs based on their domain relevancy and ambiguity. The final outputs were then visualized and conceptualized using the Gephi (https://gephi.org/) (accessed on 18 May 2020) and Lucid Chart (https://www.lucidchart.com/) (accessed on 21 May 2020) platforms (Figure 3).
As expected, the unsupervised tagging overpredicted the agents. Subjects that do not make decisions were also identified as candidate agents. To overcome the issue, we created an external database of action/decision verbs (Table 1) that somewhat addressed the problem.
Using the external database resulted in more than 60% reduction in the number of agents (e.g., Figure 4). The filtration process discarded the initially identified candidates, such as porridge, food, cereal, or farm. We also obtained multiple similar actions (synonyms). We used an external WordNet database to group semantically similar actions. The process resulted in a highly manageable and structured output for model conceptualization.
Table 1. Decision and action verbs.
Decision Verbs    Action Verbs
adhere            abandon
advise            accelerate
approve           accept
assess            access
obey              analyze
oblige            apply
plan              argue

Next, we used the extracted information to develop UML class diagrams (Figure 5) and contextual diagrams (Figure 6). The diagrams revealed that different members of households support household food security differently. Male members of the households are generally involved in farming. They grow cereal crops and vegetables and also engage in cash cropping, i.e., growing plants for selling on the market rather than subsistence farming to feed their families. Women principally look after household work and assist men in the fields. Households consume the food they produce. During food shortages, households seek help from their fellow villagers or buy food from the market. They use money obtained from cash crops to buy food. Additionally, household women are involved in small businesses that can support food purchases. For example, some households might need to rely on off-farm jobs or sell their livestock to buy food. Other organizations and credit agencies provide households with credits and support.
Following our framework, the conceptual model required evaluation. We applied model-to-model (M2M) comparison [71,72] and stakeholder validation. M2M involved comparing model output with the mental model of household food security developed using the same dataset, reported by [70]. We found that our approach captured all the essential components of household food security that were identified in the mental model.
Initially, we aimed to develop an efficient, bias-free, completely unsupervised information extraction approach for conceptualizing an ABM. However, after preliminary algorithm development, we realized that entirely unsupervised data processing and conceptualization is unrealistic with the current NLP capabilities. Therefore, we decided to use manual filtration and contextualization, which potentially introduced subjectivity and biases in model development. To address this deficiency, we performed a stakeholder validation to check for subjectivity and biases. To that end, we converted the contextual model to a pictorial representation (Figure 7) and brought it to the stakeholders (interviewees) for validation.
The stakeholders positively evaluated the model and acknowledged that it included all the principal dynamics of the household food system. They, however, pointed out that the contextualized structure did not provide the dynamics of the government and non-government actors. Since the input data only contained interviews from farm households, we failed to capture the dynamics occurring outside of the households. Consequently, the model revealed data gaps where more information needs to be gathered on the government and non-government actors involved in household food security.
The proposed unsupervised information extraction picked individual sentences based on their cumulative TFIDF weights. However, some of the individually extracted sentences lacked contextuality and were ambiguous. To add context and reduce this ambiguity, we used neighboring sentences during the unsupervised data extraction and processing phase (Figure 1). We hypothesized that extracting a tuple of preceding and trailing sentences along with the identified sentence can provide vital contextual information. For example, some of the extracted sentences contained pronouns. These pronouns were impossible to resolve without the information in the preceding sentences. Therefore, extracting the preceding sentence should help in resolving their references.
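Extracting a selected sentence together with its neighbors amounts to a simple windowing operation, sketched below under the assumption that each interview has already been split into an ordered list of sentences and that the indices of the high-scoring sentences are known. The function name with_context and the example sentences are hypothetical.

```python
def with_context(sentences, selected_indices):
    """Return (preceding, selected, trailing) tuples; None marks a missing neighbor."""
    windows = []
    for i in selected_indices:
        preceding = sentences[i - 1] if i > 0 else None
        trailing = sentences[i + 1] if i + 1 < len(sentences) else None
        windows.append((preceding, sentences[i], trailing))
    return windows


# Hypothetical interview fragment; the second sentence is the one selected by its score.
sentences = [
    "My father owns the farm.",
    "He grows cotton.",
    "We sell it at the market.",
]
windows = with_context(sentences, [1])
```

Here the preceding sentence supplies the referent of the pronoun in the selected sentence, which is the kind of context a reviewer needs when resolving pronouns by hand.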
NLP toolkits also offer a coreference resolution tool that automatically replaces pronouns with their referenced nouns. However, the tool is in development. We found that it generated too many errors that would require manual checks. Hence, we proceeded without using the tool, and the pronouns identified as agents were ignored.
Figure 7. Pictorial representation of the conceptual model presented in Figure 6.
Using our framework, we only collected information on agents, attributes, and actions/interactions. However, ABM also requires information on agent decision-making. Although using social and behavioral theories in defining agent decision-making is predominant, empirically derived decision-making frameworks are context-specific and, therefore, more desirable when ABMs are applied in real-world situations [15,24]. We realize that some sentences are particularly useful in deriving agent decision-making. Specifically, conditional sentences such as 'if it rains, we plant maize' and compound sentences such as 'when production is low, we buy food from the market' can reveal decision-making. Harvesting these sentences with semantics and machine learning approaches can open new avenues for formulating empirically based decision-making rules for ABM.
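As a rough illustration of how such sentences might be turned into candidate condition-action rules, the sketch below matches a leading 'if' or 'when' clause with a regular expression. This is a naive surface-pattern stand-in, not the semantic or machine learning approach envisaged above, and it was not part of the study; the pattern and the function name extract_rule are hypothetical.

```python
import re

# Hypothetical pattern: a leading "if" or "when" clause, a comma, then the consequent clause.
RULE_PATTERN = re.compile(r"^\s*(if|when)\s+(.+?),\s*(.+?)[.!]?\s*$", re.IGNORECASE)


def extract_rule(sentence):
    """Return a dict with trigger, condition, and action for a conditional sentence, else None."""
    match = RULE_PATTERN.match(sentence)
    if match is None:
        return None
    trigger, condition, action = match.groups()
    return {
        "trigger": trigger.lower(),
        "condition": condition.strip(),
        "action": action.strip(),
    }


rule_if = extract_rule("If it rains, we plant maize")
rule_when = extract_rule("When production is low, we buy food from the market.")
no_rule = extract_rule("We eat porridge.")
```

Sentences that do not open with a conditional clause return None and are left for other extraction steps. Conditions embedded mid-sentence or expressed without 'if' or 'when' would need a parser-based approach.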
It is important to note that the derived information is limited by the information contained in the input. For example, we noticed that agent tagging underpredicted agents after using the action and decision verbs. Entities such as 'father' and 'the government' should also be identified as agents of this particular system. However, some information was missed since these subjects did not use both types of verbs (action and decision) in the provided interviews. Additionally, stakeholders pointed out that our model structure did not include the dynamics of the governmental and non-governmental actors. This prompts a need for a careful analysis of entities that failed to qualify as agents in order to identify data gaps. Furthermore, the interviews went through different translation stages (from local dialects to French and English) that could have corrupted some of their original meanings.

Conclusions
Complexities, ambiguities, and difficulties in data processing often discourage ABM developers from using qualitative data for model development, preventing modelers from using rich contextual information about their target systems. ABMs are often developed using ad-hoc approaches, potentially producing models that lack credibility and reliability. We introduced a systematic approach for ABM development from semi-structured qualitative interviews using NLP to address these gaps. The proposed methodology contained a largely unsupervised, domain-independent, efficient, and bias-controlled data processing and extraction approach aimed at ABM conceptualization. We demonstrated its effectiveness by developing an ABM of household food security from large open-ended qualitative field interviews.
Additionally, we outlined some of the significant limitations of the approach and recommended improvements for future development. Our framework is only relevant to data-driven models that focus on applications and address specific geographic regions and localities. It is not aimed at theory-driven modeling, which requires generalizable observations, where other methods, such as metamodeling, are more appropriate. It is also important to note that the proposed framework was developed only to handle information derived from text. Future improvements should focus on algorithms and tools combining text-derived and quantitative information using data analytics tools. Moreover, our framework requires further testing and experimentation, for example, contrasting it with alternative approaches, which is one of the objectives of our future research. Hopefully, since the NLP development community is highly active, these limitations will soon be resolved, making semantic and syntactic NLP more effective for unsupervised information extraction and model conceptualization.
Although we could not fully develop a completely unsupervised approach, we successfully managed to reduce subjectivity and biases by limiting data extraction manipulation. Data processing and extraction were fully unsupervised, and manual inputs were only required towards the end of model conceptualization, limiting the opportunities for introducing human bias in model development. Furthermore, the unsupervised approach was much faster compared with manual coding.

Data Availability Statement:
The results presented in this manuscript were extracted from unprocessed (i.e., original) data collected through open-ended interviews from individual households. The interviews contain confidential information such as geographic location, employment, ethnicity, household structure, relationships, and interactions among household members, or number of dependents. The Institutional Review Board (Michigan State University) approval restricted the data access only to the research team. All subjects who provided the interviews provided their informed consent for inclusion before participating in the study. Consequently, the data are protected under the rights of privacy.