Causal Pathway Extraction from Web‑Board Documents

Abstract: This research aims to extract causal pathways, particularly disease causal pathways, through cause-effect relation (CErel) extraction from web-board documents. The causal pathways give people a comprehensible representation of disease complications. A causative/effect-concept expression is based on a verb phrase of an elementary discourse unit (EDU) or a simple sentence. The research addresses three main problems: how to determine CErel on an EDU-concept pair when a single EDU carries both a causative concept and an effect concept, how to extract causal pathways from EDU-concept pairs having CErel, and how to indicate implicit effect/causative-concept EDUs as implicit mediators and represent them comprehensibly on the extracted causal pathways. We therefore apply an EDU's word co-occurrence concept (wrdCoc) as the EDU concept and, after learning CErel on wrdCoc pairs by supervised machine learning, use the self-Cartesian product of a wrdCoc set from the documents to extract the wrdCoc pairs having CErel into a wrdCoc-pair set. The wrdCoc-pair set is then used to extract the causal pathways by wrdCoc-pair matching through the documents. We then propose transitive closure and a dynamic template to indicate the implicit mediators and represent them with explicit ones. In contrast to previous works, the proposed approach enables causal-pathway extraction with high accuracy from the documents.


Introduction
The objective of this research is to extract causal pathways, particularly disease causal pathways, from disease documents downloaded from several Thai hospital web-boards. The causal pathway extraction is based on determining, from the documents, a sequence of Cause-Effect pairs having a cause-effect relation (called 'CErel'), where a Cause-Effect pair (called 'CEpair') is an ordered pair, Cause is a causative event/state concept, and Effect is an effect event/state concept. According to Khoo [1], CErel is a semantic relation, i.e., a directional link between the concepts that participate in the relation as entities. Concepts connected by a relation are often represented as follows: <Concept1>-(Relation)-<Concept2>, where the '< . . . >' and '( . . . )' symbols represent a concept and a relation type respectively. The dashed line is a directional link between Concept1 and Concept2 and is labeled to indicate the type or meaning of the relation. In our research, CErel as a cause-effect relation type is represented as follows: <CausativeConcept>-(CErel)-><EffectConcept>, where CausativeConcept and EffectConcept are respectively a causative concept and an effect concept of either event or state occurrences in the documents. Moreover, Khoo [1] stated that "concepts and relations are the foundation of knowledge and thought while the concepts are the building blocks of knowledge and the relations are the cement linking up the concepts into the knowledge structures," e.g., a causal chain or a causal pathway combines CausativeConcept, EffectConcept, and CErel into a knowledge structure. According to Staplin et al. [2], the causal pathway in epidemiologic studies is a path that starts at the exposure, ends at the disease, and follows the direction of the arrows. All arrows of the causal pathway point in the same direction, from the exposure toward the outcome, e.g., the disease occurrence is the outcome [3].
According to Gaskell and Sleigh [3], the causal pathway A → M_j → B contains the exposure (A), which might cause the outcome (B) either directly or through an intermediate process or variable called a mediator (M_j), where M_j is a single mediator if j = 1 and M_j is either sequential mediators or multilevel mediators if j = 2, 3, . . . , num, with num an integer. However, our research concerns the extraction of causal pathways with either a single mediator or sequential mediators, which are the cases that mostly occur in the documents. In our disease documents, A, M_j, and B are each either a causative event/state concept or an effect event/state concept, mostly based on a verb phrase of an elementary discourse unit (EDU), where an EDU is a simple sentence or a clause [4]. The EDU expression of the research is based on the general linguistic expression in Figure 1 after word stemming and stop-word removal, where NP1 and NP2 are noun phrases; VP is a verb phrase; Noun is a noun concept set; Verb_strong is a strong verb concept set consisting of causative-verb concepts and effect-verb concepts; Verb_weak is a weak verb concept set requiring more information, i.e., Noun, to become the causative/effect concept, e.g., 'มี/have+ไขมัน/fat+สะสม/accumulate' ('have accumulated fat') and 'เป็น/be+โรค/disease' ('get disease'); Adv is an adverb concept set; Adj is an adjective concept set; and Adjphrase is an adjective phrase.
An example of a causal pathway, expressed in a document by a sequence of Cause-Effect pairs (CEpairs) having CErel, is shown in Example 1.
The extracted causal pathways would support a problem-solving system by giving non-professionals a more comprehensible view of disease complications through social media, which in turn encourages compliance with the physician's suggestion of the appropriate treatment. Therefore, the research concerns extracting from the documents the causal pathways represented by CEpair_i sequences (where i is the index of a CEpair in a sequence).
Several techniques [5][6][7][8][9][10][11][12][13] have been applied for determining or extracting causal pathways or causal chains through cause-effect/causal relation determination between two entities/events from documents (see Section 2). The features used for the CErel determination in the previous research are mainly a noun-variable pair or a verb-variable pair, taken respectively from a noun phrase pair or a verb phrase pair within one simple sentence or a simple sentence pair. In contrast, the causative/effect concepts of the events/states in the Thai documents are mostly based on the EDUs' verb phrase expressions, where the same verb phrase expression with different NP1 concepts yields different causative/effect concepts of the events/states, e.g., EDU1: "(ลิ่มเลือด/Blood clots)/NP1 (ไหลในเลือด/flow in the artery)/VP" and EDU2: "(ไขมัน/Fat)/NP1 (ไหลในเลือด/flow in the artery)/VP" have the flow(BloodClot, artery) and flow(Fat, artery) concepts respectively after word stemming and stop-word removal. Therefore, the features used for the CErel determination in our research are composite variables that rely on a predicate-argument term set to obtain a causative/effect concept, where a composite variable is a single variable made up of two or more individual variables, called indicators [14]. Each indicator alone does not provide sufficient information, but together they can represent a more complex concept. In addition, the entailment classification of the previous research [15] is based on similarity scores, which cannot be applied to our disease documents, e.g., the relation between EDU1: "ผู้ป่วยเป็นโรคหลอดเลือดแข็ง/The patient gets arteriosclerosis" and EDU2: "เพราะไขมันไปเกาะที่ผนังหลอดเลือดแดง/because fat deposits on the artery wall" is CErel even though the similarity score approaches zero.
Moreover, the actual causal pathway determination from texts in the previous research is mostly based on two steps of the cause-effect relation type, e.g., A causes B and B causes C, without considering the implicit mediator, whereas our disease documents contain more than two steps of the cause-effect relation type, including implicit mediators. For the problem-solving system, the implicit mediators on a causal pathway should be represented by explicit mediators so that the pathway is complete and the mechanism through which the composite variable affects the outcome can be understood. However, the Thai documents have several specific characteristics, such as zero anaphora (the implicit noun phrase) and the absence of word and sentence delimiters. These characteristics are involved in three main problems (see Section 3): (1) how to determine CErel on each EDU-concept pair from the documents, where some EDU occurrences carry both the causative concept of one CErel and the effect concept of another CErel; (2) how to extract causal pathways from several EDU-concept pairs having CErel; and (3) how to indicate the implicit mediators (the implicit effect/causative-concept EDUs) on the correctly extracted causal pathways and represent them in the form of explicit effect/causative-concept EDUs (the explicit mediators) for clear comprehension. To address these three problems, we develop a framework that combines machine learning and linguistic phenomena and represents each EDU occurrence by the EDU's word co-occurrence (called 'wrdCo'), based on wrdCoPattern in Equation (1), which relies on the predicate-argument pattern of an EDU occurrence (see Figure 1) after word stemming and stop-word removal.
In addition, each EDU concept of an event/state is represented by an EDU's wrdCo concept (called 'wrdCoc') as a feature, i.e., an element of a wrdCo concept set or wrdCoc set (WC).
where V is a predicate verb set; V = Verb_strong ∪ V_inf; v_a ∈ V; each element of V_inf consists of v_weak,b + w_inf,c (v_weak,b ∈ Verb_weak, w_inf,c ∈ Noun, and w_inf,c is the word right after v_weak,b); a, b, c, d, and e are integers; W1 is an agent argument set; w_1,d ∈ W1; w_1,d is a head noun or a Noun element of NP1, and w_1,d is a Noun element of the previous EDU's NP1 if the current EDU's NP1 is an ellipsis; W2 is a linguistic patient/information set; w_2,e ∈ W2; W2 = Noun ∪ Adv ∪ Adj, and w_2,e is also a word sequence right after v_a; w_2,e has a null value if it does not exist; and the Verb_strong, Verb_weak, Noun, Adv, and Adj sets are all based on Figure 1.
The three contributions of this paper are statistically based approaches that involve linguistic phenomena and machine learning. The first is that each wrdCoc feature used for the CErel determination by machine learning is a composite variable consisting of elements of V, W1, and W2 that represents the causative concept or effect concept. The second is that our extracted causal pathways contain more than two steps of the cause-effect relation type and are the actual causal pathways from the documents. The third is that some extracted causal pathways contain implicit mediators (the implicit effect/causative-concept EDUs) as implicit wrdCoc features, which have to be represented by explicit wrdCoc features to give clearly comprehensible pathways. Moreover, our implicit wrdCoc features are qualitative data, whereas the previous research [16] discovered hidden or latent semantics as implicit features by graph regularization, and the latent semantics of [16] are quantitative data.
We then apply the self-Cartesian product WC × WC [17] (the first WC as the causative-concept set, the second WC as the effect-concept set) to a test corpus to extract the wrdCoc pairs having CErel into WCP (a set of wrdCoc pairs having CErel), after learning CErel on wrdCoc pairs by naïve Bayes (NB) [18], support vector machine (SVM) [19], and logistic regression (LR) [20] from a learning corpus. For the test corpus, all WC elements are determined by wrdCo-expression matching between the wrdCo expressions of the test corpus and those of the semi-automatically annotated corpus having annotated wrdCoc features. WCP is used for extracting the causal pathways through wrdCoc-pair matching on the documents (see Sections 3.1 and 3.2). We then propose using the transitive closure of a binary relation over a causative concept set and an effect concept set [21] to indicate the implicit mediator occurrences on the correctly extracted causal pathways, and using a dynamic template to collect the correctly extracted causal pathways together with the explicit mediators that represent those implicit mediators, where the explicit mediators are the explicit effect/causative-concept EDUs represented by the EDUs' wrdCoc features (see Section 3.3).
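The composite wrdCoc feature and the self-Cartesian product WC × WC can be illustrated with a minimal Python sketch. It assumes a wrdCoc is a (v, w1, w2) triple and that a trained classifier is available as a predicate; the names `WrdCoc`, `extract_wcp`, and `toy_rules`, and the concept rise(bloodSugar), are hypothetical and serve only as illustration.

```python
from itertools import product
from typing import Callable, NamedTuple, Optional, Set, Tuple


class WrdCoc(NamedTuple):
    """Composite variable: predicate verb v, agent w1, patient/information w2."""
    v: str
    w1: str
    w2: Optional[str] = None  # null when no word sequence follows the verb


Pair = Tuple[WrdCoc, WrdCoc]


def extract_wcp(wc: Set[WrdCoc],
                has_cerel: Callable[[WrdCoc, WrdCoc], bool]) -> Set[Pair]:
    """Keep the ordered (cause, effect) pairs of WC x WC labeled as CErel."""
    return {(c, e) for c, e in product(wc, repeat=2) if has_cerel(c, e)}


lack = WrdCoc("lack", "person", "insulin")   # lack(person,insulin)
rise = WrdCoc("rise", "bloodSugar")          # hypothetical concept
flow = WrdCoc("flow", "fat", "artery")       # flow(Fat, artery)
wc = {lack, rise, flow}

# Hypothetical stand-in for a trained NB/SVM/LR model.
toy_rules = {(lack, rise)}
wcp = extract_wcp(wc, lambda c, e: (c, e) in toy_rules)
```

Because the pairs are ordered, (lack, rise) and (rise, lack) are distinct candidates, which is what lets the first WC act as the causative-concept set and the second as the effect-concept set.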
This paper is organized into six sections. Section 2 summarizes related work. Section 3 describes the problems in causal-pathway extraction from the documents, and Section 4 presents our framework for causal pathway extraction. Section 5 evaluates and discusses the proposed model, and Section 6 presents the conclusion.

Related Works
Several strategies [5][6][7][8][9][10][11][12][13] have been proposed to determine or extract a causal pathway, a causal chain, or a causal path of a graph/network through cause-effect/causal relation determination; the exceptions are [13], which works on implicit knowledge completion, and [5,6], which work only on causal/cause-effect relation determination from texts. Girju [5] proposed decision-tree learning of the causal relation from a sentence based on the lexico-syntactic pattern (NP1 causal-verb NP2), where NP1 is a cause and NP2 is an effect or vice versa. Cao et al. [6] also used syntactic patterns, manually annotating a cue (a word or a phrase) within one sentence or between two sentences as a cause-effect link expressing the cause-effect relation, which is the core of scientific papers. Their cause-effect links were extracted from scientific papers by a syntactic pattern-based algorithm with an average precision of 47% and an average recall of 70%. Chang and Choi [7] extracted causality/causal relations with an F-score of 77.37% from one complex sentence or two simple sentences by using a cue-phrase set, together with probabilities, to connect two noun phrases (an NP pair) as a cause and an effect. The extracted causal relations were used for constructing the causal paths of a causal network for the term protein with two relations: the causal relation and the hypernym relation. Pechsiri and Piriyakul [8] applied verb-pair rules obtained from machine learning techniques to extract the causality or cause-effect relation from several simple sentences and construct paths from one cause to several effects on an explanation knowledge graph. The cause-effect paths of [8] emphasized the consequence or concurrent occurrence of the extracted effect events.
In [9], a causal chain was generated by connecting the extracted causal relations through sentence word similarity and topic matching between the causative sentence of one causal relation and the effect sentence of another, where the causal relation extraction was based on clue words. Kang et al. [10] applied the Granger causality model with features such as N-words, topics, and sentiments to detect cause-effect relationships from texts for a time series. They also applied a neural reasoning algorithm based on human annotation, along with BLEU (bilingual evaluation understudy) scores for measuring the connection of two cause-effect relationships, to construct a causal chain with 57% accuracy based on expert judgments. However, the cause/effect events or entities of [10] are mainly expressed by noun phrases based on a day-by-day time series. Izumi and Sakaji [11] applied a causal verb set as the edge/relation to construct causal paths by connecting a cause node and an effect node expressed by noun phrases within one sentence. The causal chain was constructed by manually selecting word vector similarities between effect nodes and cause nodes from different causal relations. Nordon et al. [12] extracted several causal relations based on the lexico-syntactic pattern [5] and then applied text analysis, i.e., word co-occurrence and Word2vec, to determine the edge weights for solving each causal path of the causal graph built from the extracted causal relations. Moreover, Ref. [13] applied the similarity score between two word pairs as an event pair, including the notified event location, to calculate event relevance for automatically discovering implicit event knowledge occurring in a sequential event chain of actions from a Japanese web blog corpus, without considering CErel between the event pair, e.g., "roll on the floor", "sitting on a sofa", and "drinking tea" formed an event chain of actions with the Living room location added. The knowledge completion for the chain of actions (including the notified event location) in [13] was evaluated by graduate students, scoring 3.0 on a five-point Likert scale.
Therefore, the causal relation determination of the previous research [5][6][7][8][9][10][11][12] is mostly based on noun phrases within one or two simple sentences, except [8], which uses only verb pairs to extract the causal relation from several simple sentences. In contrast, CErel in our research is based on wrdCoPattern in Equation (1), which includes an NP1 head noun and an EDU's verb phrase, because different agents (NP1) with the same predicate verb yield different semantics of the causative/effect concepts. The causal pathways, causal chains, or causal-graph paths of the previous research [7][8][9][10][11][12] are determined or extracted from documents without considering the implicit mediator on a given path/chain, whilst [13] emphasizes implicit knowledge completion on the event chain of actions without considering CErel. Thus, only a few works extract causal pathways from texts, and they pay little attention to the implicit mediator.

How to Determine CErel on an EDU-Concept Pair/a wrdCoc Pair
According to the corpus behavior study of the medical care domain, most of the causative/effect-concept EDUs are events or states expressed by verb phrases. Some verb phrase expressions in the documents carry both a causative concept and an effect concept, as shown in Example 1, e.g., EDU2 is an effect-event concept for CEpair_1 and a causative-event concept for CEpair_2, where CEpair_1 and CEpair_2 are consecutive. Moreover, the lack of a sentence delimiter in the Thai documents causes a problem in determining which EDU-concept pair (e.g., an EDU1-EDU2 pair or an EDU2-EDU3 pair) has CErel among three consecutive EDUs if the second EDU contains an element of a discourse-marker cue set, {'เพราะ/because', 'เนื่องจาก/since', . . . }, as shown in Example 2. Each EDU concept is represented by wrdCoc, i.e., the EDU_j concept is represented by EDU_j's wrdCoc, called wrdCoc_EDUj, where j is an integer.

Example 2.
Topic Name: โรคเบาหวาน/Diabetes . . .
Example 2 contains a CEpair_i with CErel as follows:
wrdCoc_EDU2-wrdCoc_EDU3 Pair as CEpair_1: wrdCoc_EDU2 <Cause>-(CErel)-> <Effect> wrdCoc_EDU3
Therefore, we apply the self-Cartesian product WC × WC, with the first WC as a causative concept set and the second WC as an effect concept set, to the test corpus to extract the wrdCoc pairs of EDU pairs having CErel into WCP, after learning CErel on each wrdCoc pair by NB, SVM, and LR from the learning corpus (see Section 4.2). The WC elements are collected by wrdCo-expression matching between the wrdCo expressions of the test corpus and the wrdCo expressions of the learning corpus with the semi-automatically annotated wrdCoc features (see Section 4.1).

How to Extract the Causal Pathways
During the causal pathway extraction, some causal pathways are interleaved with non-causative/effect-concept EDUs, which remains a challenge. For instance, Example 1 is interleaved with a non-causative/effect-concept EDU, EDU2: "อินซูลินมีหน้าที่ส่งสัญญาณให้เซลล์นำน้ำตาลไปใช้/Insulin has a function of signaling cells to take sugar for use", which intervenes after EDU1 of Example 1 as shown in the following:

EDU1. "เมื่อผู้ป่วยขาดอินซูลิน/When a patient lacks insulin,"
wrdCoc_EDU1 = lack(person,insulin)
EDU2. "อินซูลินมีหน้าที่ส่งสัญญาณให้เซลล์นำน้ำตาลไปใช้/Insulin has a function of signaling cells to take sugar for use."
Each wrdCoc feature having a predicate verb v_a ∈ Verb_strong ∪ V_inf in the test-corpus document is sequentially collected into an array of wrdCoc features for the causal pathway extraction. All wrdCoc features of the test-corpus document are obtained by wrdCo-expression matching between the wrdCo expressions of the test-corpus document and the wrdCo expressions of the annotated corpus having the annotated wrdCoc features.
Therefore, we apply WCP to extract each causal pathway by wrdCoc-pair matching with a sliding window of two consecutive wrdCoc features (a wrdCoc pair) on the array of wrdCoc features: the window is matched against the WCP elements and slides through the array by one wrdCoc at a time. If the window matches no WCP element, we stop sliding the window and obtain a causal pathway (Section 4.4).
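The sliding-window matching can be sketched in Python as follows. This is a minimal illustration, not the implementation of Section 4.4: the function name `extract_pathways` is hypothetical, the feature labels are placeholders, and resuming the scan after a failed match to look for further pathways is an assumption.

```python
from typing import Hashable, List, Sequence, Set, Tuple


def extract_pathways(features: Sequence[Hashable],
                     wcp: Set[Tuple[Hashable, Hashable]]) -> List[List[Hashable]]:
    """Slide a window of two consecutive wrdCoc features over the array.

    A pathway grows while each window is an element of WCP and is closed
    at the first window that does not match.
    """
    pathways: List[List[Hashable]] = []
    current: List[Hashable] = []
    for cause, effect in zip(features, features[1:]):
        if (cause, effect) in wcp:
            if not current:          # first matching window opens a pathway
                current.append(cause)
            current.append(effect)
        elif current:                # no match: stop and emit the pathway
            pathways.append(current)
            current = []
    if current:                      # pathway that runs to the end of the array
        pathways.append(current)
    return pathways


# Placeholder labels standing in for wrdCoc features.
array_of_wrdcoc = ["A", "B", "C", "X", "D", "E"]
wcp = {("A", "B"), ("B", "C"), ("D", "E")}
print(extract_pathways(array_of_wrdcoc, wcp))
```

Since only wrdCoc features with a predicate verb in Verb_strong ∪ V_inf enter the array, intervening non-causative/effect EDUs are assumed to have been filtered out before this step.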

How to Indicate Implicit Mediators for Explicit Mediator Representation
Some determined causal pathways contain implicit mediators (the implicit effect/causative-concept EDUs), as in Examples 3-4 of the same disease group.

EDU1. "ถ้าระดับน้ำตาลในเลือดสูงเกิดขึ้นเป็นระยะเวลานาน/If hyperglycaemia occurs for a long-term,"
wrdCoc_EDU1-wrdCoc_EDU2 Pair as CEpair_1: wrdCoc_EDU1 <Cause>-(CErel)-> <Effect> wrdCoc_EDU2
wrdCoc_EDU2-wrdCoc_EDU3 Pair as CEpair_2: wrdCoc_EDU2 <Cause>-(CErel)-> <Effect> wrdCoc_EDU3
wrdCoc_EDU3-wrdCoc_EDU5 Pair as CEpair_3: wrdCoc_EDU3 <Cause>-(CErel)-> <Effect> wrdCoc_EDU5
According to Example 4, the causal pathway of Figure 3, particularly in a dashed-line square, contains EDU2 as an implicit mediator between EDU2 and EDU3 in another dashed-line square of the causal pathway in Figure 2, whilst EDU4 of the causal pathway in Figure 2 is another implicit mediator between EDU3 and EDU5 of the causal pathway in Figure 3. Here, chronic kidney disease and kidney failure share the same concept of kidney deterioration. Therefore, we propose using TransCEPair, a set of CEpairs whose CErel is transitive (which is equivalent to having an implicit mediator), obtained from the transitive closure of the binary relation over all correctly extracted causal pathways, to indicate the implicit mediators on each correctly extracted causal pathway. We also propose using the following ExplicitCEpairWithCErelPathways template as the dynamic template to collect the extracted causal pathways with the explicit mediators, represented by the EDUs' wrdCoc features, that represent the implicit mediators (see Section 4.5).

Dynamic ExplicitCEpairWithCErelPathways Template:
wrdCoc_EDU(j)-wrdCoc_EDU(j+1) Pair as CEpair_p1: wrdCoc_EDU(j) <Cause>-(CErel)-> <Effect> wrdCoc_EDU(j+1)
wrdCoc_EDU(j+1)-wrdCoc_EDU(j+2) Pair as CEpair_p2: wrdCoc_EDU(j+1) <Cause>-(CErel)-> <Effect> wrdCoc_EDU(j+2)
where numberCP is the number of correct extracted causal pathways.
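One way to read the transitive-closure step is sketched below in Python. It assumes that a directly adjacent CEpair (a, c) in one pathway has an implicit mediator m whenever other pathways supply a chain a → ... → m → ... → c; this reading, the function names, and the placeholder labels s3, s4, and s5 are assumptions made for illustration rather than the procedure of Section 4.5.

```python
from typing import Dict, Hashable, List, Set, Tuple

Pair = Tuple[Hashable, Hashable]


def ce_pairs(pathways: List[List[Hashable]]) -> Set[Pair]:
    """Binary relation: every adjacent (cause, effect) pair on any pathway."""
    return {(p[i], p[i + 1]) for p in pathways for i in range(len(p) - 1)}


def transitive_closure(relation: Set[Pair]) -> Set[Pair]:
    """Repeatedly compose the relation with itself until nothing new appears."""
    closure = set(relation)
    while True:
        new = {(a, d) for a, b in closure for c, d in closure if b == c} - closure
        if not new:
            return closure
        closure |= new


def implicit_mediators(pathways: List[List[Hashable]]) -> Dict[Pair, Set[Hashable]]:
    """Map each direct CEpair to the mediators that other pathways make explicit."""
    direct = ce_pairs(pathways)
    closure = transitive_closure(direct)
    found: Dict[Pair, Set[Hashable]] = {}
    for a, c in direct:
        mids = {m for x, m in closure
                if x == a and m not in (a, c) and (m, c) in closure}
        if mids:
            found[(a, c)] = mids
    return found


# Placeholder pathways: one states s3 -> s4 -> s5, the other only s3 -> s5.
print(implicit_mediators([["s3", "s4", "s5"], ["s3", "s5"]]))
```

In this toy case the pair (s3, s5) is both a direct CEpair and a transitive one, so s4 is reported as the mediator that the second pathway leaves implicit.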

System Overview
There are five steps in our framework, Corpus Preparation, CErel Learning on Each wrdCoc Pair, Determination of wrdCoc Pairs Having CErel, Causal Pathway Extraction, and Implicit-Mediator Indication and Representation with Explicit-Mediator from Dynamic Template as shown in Figure 5.



Word and EDU Segmentations
This step prepares an EDU corpus from disease-explanation documents downloaded from several hospital web-boards (http://haamor.com/; http://www.bangkokhealth.com; http://www.si.mahidol.ac.th/sidoctor/e-pl/; https://www.bumrungrad.com; etc., accessed on 10 August 2021). The step uses Thai word-segmentation tools [23] and named-entity recognition [24,25]. After the word segmentation, EDU segmentation [26,27] is performed to provide an 8000-EDU corpus (4000 EDUs from a diabetes and kidney disease group and 4000 EDUs from a heart and artery disease group). This 8000-EDU corpus is separated into two parts after word stemming and stop-word removal. The first part (2000 EDUs from the diabetes and kidney disease group and 2000 EDUs from the heart and artery disease group) is the corpus for the semi-automatic annotation, by the experts, of the wrdCo concepts (as the wrdCoc features) and the relation class of each wrdCoc pair in the next step (Section 4.1.2); this annotated corpus is used as the learning corpus in Section 4.2. The second part is a test corpus, which consists of 2000 EDUs from the diabetes and kidney disease group and 2000 EDUs from the heart and artery disease group. The test corpus of each disease group is used for (1) determining and collecting wrdCoc pairs as CEpairs having CErel into WCP in Section 4.3 and (2) extracting the causal pathways in Section 4.4.

Semi-Automatic Corpus Annotation
The semi-automatic corpus annotation of each disease group consists of the wrdCoc feature annotation on the wrdCo expressions and the CErel annotation on each wrdCo-expression pair. We semi-automatically annotate the corpus by using the elements of a discourse-marker cue set, {'ทำให้/causing', 'เพราะ/because', 'เนื่องจาก/since'}, as anchors on the corpus documents to obtain the predicate verbs v_a (v_a ∈ Verb_strong ∪ V_inf; a = 1, 2, . . . , numofPredicateVerbs) from all EDU occurrences right before and right after the anchored cue-set elements. We then obtain a V-pair set as the result of the self-Cartesian product (V × V). We use all V-pair set elements to search for two adjacent predicate verbs (v_a1 v_a2, where v_a1, v_a2 ∈ V; v_a1 ≠ v_a2; a1 ≠ a2) of two adjacent EDU occurrences and to automatically annotate the v_a, w_1,d, and w_2,e terms of the two adjacent EDUs' wrdCo expressions. The wrdCo concepts are then annotated as the wrdCoc features by the experts, who select the concepts from the Lexitron Dictionary after the Thai-to-English translation. The concepts from the Lexitron Dictionary are referred to the Thai Encyclopedia (https://www.saranukromthai.or.th/index2.php), MeSH (https://www.ncbi.nlm.nih.gov/mesh accessed on 10 August 2021), and WordNet [28] (http://wordnet.princeton.edu/obtain accessed on 10 August 2021). Additionally, the relation class (CErel/nonCErel) between two annotated wrdCoc features as a wrdCoc pair (or CEpair) in the annotated corpus is also annotated by the experts, as shown in Figure 6, for learning the relation class in Section 4.2. Both the wrdCo expressions and the wrdCoc features from both disease groups are collected into the wrdCo-Concept Table (see Table 1), which contains several wrdCo expressions with the same wrdCoc feature and from which duplicate entries are eliminated.
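The cue anchoring and the V-pair set can be sketched in Python under simplifying assumptions. Here each EDU is reduced to a hypothetical (leading cue, predicate verb concept) tuple, and the cue is assumed to open the EDU it belongs to, so that the EDUs "right before and right after" the cue are the preceding EDU and the cue-bearing EDU itself. The names `anchored_verbs` and `v_pairs` and the verb labels are illustrative only.

```python
from itertools import product
from typing import List, Optional, Set, Tuple

# Discourse-marker cue set from the text.
CUES = {"ทำให้", "เพราะ", "เนื่องจาก"}

# Hypothetical reduced EDU: (leading cue or None, predicate verb concept).
Edu = Tuple[Optional[str], str]


def anchored_verbs(edus: List[Edu]) -> Set[str]:
    """Collect predicate verbs of the EDUs adjacent to each anchored cue."""
    verbs: Set[str] = set()
    for i, (cue, verb) in enumerate(edus):
        if cue in CUES:
            verbs.add(verb)                 # EDU right after the cue
            if i > 0:
                verbs.add(edus[i - 1][1])   # EDU right before the cue
    return verbs


def v_pairs(verbs: Set[str]) -> Set[Tuple[str, str]]:
    """Self-Cartesian product V x V of the collected predicate verbs."""
    return set(product(verbs, repeat=2))


edus = [(None, "get"), ("เพราะ", "deposit"), (None, "flow")]
v = anchored_verbs(edus)
print(sorted(v_pairs(v)))
```

The product also contains pairs of identical verbs; the search over adjacent EDUs described above would skip those, since it requires v_a1 ≠ v_a2.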

CErel Learning on Each wrdCoc Pair
The objective of this step is to learn CErel on each wrdCoc pair (a wrdCoc EDU pair as a CEpair) with the CErel/nonCErel class from the annotated corpus, which is used as the learning corpus, in order to obtain the WCP of each disease group in the next section. The annotated corpus of each disease group from Section 4.1 contains several EDUs with wrdCoc-pair class annotations given by the wrdCocPair tag. The wrdCoc features of each disease group, e.g., a CwrdCoc feature and an EwrdCoc feature (where CwrdCoc is a causative wrdCo concept and EwrdCoc is an effect wrdCo concept), are obtained from the wrdCo tag containing 'Concept' and 'type' (Figure 6). All annotated wrdCoc pairs, i.e., (CwrdCoc, EwrdCoc) pairs labeled CErel/nonCErel by the wrdCocPair tag of each disease group, are used for learning CErel by NB, SVM, and LR based on ten-fold cross validation.
(a) NB [18]. The NB learning results of each disease group, obtained in this step by using Weka (http://www.cs.wakato.ac.nz/ml/weak/ accessed on 10 August 2021), are the probabilities of CErel and nonCErel of the CwrdCoc features and the EwrdCoc features in wrdCoc pairs, as shown in Table 2, where CwrdCoc ∈ CWC, which is a causative-wrdCo-concept set; EwrdCoc ∈ EWC, which is an effect-wrdCo-concept set; and CWC ∩ EWC ≠ ∅.
(b) SVM [19]. The SVM learning is a linear binary classification applied to classify each wrdCoc pair from the annotated corpus as CErel or nonCErel by using Weka. The linear function, f(x), of the input x = (x_1, x_2, ..., x_n), which assigns x to the CErel class if f(x) > 0 and otherwise to the nonCErel class, is given by Equation (3):
$$f(x) = \langle w \cdot x \rangle + b = \sum_{j=1}^{n} w_j x_j + b \qquad (3)$$
where x is a dichotomous (binary) vector, w is the weight vector, b is the bias, and (w, b) ∈ R^n × R are the parameters that control the function. The SVM learning determines w_j and b for each wrdCoc feature (x_j), which is either a CwrdCoc feature or an EwrdCoc feature in each wrdCoc pair with CErel or nonCErel from the annotated corpus of each disease group.
Figure 6. Annotation of wrdCo concepts or wrdCoc features, including the CErel/nonCErel class between the wrdCoc pair, where the v, w1, and w2 symbols in a wrdCo tag are the v_a, w_{1,d}, and w_{2,e} terms/elements, respectively, of wrdCoPattern.
(c) LR [20]. The logistic regression model of the research is based on linear logistic regression with binary vector data. The distinguishing feature of the logistic regression model is that the outcome variable is binary or dichotomous. The logistic function takes an input of any value from negative to positive infinity and yields an output between 0 and 1, which is hence interpretable as a probability and is used to establish which attributes are influential in predicting the given outcome. The logistic function can be written as:
$$F(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$$
F(x) is interpreted as the probability of the given outcome to be predicted, where x_1 and x_2 are attribute variables; β_0 is the bias; and β_1 and β_2 are the model estimators (coefficients) that weight each attribute. The LR learning determines β_0, β_1, and β_2 for each CwrdCoc feature and each EwrdCoc feature as the x_1 and x_2 features, respectively, in each wrdCoc pair (CwrdCoc, EwrdCoc) with either the positive/CErel class or the negative/nonCErel class by supervised learning on the learning corpus of each disease group.
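The scoring side of the linear SVM in (b) and the logistic function in (c) can be illustrated with a short sketch. The weights, bias, and β values used here are arbitrary placeholders chosen for the example, not values learned in the paper, and the function names are hypothetical.

```python
import math

def svm_decision(x, w, b):
    """Linear decision value f(x) = sum_j w_j * x_j + b for a binary feature vector x."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def svm_class(x, w, b):
    """Assign CErel when f(x) > 0, otherwise nonCErel."""
    return "CErel" if svm_decision(x, w, b) > 0 else "nonCErel"

def logistic(x1, x2, beta0, beta1, beta2):
    """Logistic function F(x) = 1 / (1 + exp(-(beta0 + beta1*x1 + beta2*x2)))."""
    z = beta0 + beta1 * x1 + beta2 * x2
    return 1.0 / (1.0 + math.exp(-z))

# Placeholder parameters (assumptions for illustration only).
x = [1, 0, 1]                 # dichotomous vector: which wrdCoc features are present
w = [0.5, -1.0, 0.25]
b = -0.5

score = svm_decision(x, w, b)             # 0.5 + 0.25 - 0.5
label = svm_class(x, w, b)
prob = logistic(1, 1, -1.0, 0.5, 0.5)     # z = 0, so the probability is 0.5
```

Learning w, b, and the β values is the part done by the learner (Weka in the paper); the sketch only shows how a learned model would score one wrdCoc pair.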
The learning results by NB, SVM, and LR models are the estimators which are used for determining wrdCoc pairs having CErel from the test corpus of each disease group in the next step of Section 4.3. Moreover, all precisions of learning by NB, SVM, and LR from the learning corpus of each disease group are greater than 0.8.
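As an illustration of the kind of estimators NB produces (class-conditional probabilities of CwrdCoc and EwrdCoc features, as in Table 2), the following sketch trains a tiny Naive Bayes model in plain Python. It is not the Weka implementation used in the paper: the add-one smoothing, the function names, and the toy training pairs (apart from the two concept names quoted from the paper) are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(pairs):
    """Count class, causative-concept, and effect-concept frequencies per class."""
    class_counts = Counter(label for _, _, label in pairs)
    cause_counts = defaultdict(Counter)
    effect_counts = defaultdict(Counter)
    for cause, effect, label in pairs:
        cause_counts[label][cause] += 1
        effect_counts[label][effect] += 1
    causes = {c for c, _, _ in pairs}
    effects = {e for _, e, _ in pairs}
    return class_counts, cause_counts, effect_counts, causes, effects

def nb_log_score(model, cause, effect, label):
    """log P(label) + log P(cause | label) + log P(effect | label), add-one smoothed."""
    class_counts, cause_counts, effect_counts, causes, effects = model
    n = class_counts[label]
    prior = n / sum(class_counts.values())
    p_cause = (cause_counts[label][cause] + 1) / (n + len(causes))
    p_effect = (effect_counts[label][effect] + 1) / (n + len(effects))
    return math.log(prior) + math.log(p_cause) + math.log(p_effect)

def classify(model, cause, effect):
    """Pick the class with the highest Naive Bayes score."""
    return max(model[0], key=lambda label: nb_log_score(model, cause, effect, label))

# Toy annotated wrdCoc pairs (invented labels and counts, for illustration only).
training_pairs = [
    ("haveHyperglycaemia", "becomeInflamed", "CErel"),
    ("haveHyperglycaemia", "becomeInflamed", "CErel"),
    ("haveHyperglycaemia", "haveKidneyFailure", "CErel"),
    ("visitDoctor", "takeMedicine", "nonCErel"),
    ("visitDoctor", "becomeInflamed", "nonCErel"),
]

model = train_nb(training_pairs)
pred_a = classify(model, "haveHyperglycaemia", "becomeInflamed")
pred_b = classify(model, "visitDoctor", "takeMedicine")
```

The same counts could be evaluated under ten-fold cross validation by repeating `train_nb` on nine folds and calling `classify` on the held-out fold.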

Determination of wrdCoc Pairs Having CErel
The WC elements are determined from all wrdCo expressions on the test corpus of each disease group by the wrdCo-expression matching between the wrdCo expressions on this test corpus and the wrdCo expressions in the wrdCo-Concept Table (Table 1), which yields the wrdCoc features, i.e., the WC elements. The result of the self-Cartesian product (WC × WC) is a set of ordered wrdCo-concept pairs. The learned NB, SVM, and LR models are then applied to this set to determine the wrdCoc pairs having CErel and to collect them into WCP of each disease group.
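The two operations in this step, table lookup and filtering of WC × WC, can be sketched as below. The table excerpt, the expression tuples, and the `toy_classifier` stand-in for a learned NB/SVM/LR model are hypothetical; only the overall flow follows the description above.

```python
from itertools import product

# Hypothetical excerpt of the wrdCo-Concept Table: (v, w1, w2) expression -> wrdCoc feature.
WRDCO_CONCEPT_TABLE = {
    ("have", "sugar", "high"): "haveHyperglycaemia",
    ("become", "vessel", "inflamed"): "becomeInflamed",
    ("feel", "body", "tired"): "beFatigued",
}

def match_concepts(expressions, table):
    """Complete matching of wrdCo expressions against the table; unmatched ones are skipped."""
    return [table[e] for e in expressions if e in table]

def build_wcp(wc, is_cerel):
    """Keep the ordered pairs of WC x WC that the classifier labels as CErel."""
    return {(c, e) for c, e in product(sorted(wc), repeat=2) if is_cerel(c, e)}

# Stand-in for a learned model: a lookup of pairs assumed to have CErel.
KNOWN_CEREL = {("haveHyperglycaemia", "becomeInflamed")}

def toy_classifier(cause, effect):
    return (cause, effect) in KNOWN_CEREL

expressions = [
    ("have", "sugar", "high"),
    ("go", "market", "early"),            # not in the table, so it is skipped
    ("become", "vessel", "inflamed"),
]
wc = set(match_concepts(expressions, WRDCO_CONCEPT_TABLE))
wcp = build_wcp(wc, toy_classifier)
```

Because the product is over ordered pairs, (A, B) and (B, A) are classified separately, which matches the directional nature of CErel.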

Causal Pathway Extraction
All wrdCoc features of each test-corpus document of each disease group are sequentially collected in an array of wrdCoc features (wcc[ ]) after the wrdCo-expression matching between the wrdCo expressions of this test-corpus document and the wrdCo expressions in the wrdCo-Concept Table (Table 1). The causal pathways are then extracted by the wrdCoc-pair matching between wcp_k (wcp_k ∈ WCP; k = 1, 2, ..., numberOfWCPelements) and each pair of consecutive wrdCoc features in wcc[ ], wcc[ct] wcc[ct+1] (ct = 1, 2, ..., numberOFwrdCocFromTestCorpusDocument), by sliding a window of two consecutive wrdCoc features over wcc[ ] with a step of one wrdCoc (ct++). We stop sliding the window when the wrdCoc-pair matching finds no match. We then obtain a causal pathway, as shown in Algorithm 1, where CEpair_i is a wrdCoc pair (wcc[ct] wcc[ct+1]) in wcc[ ], and allPathways (an array of arrayList whose array size is given by the 'a' variable) contains several causal pathways.
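A minimal sketch of the sliding-window matching is given below. It assumes that, after a mismatch closes the current pathway, scanning resumes at the next position so that one document can yield several pathways; this restart behaviour and the function name `extract_pathways` are assumptions, and the single-letter concepts are placeholders.

```python
def extract_pathways(wcc, wcp):
    """Slide a window of two consecutive wrdCoc features over wcc with step one.

    Consecutive windows that match a WCP element extend the current pathway;
    the first non-matching window closes it.
    """
    all_pathways = []
    current = []
    for ct in range(len(wcc) - 1):
        pair = (wcc[ct], wcc[ct + 1])
        if pair in wcp:
            current.append(pair)
        elif current:
            all_pathways.append(current)
            current = []
    if current:
        all_pathways.append(current)
    return all_pathways

# Placeholder wrdCoc sequence of one document and a placeholder WCP set.
wcc = ["A", "B", "C", "X", "D", "E"]
wcp = {("A", "B"), ("B", "C"), ("D", "E")}

all_pathways = extract_pathways(wcc, wcp)
```

Here the window (C, X) has no match in WCP, so the pathway A → B → C is closed there and a second pathway D → E is found later in the same document.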

Implicit-Mediator Indication and Representation with Explicit-Mediators
The correctly extracted causal pathways in the allPathways result of each disease group, produced by the CausalPathwayExtraction algorithm in Section 4.4, consist of the explicit-mediator causal pathways and the implicit-mediator causal pathways. The allPathways result of each disease group also contains some duplicate causal pathways. Therefore, allPathways is sorted and the duplicate causal pathways are eliminated to obtain PathWays (an array of arrayList with an updated 'a' variable) before the implicit mediators are indicated on the correctly extracted causal pathways. For PathWays of each disease group, the causal pathways containing the explicit mediators represented by EDUs' wrdCoc features are collected into the dynamic template, the ExplicitCEpairWithCErelPathways template (see Section 3.3), which is the ExplicitPath variable in the ExplicitCausalPathwayRepresentation algorithm (Algorithm 2), whilst the causal pathways containing the implicit mediators are collected into an ImplicitPath variable as a temporary template.
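One possible reading of this step is sketched below: duplicates are removed from allPathways, and a CEpair (A, C) is treated as hiding implicit mediators when the explicit pairs also connect A to C through intermediate concepts, in which case the explicit chain is substituted for it. This interpretation, the breadth-first search used to find the chain, and all names and data are assumptions, not the paper's Algorithm 2.

```python
from collections import deque

def dedupe(all_pathways):
    """Sort the pathways and drop exact duplicates."""
    return [list(p) for p in sorted({tuple(p) for p in all_pathways})]

def mediator_chain(cause, effect, edges):
    """Shortest chain cause -> ... -> effect that avoids the direct edge, or None."""
    succ = {}
    for a, b in edges:
        if (a, b) != (cause, effect):
            succ.setdefault(a, []).append(b)
    queue = deque([[cause]])
    seen = {cause}
    while queue:
        path = queue.popleft()
        for nxt in sorted(succ.get(path[-1], [])):
            if nxt == effect:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

def expand(pathway, edges):
    """Replace each transitive pair by the explicit pairs that mediate it."""
    expanded = []
    for cause, effect in pathway:
        chain = mediator_chain(cause, effect, edges)
        nodes = chain if chain else [cause, effect]
        expanded.extend(zip(nodes, nodes[1:]))
    return expanded

# Placeholder CEpairs with CErel; (A, C) is transitive via the explicit mediator B.
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
pathways = dedupe([[("A", "C"), ("C", "D")], [("A", "C"), ("C", "D")], [("A", "B")]])
explicit = expand(pathways[1], edges)
```

In the toy data, the pathway A → C → D is rewritten as A → B → C → D, so the mediator B that was implicit in the extracted pathway becomes explicit.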

Algorithm 1 Causal Pathway Extraction
CAUSAL_PATHWAY_EXTRACTION
/* Extraction of several CEpair_i sequences as causal pathways. */
/* Assume that each EDU is represented by (NP1 VP). */
/* L is a list of EDUs from one test-corpus document after word stemming and stop-word removal. */
/* CEpair_i is a wrdCoc pair with index i of the causal pathway. */
/* wcc[ ] is an array of wrdCoc and is collected from this test corpus. */
/* WCP is a set of wrdCoc pairs having CErel. */
...
If wcexp_j.v ∈ Vstrong ∪ Vinf then /* wcexp_j.v is a predicate verb v_a on a wrdCo expression with index j. */
... Table (Table 1)
...

Algorithm 2 Explicit Causal Pathway Representation
... Figure 7) with eliminating the duplicate causal pathways. */
/* trsvSet is TransCEPair which is a set of CEpairs with CErel to be transitive. */
/* ExplicitPath is a dynamic ExplicitCEpairWithCErelPathways template. */
...
while (i ≤ Pathways[α].numberOfCauseEffectConceptPairs) ∧ (Pathways[α].Get(CEpair_i) TrsvSet) do
/* add explicitCEpair_i to ExplicitPath. */
...

Evaluation and Discussion
The test corpus of 4000 EDUs, employed to evaluate the proposed methodology for the causal pathway extraction through determining wrdCoc pairs having CErel, was collected from the downloaded disease documents on Thai hospital web-boards. The test corpus consists of 2000 EDUs from the diabetes and kidney disease group documents and 2000 EDUs from the heart and artery disease group documents. There are three evaluations: (1) the determination of wrdCoc pairs having CErel, (2) the causal pathway extraction, and (3) the implicit-mediator indication and representation with the explicit mediators from the dynamic template.


Determination of wrdCoc Pairs Having CErel
The extraction of the EDU-concept pairs (wrdCoc pairs) having CErel from the documents of the diabetes and kidney disease group and the heart and artery disease group is evaluated by precision and recall, based on three experts with max-win voting, as shown in Table 3. Table 3 also includes the number of different wrdCoc features, whose frequencies are shown in Figure 8.
Figure 8. The wrdCoc frequencies of the causative concepts and the effect concepts from the diabetes and kidney disease group and the heart and artery disease group.
From Table 3, the average precisions of extracting wrdCoc pairs having CErel from the documents of the diabetes and kidney disease group and the heart and artery disease group are 0.871 and 0.833, respectively, with average recalls of 0.791 and 0.753 and average F-scores of 0.830 and 0.791, respectively. By comparison, the cause-effect relation extraction of the previous research [7], which is based on the probabilities of words in an NP pair and the cue phrase probability from a complex sentence or two simple sentences in the medical domain, has an F-score of 0.774.
With regard to our results in Table 3, the diabetes and kidney disease group has higher precision and recall of extracting wrdCoc pairs having CErel than the heart and artery disease group because the heart and artery disease group has more diverse wrdCoc features in both the causative concepts and the effect concepts. The high diversity of wrdCoc features results in low frequencies of most wrdCoc features. In addition, there are some dependency occurrences among wrdCoc features, e.g., EDU1: 'haveHyperglycaemia(person)' → EDU2: 'becomeInflamed(BloodVessel)', where EDU1's wrdCoc and EDU2's wrdCoc mostly occur as a cause-effect relation or a dependency occurrence in the documents, but some documents contain EDU1's wrdCoc followed by EDU2's wrdCoc without the cause-effect relation. Thus, the wrdCo diversity and the wrdCoc dependency result in SVM having the highest precision in both the diabetes and kidney disease group and the heart and artery disease group. However, both disease groups have low recalls because of the diversity of wrdCo expressions in the downloaded documents of both disease groups.
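The evaluation measures used here can be made concrete with a short sketch of max-win (majority) voting over three expert judgments followed by precision, recall, and F-score. The pair identifiers and the expert votes are invented for illustration and do not reproduce Table 3.

```python
def majority(votes):
    """Max-win voting: True when more than half of the expert votes are True."""
    return sum(votes) > len(votes) / 2

def precision_recall_f1(predicted, gold):
    """Precision, recall, and F-score of a predicted set against a gold set."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented expert votes on whether each candidate pair (by id) truly has CErel.
expert_votes = {
    1: [False, False, True],
    2: [True, True, False],
    3: [True, True, True],
    4: [True, False, True],
    5: [True, True, True],
    6: [False, True, True],
}
gold = {pair_id for pair_id, votes in expert_votes.items() if majority(votes)}
predicted = {1, 2, 3, 4}      # pairs the system labeled as CErel (toy data)

precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Note that an average F-score over several classifiers need not equal the F-score computed from the average precision and the average recall, which is one reason reported averages can differ slightly in the last digit.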

Causal Pathway Extraction
The causal pathway extraction from the test corpus is evaluated by precision and recall based on three experts with max-win voting, as shown in Table 4. The causal pathway determination from the documents of the two disease groups, as shown in Table 4, has an average precision of 0.834 with an average recall of 0.715. The reason for the low recall of extracting causal pathways from the documents is that some causal pathways start with EDUs containing the causative/effect concept expressed by either NP1 or NP2, as shown in EDU1 of Example 5, instead of by the predicate verb (Verb) of the general linguistic expression in Figure 1. By comparison, the previous work [10] on extracting and constructing causal chains/pathways from a large corpus of on-line social media (tweets, news articles, and blogs), which relied on time series through the prediction of noun phrases as the next effect, reported 57% accuracy based on expert judgments, whereas our causal pathways rely on the actual events/states with the causative/effect concepts.

Implicit-Mediator Indication and Representation with Explicit-Mediators
We evaluate the implicit mediator indication and representation with the explicit mediators (from the dynamic template) in terms of a Likert scale (1 to 5) for the concise and comprehensible representations of the correctly extracted causal pathways. The evaluation results for both disease groups, given as the average Likert-scale scores of the concise and comprehensible representations of Doc (the causal pathway representation by explanation in the documents) and Graph (the causal pathway representation by the correctly extracted causal pathway with the explicit mediators from the documents) as rated by 30 end-users (who are non-professional persons), are presented in Table 5 and Figure 9.
From Table 5 and Figure 9, the Graph Representations of both disease groups are more concise and more comprehensible than the Doc Representations of both disease groups. Moreover, from Table 5, the average scores of the concise representation and the comprehensible representation by the Graph Representations from both disease groups are 4.4 and 4.25, respectively. By comparison, the evaluation of the implicit knowledge completion on the event chain of actions from the Japanese web-blog corpus of the previous work [13], which is based on the similarity scores of event pairs on the chain that are higher than the thresholds, is 3.0 (on the Likert scale of 1-5) without the CErel consideration between event pairs on the chain.
According to the evaluation of the comprehensible representation of our research on both disease groups, there are a few causal pathways requiring more explicit mediators for a clearer representation, as shown in Example 6.

Conclusions
In this paper, we presented the extraction of causal pathways containing explicit and/or implicit mediators through learning and determining the wrdCoc pairs (CEpairs) having CErel from the downloaded documents of the diabetes and kidney disease group and the heart and artery disease group on Thai hospital web-boards. Each explicit mediator is expressed in the document by an effect/causative-concept EDU represented by the EDU's wrdCoc feature. We also represent the implicit mediators by the explicit mediators within the correctly extracted causal pathways. Given the limited literature on causal pathway extraction from texts, the extracted causal pathways of our research, including the explicit mediator representation, support preliminary causal inference and also help non-professionals understand an etiological pathway, including disease complications, through social media so that they comply with the preventive treatments.
Our proposed method of extracting and representing the causal pathways in terms of the explicit mediators, even where implicit mediators occur in the documents, is based on two steps. (1) The wrdCoc-pair matching between the wrdCoc pairs on the test corpus and the WCP elements through the sliding window on the test corpus for the causal pathway extraction, where each wrdCoc feature is obtained by the wrdCo-expression matching between the wrdCo expressions on the test corpus and the wrdCo expressions in the wrdCo-Concept Table. In addition, the wrdCoc features from the wrdCo-expression matching are based on the v_a, w_{1,d}, and w_{2,e} terms with complete matching as in [29]. Since the precisions of determining wrdCoc pairs having CErel from the learning corpus and the test corpus are consistent, the causal pathways extracted by the wrdCoc-pair matching are strengthened (the WCP elements being obtained by the correct determination of wrdCoc pairs having CErel). (2) The application of the transitive closure to obtain TransCEPair for indicating the implicit mediators on the correctly extracted causal pathways, so that these implicit mediators are represented with the explicit ones from the dynamic template. Regarding the evaluation of the proposed method, the accuracy of determining the wrdCoc pairs having CErel depends on both the diversity of the wrdCoc features (including the diversity of wrdCo expressions) and the dependency between wrdCoc features, which later affect the causal pathway extraction and representation.
In contrast to the previous research, our proposed method provides three contributions. (1) The CErel (cause-effect relation) determination of our research, with high F-scores, is based on a wrdCoc pair representing an event concept pair expressed by two EDUs' verb phrases with the NP1 head noun consideration, whereas the cause-effect relation determinations of the previous research are based on either the NP1-NP2 pairs [5][6][7][9][10][11][12] within one or two simple sentences or the Verb pairs within two EDUs [8]. The event/state occurrences with the causative/effect concepts in our corpus are expressed by verb phrases (which relate to the NP1s' head nouns) more often than by the noun phrases alone that are used in the literature. (2) The causal pathway extraction of our research, with high precisions, is based on the actual event/state occurrences with the causative/effect concepts and also emphasizes the boundary of the sequence of wrdCoc pairs through the wrdCoc-pair matching between the wrdCoc pair of each sliding window on the test corpus and the WCP elements. In contrast, the causal pathways/chains of the previous works rely on either the prediction of the next effect from the previous noun phrase events based on time series [10] or two steps of the cause-effect relation based on noun terms/phrases connected by either the similarity score [11] or the edge weights [12], e.g., 'A causes B' and 'B causes C' are connected by the similarity score of B without considering the boundary of a sequence of event pairs having CErel. (3) Our research applies the transitive closure and the dynamic template to indicate and represent, respectively, the implicit mediators with the explicit mediators on the correctly extracted causal pathways, giving highly concise and clearly comprehensible representations, whereas the previous work [13] on the implicit knowledge completion of event chains is based on the similarity scores between event pairs from the corpus without the CErel consideration. Our method of using the transitive closure and the dynamic template can also be applied to present the implicit knowledge completion of [13].
In the future, the temporal feature and the condition feature should be considered to increase the accuracy of the causal pathway extraction by reducing the wrdCoc diversity in terms of conditional cases. Moreover, the proposed method can also be applied in other languages, and the causal pathway extraction (Figure 7) can provide health literacy for non-professional persons to have clear comprehension of disease complications in order to follow the preventive treatments suggested by the physician.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.