Toward a Multi-Column Knowledge-Oriented Neural Network for Web Corpus Causality Mining

In the digital age, many sources of textual content are devoted to studying and expressing many sorts of relationships, including employer–employee, if–then, part–whole, product–producer, and cause–effect relations (causality). Mining cause–effect relations is a key topic in many NLP (natural language processing) applications, such as future event prediction, information retrieval, healthcare, scenario generation, decision making, commerce risk management, question answering, and adverse drug reaction. Many statistical and non-statistical methods have been developed to address this topic, most of which rely on feature-driven supervised approaches and hand-crafted linguistic patterns. However, implicit and ambiguous statements of causation prevent these methods from achieving high recall and precision; they cover a limited set of implicit causality and are difficult to extend. In this work, a novel MCKN (multi-column knowledge-oriented network) is introduced. The model includes several knowledge-oriented channels/columns (KCs), where each channel integrates prior human knowledge to capture language cues of causation. MCKN uses convolutional word filters (wf) generated automatically from WordNet and FrameNet. To reduce MCKN's dimensionality, we use filter selection and clustering approaches. Our model delivers superior performance on the Alternative Lexicalization (AltLexes) dataset, indicating that MCKN is a simpler and distinctive approach for informal datasets.


Introduction
Causality mining is an important method of artificial knowledge discovery that makes use of unstructured datasets, and it remains a crucial, unsolved challenge for NLP. The underlying semantics, grammar, growing vocabularies, and ambiguous nature of natural language text make causality mining a difficult job. As a result, it has prompted a great deal of academic interest in the last 10 years. The development of ML (machine learning) and DL (deep learning) methods has allowed researchers to develop more effective models. Causality plays a significant role in decision making [1], question answering [2,3], relationships among everyday activities [4,5], event prediction [6,7], and generating future scenarios [8]. Causality exists in a wide range of disciplines including Environmental Sciences [9], Computer Science and Biology [10], Psychology [11,12], Linguistics [13,14], Medicine [15], and Philosophy [16]. Despite some similarities, the terms "causality" and "correlation" have different meanings: a correlation between two entities does not always mean that a change in one is what caused the change in the other. Causality refers to the relationship between two regularly occurring events (e1 and e2) or phenomena (P1 and P2), i.e., the existence of P1 or e1 causes the occurrence of P2 or e2. However, it is challenging to define the term causation in a broader

• When one event or series of related events occurs first and paves the way for the occurrence of subsequent events. When the initial event (cause) occurs, the second event (effect) inherently or definitely follows.

• According to the theory of multiple causations, there are numerous possible causes for a particular event, each of which might be sufficient but not necessary for the effect to occur or necessary but insufficient for the effect to exist.
Causality mining emphasizes the automatic detection/extraction of causality between events in text. As an example, "The 2019 COVID-19 pandemic caused a series of shocking deaths around the world" implies causality between the cause (COVID-19 pandemic) and the effect (deaths). Numerous methods for establishing causation are listed in the literature, including causal knowledge and causal discovery [19,20]. Earlier attempts at causality mining relied on rule-based and machine learning methods. Rule-based methods require carefully planned linguistic features [13,15,21,22]. These methods usually overlook hidden features and sequences of events. They are only able to mine domain-dependent explicit causality from phrases, and they do not take into account the features that point to the presence of dependence relationships. Similarly, in machine learning approaches [4,23–25], semantic, syntactic, and lexical features are constructed by human operators through diligent feature engineering, and causality is automatically determined from large labeled datasets. The effectiveness of a model in these techniques depends on the arrangement and regularities of its features. However, the absence of annotated datasets restricts these methods, which leads to error propagation in the systems.
Currently, the rising demand for deep neural networks such as RNN [26,27], CNN [28], MCNN [29], Transformer block [30], BERT [31], TinyBERT [32], and Hopfield neural network (HNN) [33–35] makes it possible to perform various processing tasks without complex feature engineering. Deep networks play a key role in encoding the linguistic nature of words into fixed-size vectors, lessening the dependence on NLP toolkits [36] by using pre-trained word embedding. A key component of many NLP strategies, pre-trained word embedding offers a number of advantages over embedding trained from scratch [37]. However, because causality in web corpora is ambiguous and implicit, it is still beyond the scope of DL techniques, and most current approaches struggle to engineer features autonomously in implicit and ambiguous datasets. To mine implicit and ambiguous causality in web corpora, we propose a novel deep MCKN model that outperforms the state of the art in its ability to manage the causality problem. The proposed MCKN is based on multiple knowledge-oriented channels: it parses every word in the source segments and the connective (AltLex) and then identifies causation between the segments on both sides of the connective.
According to the proposed paradigm, each channel has a specific input segment or connective that presents a particular set of KCs. Each channel can incorporate linguistic knowledge of cause-and-effect relationships from world knowledge bases (WordNet, FrameNet) by capturing important linguistic cues of causality at the segment and connective level. Each channel uses a variety of convolutional word filters with different window sizes to create numerous unique feature maps. Using max pooling, the convolution results of each filter are further aggregated, and the feature map is mined for the most promising features that can be used for classification. The use of "wf", a pre-trained word embedding generated automatically by Algorithm 1 from the "Bootstraps" corpus, lowers the overall dimensionality of the proposed model. After max pooling, the feature maps of each channel are combined, and dimensionality reduction is applied to the combined feature maps. In the end, four object pairs are produced and sent to the relation network (RN) for further processing, since the RN needs object pairs for relation reasoning. For the same purpose, we also apply WordNet categorical approaches, "wf" selection techniques, FrameNet causal scores, and clustering algorithms to handle redundant and non-discriminative features. The goals and contributions of our research are described below.

Algorithm 1: Automatic word filters' generation
Step 1: Find all the lexical units of 50 causal semantic frames from FrameNet and group them by the number of words (max: 64).

• This study is unique in that it analyzes sentences for causality leveraging web corpora, which include noisier, larger, and more muddled data.
• The uniqueness of the suggested model lies in its first-ever use of multiple KCs and a novel word filter technique, which significantly decreases the model dimensionality.
• The proposed model addresses implicit and ambiguous intra-sentence causality using segment-level and connective-level features.
• This is the first attempt to train in all channels by using convolutional "wf" rather than a data-oriented pre-defined convolutional filter.
• Extensive experiments on publicly available datasets have shown that the MCKN model performs much better than many baseline methods and text classification techniques.
The remainder of this article is structured as follows. Section 2 presents the literature review. Section 3 details the suggested strategy. The entire experimental process is covered in detail in Section 4. Finally, Section 5 summarizes our conclusion.

Literature Review
In terms of causality mining, previous research has mainly been divided into ML and DL methods. The performance gain of DL over ML techniques is significant, and ML approaches normally require sophisticated feature engineering. Among ML approaches, Ref. [38] uses a dependency structure to derive causation event pairs. In [39], causal connectives were used to govern how lexico-syntactic patterns and causal connectives interacted. These connectives were obtained by computing the similarity of sentence syntax-dependent structures through the Restricted Hidden Naive Bayes (RHNB) classifier. In [22], a related monolingual corpus of Simple and English Wikipedia and the PDTB is utilized to integrate world knowledge (WordNet, VerbNet, and FrameNet) to evaluate the correlations across words and segments, though it hardly handles terms that never occur in the learning stage. In [40], conditional text generation networks are proposed to craft possible causes and effects for any free-form textual event. They focus on explicit relations within individual sentences by linking one part of a sentence to another and using generated patterns instead of sentence-level human annotation.
In contrast to ML approaches, deep learning models automatically learn and extract useful features. In NLP, such models use pre-trained word embeddings (Google News, GloVe-6B, GloVe-840B, and Pre-trained Wiki), which play a significant role in encoding syntactic and semantic properties of words into fixed-size vectors to reduce the dependency on NLP toolkits [36]. Two commonly used models in NLP are RNNs and CNNs. CNNs and RNNs [41] have been applied to document and paraphrase classification [42–44] and relation extraction/classification [36,45]. In [46], a variant of CNN, the multi-column convolutional neural network (MCNN), is presented to handle multiple features in the question answering (QA) of candidate answers. An analogue of MCNNs for relational classification is proposed in [47] as the piecewise max-pooling network. In [29], a well-known MCNN model based on [48] is introduced that uses external BK (background knowledge). By utilizing question and response sequences, the MCNN model of [8] enriches causality attention, in contrast with [29].
In [49], an FFNN (feed-forward neural network) is proposed to augment the feature set for identifying causality by computing the distance among event-triggering words and related words in phrases. The work of [50] is closely related to [49], using a novel FFNN with a novel contextual word extension method. They use BK as an event context word extension to extract causal network structures from news articles to classify event causality. This is a challenging job, as tweets are often highly informal and unstructured and lack contextual knowledge. In [51], a TCDF (temporal causal discovery framework) is presented to obtain a temporal causal graph by mining cause-effect relationships in time series datasets. They applied a multi-attention-based CNN with a causal support stage. BERT is a deep pre-trained language representation system using masked language modeling and Transformer blocks, which produced improved results in various NLP tasks, inspired by transfer learning in computer vision.
In [26], a novel knowledge-oriented CNN (K-CNN) is presented for causal relation recognition, which combines a data-oriented channel (DOC) and a knowledge-oriented channel (KOC). The DOC acquires major features of causal relationships from the source data, while the KOC adds prior human knowledge to retain the linguistic clues of causal relationships. The KOC automatically generates convolutional filters from FrameNet and WordNet without the need to train a classifier on a lot of data. Such filters are causal word embeddings. In contrast to statistical, non-statistical, and single-level DL models, deep multi-level models exhibit satisfactory performance. However, they hardly handle implicit and ambiguous causality. In [52], a graph reasoning technique based on document-level context is proposed to recognize event causality. In [53], the SCITE (self-attentive Bi-LSTM-CRF wIth Transferred Embedding) method is presented, which formulates causality extraction as a sequence tagging task by mining causal event pairs and their relationship. Moreover, they use multi-head self-attention [30] to enhance performance.
In [54], a novel approach is proposed that exploits the advantages of both neural model-based approaches and feature engineering. The latest work [55] uses a head-to-tail entity annotation method that expresses the entire semantics of complex cause-effect relationships and clearly identifies entity boundaries in source sentences. They employ entity location perception along with RPA-GCNs (Relation Position and Attention-Graph Convolutional Networks), GATs (Graph Attention Networks), and other techniques. In [56], a generative approach is proposed for extracting cause-effect relationships via encoder-decoder and pointer networks; it improves performance but requires more time to produce results. In comparison to statistical and non-statistical techniques, DL approaches with pre-trained word embedding are more fruitful. However, they depend on a huge training dataset that covers all causality expressions in the source text, which is practically impossible given the diversity and ambiguity of phrases and words in the dataset, so ambiguous and implicit causality remains challenging. Our MCKN is motivated and inspired by [26,29,31,57] for mining implicit and ambiguous causality sentences from publicly available web corpora. Previous methods that leveraged MCNNs for NLP applications used multiple CNN channels with pre-defined convolutional filters for training. Our approach builds on the MCNN concept but, contrary to MCNNs, is based on knowledge-oriented channels that use novel convolutional word filters generated by Algorithm 1. Table 1 provides a more concise description of the reviewed material.

Proposed Approach
This section describes the MCKN model. The model consists of three channels/columns, which deal respectively with the AltLex/connective (L), the segment after the AltLex (AL), and the segment before the AltLex (BL) in the target sentence. MCKN mainly targets implicit and ambiguous causalities, and it uses convolutional word filters instead of pre-trained convolutional filters. Figure 1 shows the architecture of the MCKN model, including (i) the first column dealing with the BL (e1) part of the input sentence, (ii) the second column dealing with the L part of the input sentence, and (iii) the third column dealing with the AL (e2) part of the input sentence. More details about Figure 1 are covered in Section 3.4.

Linguistics Background of Source Corpus
This part discusses the linguistic background of causality and the AltLexes (https://github.com/chridey/altlex, accessed on 3 May 2021) dataset. About 12% of the Penn Discourse Treebank (PDTB) is labeled as causal, and around 26% is implicit [58]. In addition, there exists another type of implicit relation called "AltLex", which represents causality and is marked as an open and infinite class of causality. The generalization of "AltLex" is extended with an open class of markers [22]. Some examples in the "AltLexes" dataset are not present in the explicit relations of PDTB, including ambiguous causal verbs, e.g., "COVID-19 made many countries affected", and partial prepositional phrases, e.g., "He has made aircraft with the idea of a new deep neural technology". In the first example, the term "made" has numerous meanings and is employed to express causation. In the second example, however, the causal relationship expression is not clear. According to our analysis, the constructed parallel data has 1164 causal connectives and about 7627 non-causal connectives. Furthermore, their intersection has 155 types of connectives, which are hybrid. This corresponds to a reliance of 12.6% on the causal set and 1.8% on the non-causal set [22]. Several implicit and heterogeneous relationships are discovered as a result of the analysis. Prior approaches therefore have several shortcomings as a basis for an expert system.

Figure 1. Applied MCKN architecture. This contains three columns using convolutional word filters: the first column processes the BL (e1) part of the input sentence, the second column deals with the L part of the input sentence, and the third column targets the AL (e2) part of the input sentence.

Input Sentence Representations
The input sentence $N$ contains $n$ tokens, $N = \{n_1, n_2, \ldots, n_{i-1}, n_i\}$, where $n_i$ is the token at position $i$ in the sentence. Each sentence is further split into L, AL, and BL. The purpose is to generate a sentence-level prediction $\hat{y}$, where $y$ is the input sentence label shown in Equation (1). For the parallel corpus feature in our model, we employ a pair of Simple and English Wikipedia sentences, although the model still takes only a single sentence as input [43].
Motivated by [30], each token/word in the input sentence is represented by summing its token embedding, position embedding, and segment embedding. As in that earlier work, the segment embedding here indicates which of the segments L, BL, or AL a token belongs to. As shown in Figure 2, the "word2vec toolkit" is first used to pre-train the word embedding with dimension $d_{word}$, the positional embedding with dimension $d_{pos}$, and the segment embedding with dimension $d_{seg}$ for linguistic information. Summing all three embeddings yields the new representation $\acute{N} = \{z_1, z_2, z_3, \ldots, z_{n-1}, z_n\}$, where $z_n \in \mathbb{R}^{d}$ for token $n_i$, and the dimensions of the word, position, and segment embeddings are kept equal, $d = d_{word} = d_{pos} = d_{seg}$. The representation $\acute{N}$ of the input sentence therefore provides the fundamental features for the more complicated networks that follow.
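The summation of the three embeddings can be illustrated with a minimal NumPy sketch. The vocabulary, the dimension d = 8, the maximum length, and the randomly initialised tables are all hypothetical stand-ins (the paper pre-trains its embeddings with word2vec); only the element-wise sum of token, position, and segment embeddings follows the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 8          # shared embedding dimension (hypothetical value)
max_len = 16   # maximum number of positions (hypothetical value)
vocab = {"<pad>": 0, "the": 1, "flood": 2, "led": 3, "to": 4, "damage": 5}
segments = {"BL": 0, "L": 1, "AL": 2}

# Random tables stand in for the pre-trained embeddings of the paper.
tok_table = rng.normal(size=(len(vocab), d))
pos_table = rng.normal(size=(max_len, d))
seg_table = rng.normal(size=(len(segments), d))

def represent(tokens, seg_labels):
    """Return z_i = token + position + segment embedding for every token."""
    tok_ids = np.array([vocab[t] for t in tokens])
    seg_ids = np.array([segments[s] for s in seg_labels])
    pos_ids = np.arange(len(tokens))
    return tok_table[tok_ids] + pos_table[pos_ids] + seg_table[seg_ids]

tokens = ["the", "flood", "led", "to", "damage"]
seg_labels = ["BL", "BL", "L", "L", "AL"]
Z = represent(tokens, seg_labels)  # one d-dimensional row per token
```

Because all three tables share the same dimension d, the sum keeps the representation at one d-dimensional vector per token, which matches the constraint that the three embedding dimensions are equal.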

Relation Network
In visual question answering (V-QA) [59], the relation network (RN) plays a very significant role. It can be efficiently integrated with DL approaches including CNNs (DeepCNN, knowledge CNN, and MCNN) and RNNs (GRU, bi-GRU, LSTM, bi-LSTM) for performance enhancement. The original RN, however, only performs single-step inference, such as A → B rather than A → B → C. For tasks that need multistep relational reasoning, Ref. [60] introduced RNNs that work on graph representations of entities. Furthermore, Ref. [61] made memory networks with RNs capable of complicated reasoning, which changed the computational complexity from nonlinear to linear. However, most of this work targets only text QA and V-QA. Similarly, we use an RN in the proposed model, which takes object pairs from the KCs as input and performs effective relational reasoning.

About Knowledge-Oriented Channel
Encouraged by [26], we apply three different KCs to recognize keywords, cue words, and cue phrases of causality at the connective and segment levels of the input sentence. For the convolution operation, we use "wf" in each channel, which is a variant of the convolutional filter. It is automatically generated from knowledge bases (WordNet, FrameNet) using the linguistic knowledge of causality. Compared with ordinary CNN convolutional filters, the "wf" represents causal relationships more precisely. The weights of the "wf" are pre-trained word embeddings, which can be used without additional training. Using the "wf" approach significantly reduces the number of model parameters and mitigates over-fitting on small data corpora. Figure 1 depicts the proposed network's architecture, which consists of three channels. Each channel has its specific input segment format: segment BL, segment AL, or connective L. The "L" part usually contains the cue phrases, cue words, and keywords for cause-effect relations.
Examples of such connectives include because, as a result, lead to, resulted, due to, and trigger. These connective words lie apart from the BL and AL segments and may affect the performance of a network. In the past, KCs only paid attention to the "L" part of the input sentence (between event e1 and event e2) because it usually carries the causality signals. In the proposed model, however, each segment and the connective has its own KC. To decrease the morphological variation of tokens in each segment, we normalize tokens with the WordNet lemmatizer function and convert each word to lowercase; every word is then converted into a precise input format, as shown in Figure 2. A single knowledge-oriented channel of the proposed model is shown in Figure 3. In the input format, we set the maximum size of L to 8 words and of each of BL and AL to 64 words. Sentences with fewer than 8 words at the "L" level or fewer than 64 words at the segment level are padded with padding characters that have a zero embedding, because a CNN works with a fixed input size.
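The fixed-size input format described above can be sketched as follows. This is a simplified illustration under stated assumptions: the connective span is given as token indices, lemmatization is omitted (only lowercasing is applied), the `<pad>` symbol is a hypothetical placeholder, and truncating segments longer than the limit is an assumption that the text does not specify.

```python
MAX_L = 8      # maximum connective length, as stated in the text
MAX_SEG = 64   # maximum length of BL and AL, as stated in the text
PAD = "<pad>"  # hypothetical padding symbol (mapped to a zero embedding)

def pad(tokens, size):
    """Lowercase, truncate (assumption), and right-pad to a fixed size."""
    tokens = [t.lower() for t in tokens][:size]
    return tokens + [PAD] * (size - len(tokens))

def format_sentence(tokens, start, end):
    """Split a sentence at the connective tokens[start:end] into BL, L, AL."""
    bl = pad(tokens[:start], MAX_SEG)
    l = pad(tokens[start:end], MAX_L)
    al = pad(tokens[end:], MAX_SEG)
    return bl, l, al

sentence = "The flood led to severe damage".split()
bl, l, al = format_sentence(sentence, start=2, end=4)
```

Each of the three outputs would then feed its own knowledge-oriented channel, so every channel always receives an input of the same length regardless of the sentence.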

Word Filters Archive Generation
Word filters are the embeddings of causal words, cue words, and cue phrases that are extracted from the WordNet and FrameNet knowledge bases. WordNet is a large lexical database that categorizes English words into sets of synonyms known as synsets to denote diverse concepts [62]. To reflect their semantic and lexical relationships, all synsets are linked in a hierarchical format. The meaning of every synset is given by a gloss with some examples [26], where Example 1 describes the WordNet elements for the word "cause", which belongs to specific synsets. Similarly, FrameNet is a lexical resource built on frame semantic theory; it organizes English phrases and words into higher-level semantic frames covering a variety of ideas [63]. Each frame is a conceptual arrangement that includes a description of the type of event, relation, or object with a conceptual definition; the participants in the frame, called frame elements; the words that frequently appear in the frame, referred to as lexical units (Lu); and the relationship to other frames. The FrameNet components of the "Causation" frame are described in Example 2 [26].
In the proposed work, 50 causal frames (CF) are identified from FrameNet, including triggering, response, causation, causation_scenario, reason, and explaining_the_facts, as well as 44 frames starting with the word cause. The "Lu" involved in these CF are important clues and frequently occurring words that signal causality in the text; hence these "Lu" can be treated much like cue phrases, clue words, and keywords of causality. To further extend these "Lu" to cover causal words more widely, we automatically construct a bank of causal words. These causal words and their word embeddings are used to set the weights of the convolutional "wf". The "wf" are generated automatically by the improved Algorithm 1 [26]. Such "wf" represent keywords, clue words, and cue phrases of causality more efficiently, and they are more effective than convolutional filters learned from training. Moreover, the weights of these "wf" are static values.
Finally, about 850 uni-gram, 240 bi-gram, and 20 tri-gram "wf" are created. During convolution, each "wf" is convolved with n-grams to obtain the important linguistic clues of causal relationships in the input text, resulting in a sequence of similarity scores. The proposed convolutional method is capable of capturing semantically related causal words beyond those that exist in the "wf" bank. We create several different filters for the L part of the input sentence, where each filter size ranges from 1 to 8 filter words. Similarly, we create different filters for AL and BL, where each filter size ranges from 1 to 64 filter words. The convolutional "wf" for every "Lu" is formatted as $[c_1, c_2, \ldots, c_k]$ in $Lu_j$ ($j = 1, 2, \ldots, 64$), and the weights of the corresponding "wf" are $f = [f_1, f_2, \ldots, f_k]^{T}$, where $f_j \in \mathbb{R}^{e}$ is the word embedding of $c_j$ found by looking it up in the word embedding table $W^{wrd} \in \mathbb{R}^{e \times |v|}$. Further, $f = [f_1, f_2, \ldots, f_k]^{T}$ is convolved with the input text matrix $emb_k = \{w_1, w_2, \ldots, w_{n_1}\}$, where $k$ is the convolutional window size (uni-gram, bi-gram, and n-gram). We follow [26] and modify the convolutional operation of each KC so that each "wf" yields a feature map $m = [m_1, m_2, \ldots, m_{n_1-k+1}]$, where $m_i$ signifies the similarity between the "wf" and the k-gram $w_{kgram} = [w_i, \ldots, w_{i+k-1}]^{T}$ in the input sentence. The improved convolutional method is represented by Equation (2).
In Equation (2), "b" represents the bias term. Rather than using a non-linear function, we divide the CNN convolutional results by the window size $k$. By limiting $f_j$ and $w_{i+j-1}$ (word embeddings) to unit vectors, the resulting value of $m_i$ becomes the cosine similarity between $f$ and $w_{kgram}$. The goal of cosine similarity in feature maps is to give "wf" of different lengths equal importance by creating the same scale for all convolutional window sizes, whereas the conventional method obtains a higher number for a wider window size. The most specific feature is extracted from each feature map using max-pooling to further aggregate the convolutional results. The pooling procedure for each feature map is shown in Equation (3). The largest cosine similarity provides strong cues for the presence of cue phrases and keywords in the text, which is why the maximum value of the feature map is taken.

$$p = \max\{m_1, m_2, m_3, \ldots, m_{n_1-k+1}\} \quad (3)$$

About 900 "wf" are produced by Algorithm 1, which is highly dimensional given the limited training data. These "wf" for causality mining provide a large number of features, some of which may be redundant and irrelevant. To enhance the performance of the model, we use "wf" clustering and selection [26].
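The word-filter convolution and max-pooling described in this subsection can be sketched in NumPy. This is an illustrative reading of the text, not the authors' implementation: the bias term is assumed to be zero, the embeddings are random stand-ins, and the function names are hypothetical. With unit-norm rows, each score is the mean of the per-word cosine similarities over the window, so the scale is the same for every window size.

```python
import numpy as np

def unit(v):
    """Scale every row of v to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def wf_feature_map(f, w):
    """Slide a k-word filter f (k, e) over word embeddings w (n, e).

    Returns m with n - k + 1 entries, where
    m_i = (1 / k) * sum_j f_j . w_{i+j-1}   (bias assumed to be zero).
    """
    k, n = f.shape[0], w.shape[0]
    return np.array([np.sum(f * w[i:i + k]) / k for i in range(n - k + 1)])

def max_pool(m):
    """Keep the strongest match of the filter anywhere in the text."""
    return float(np.max(m))

rng = np.random.default_rng(0)
w = unit(rng.normal(size=(6, 5)))  # 6 words, embedding size 5 (stand-ins)
f = w[2:4].copy()                  # a bi-gram filter identical to words 2-3
m = wf_feature_map(f, w)
p = max_pool(m)
```

In this toy setup the filter matches the text exactly at position 2, so the pooled score reaches the maximum possible similarity of 1, while every other position scores lower.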

Segments and Connective Level Processing
The proposed model presents a novel method to mine causal relationships within a single sentence at the connective and segment level using three KCs and an RN. The input connective L and segments BL and AL are denoted $Z_L \in \mathbb{R}^{J_L \times d}$, $Z_{BL} \in \mathbb{R}^{J_{BL} \times d}$, and $Z_{AL} \in \mathbb{R}^{J_{AL} \times d}$, where $J_{BL}$, $J_{AL}$, and $J_L$ are the token lengths of each segment and the connective. Each channel is responsible for parsing $Z_{BL}$, $Z_L$, and $Z_{AL}$ into a set of objects. Unlike [26,58], MCKN convolves them through a 1D convolutional layer with different window sizes into $k$ feature maps of size $J_{BL} \times 1$, $J_L \times 1$, and $J_{AL} \times 1$, where $k$ is the total number of "wf". After convolution, the feature maps of each segment and of the connective are rescaled into a $k$-dimensional vector via a max-pooling layer, and dimensionality reduction is then applied to the result. Finally, we create a set of objects in Equation (4).
In addition, because RN works with objects, we created four object pairs in Equation (5).
Here, ";" is an operator that concatenates object feature vectors. We can simplify it using the notation in Equation (6), where ' * ' represents a pair-wise operation. For causality candidates, BL * L and L * AL determine the relationship between the cause-effect events and L, while BL * AL and AL * BL infer the direction of causality.
As a result, the simplified form of the object is represented by the Equation (7).
Here $Op \in \mathbb{R}^{4 \times (2k + 2d_g)}$ is the matrix representation of the object pairs. More generally, by expressing the architecture as a mathematical formulation, we derive the final representation (Final_rep $\in \mathbb{R}^{4 d_g}$) at the segment and connective levels in Equation (8).
At the segments and connective level, MCKN transforms segments and connectives into object pairs and then integrates these object pairs for pair-wise inference to discover the relationship between segments and connectives.
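The pairing step can be sketched as follows. This is only an illustration of the concatenation operator and the four ordered pairs named in the text (BL * L, L * AL, BL * AL, AL * BL); the three-dimensional object vectors are hypothetical placeholders and do not reflect the actual object dimension used in the model.

```python
def concat(a, b):
    """The ';' operator: join two object feature vectors end to end."""
    return list(a) + list(b)


def object_pairs(o_bl, o_l, o_al):
    """Build the four ordered pairs BL*L, L*AL, BL*AL, and AL*BL."""
    return [
        concat(o_bl, o_l),   # segment before the connective vs. connective
        concat(o_l, o_al),   # connective vs. segment after the connective
        concat(o_bl, o_al),  # one direction of the candidate relation
        concat(o_al, o_bl),  # the reverse direction
    ]


# Hypothetical objects for the BL segment, the connective L, and the AL segment.
o_bl = [0.9, 0.1, 0.0]
o_l = [0.2, 0.7, 0.1]
o_al = [0.0, 0.3, 0.8]
pairs = object_pairs(o_bl, o_l, o_al)
```

Keeping both BL * AL and AL * BL as separate rows is what lets a downstream pair-wise function treat the two directions of a candidate relation differently.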

Causality Identification
The model identifies causality in each sentence by passing "Final_rep" to an FFN. We use a 2-layer FFN consisting of "dg" units with a ReLU activation, followed by SoftMax for prediction, which is expressed mathematically in Equation (9).
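A forward pass through such a classifier can be sketched in a few lines. The layer sizes, weights, and input below are hypothetical toy values chosen for readability; they are not the trained parameters of the model, and the exact form of Equation (9) may differ.

```python
import math


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def softmax(v):
    """Numerically stable softmax over a list of logits."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def ffn_predict(final_rep, w1, b1, w2, b2):
    """Hidden layer with ReLU, then an output layer with softmax."""
    hidden = relu(linear(final_rep, w1, b1))
    return softmax(linear(hidden, w2, b2))


# Hypothetical toy parameters: 4 inputs -> 2 hidden units -> 2 classes.
final_rep = [0.5, -1.0, 2.0, 0.0]
w1 = [[0.1, 0.2, 0.3, 0.4], [-0.3, 0.1, 0.0, 0.2]]
b1 = [0.0, 0.1]
w2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]
probs = ffn_predict(final_rep, w1, b1, w2, b2)  # [P(class 0), P(class 1)]
```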
The causal and non-causal samples in the AltLexes dataset are markedly imbalanced. With a plain Cross-Entropy (CE) loss function, this imbalance between causal and non-causal examples in the source dataset can lead to unsatisfactory outcomes. Moreover, because the connective and segments of a target sentence contain ambiguous and heterogeneous connectives (make, made, create, construct, etc.), effect keywords (disable, lost, miss, destruction, company, died, etc.), and causal keywords (lack, accident, fire, tupan, tsunami, flood, earthquake, blast, etc.), causality is hard to detect in each sentence.
As a result, the causal and non-causal losses need to be given a soft weight, enabling the model to focus more on ambiguous, implicit, and heterogeneous samples. Inspired by [54,64], we incorporate the focal loss into the loss function [65] by adding a modulating factor $(1 - \hat{y})^{\beta}$ to the CE loss, with a tunable hyperparameter $\beta \geq 0$. In Equation (10), the focal loss $L_n$ is formulated as the objective function, with 'α' denoting the balance weight hyperparameter.
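The effect of the modulating factor can be seen in a small sketch of a binary focal loss. The exact form of Equation (10) is not reproduced here, so the code follows the commonly used formulation; in particular, weighting the causal class by α and the non-causal class by 1 − α is an assumption, as are the function and argument names.

```python
import math


def focal_loss(y_true, p_causal, alpha=0.80, beta=4.5, eps=1e-12):
    """Focal loss for one example.

    y_true   -- 1 for a causal sentence, 0 for a non-causal one
    p_causal -- predicted probability of the causal class
    """
    # Probability the model assigns to the correct class.
    p_t = p_causal if y_true == 1 else 1.0 - p_causal
    # Assumed class-dependent balance weight.
    alpha_t = alpha if y_true == 1 else 1.0 - alpha
    p_t = min(max(p_t, eps), 1.0)  # guard against log(0)
    # (1 - p_t) ** beta shrinks the loss of confidently correct examples.
    return -alpha_t * (1.0 - p_t) ** beta * math.log(p_t)


easy = focal_loss(1, 0.9)  # confidently correct causal example
hard = focal_loss(1, 0.3)  # ambiguous causal example
plain_ce = focal_loss(1, 0.5, alpha=1.0, beta=0.0)  # reduces to cross-entropy
```

With β = 0 and α = 1 the expression reduces to the ordinary CE loss, while larger β values shift most of the training signal onto the hard examples.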

Experimental Settings
In this part, we explore the MCKN model at the sentence level, which combines three knowledge-oriented channels for causality mining.

Hyperparameters and Evaluation Metrics
Hyperparameters: In our implementation, we set the initial learning rate of the proposed model to $1 \times 10^{-2.5}$ and gradually reduce it once the F1 score has stopped improving for more than 6 epochs. During training, we use a batch size of 32 and 15 epochs, and we apply L2 regularization together with a dropout rate of 0.5 to deal with over-fitting. The regularization coefficient is set to $3 \times 10^{-5}$. For the focal loss, we use α = 0.80 and β = 4.5. For optimization, the Adam optimizer [66] is used with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 1 \times 10^{-8}$, together with gradient-norm clipping. We use k = 130 kernels/"wf" with window sizes ranging from 1 to 8 at the "L" level and from 1 to 64 at each of the "AL" and "BL" levels. Table 2 summarizes all hyperparameters for the reader's convenience.
Evaluation metrics: Results are reported in Table 3 using a variety of evaluation metrics, such as precision, recall, and F1-score. Precision (Pr) measures the predictive ability of an algorithm: it is the fraction of instances predicted as positive that are actually positive. 'Pr' is calculated in Equation (11). Here, true positive (TP) is the number of correctly classified positive cases, and true negative (TN) is the number of correctly classified negative cases. False positive (FP) refers to the number of negative instances that were misclassified as positive, while false negative (FN) refers to the number of positive instances that were misclassified as negative. The F-score (F1) summarizes the ability of an algorithm in a single value and is defined as the harmonic mean of sensitivity (recall) and precision. The F value is calculated in Equation (12).
Recall (Rc), or sensitivity, measures how well the model identifies instances that are actually positive, i.e., the fraction of positive instances that are correctly classified. Equation (13) calculates the value of Rc.
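For reference, the three metrics can be computed from the confusion counts as in the sketch below. It uses the standard definitions Pr = TP / (TP + FP), Rc = TP / (TP + FN), and F1 = 2 · Pr · Rc / (Pr + Rc), which Equations (11) to (13) are assumed to follow; the counts in the example are hypothetical and are not results from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts for illustration only.
pr, rc, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
```

Note that TN does not appear in any of the three formulas, which is why these metrics remain informative when non-causal sentences dominate the dataset.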

Baseline Methods
In this section, the baseline approaches are listed: MCNN, K-CNN, DPCNN, and BERT-base. DPCNN [57] is a word-level deep neural network for topic categorization and sentiment classification. It performs downsampling without increasing the number of feature maps, which allows it to represent long-range relationships efficiently. BERT-base [31] is a deep pre-trained language representation model built on masked language modeling and Transformer blocks; encouraged by transfer learning from the computer vision field, it has improved a number of NLP applications. The next notable work in this field is MCNN [29], a multi-column CNN with BK that integrates event causality candidates and their contexts with a related web corpus. K-CNN [26] combines a data-oriented network with a knowledge-oriented network by using convolutional "wf", thereby reducing the overall dimension of the model.

Results
Before reporting the results, we ran each reproducible experiment six times for causality extraction using the train/bootstrapped/dev-test split described in Section 4.1. We then report the average result along with its standard deviation. Table 3 compares the performance of MCKN with state-of-the-art methods in terms of precision, recall, and F1-score on the test set, which is a randomly selected subset of both the train and bootstrapped datasets.
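The aggregation over repeated runs amounts to a mean and a standard deviation per metric, as sketched below. The six F1 values are hypothetical placeholders rather than results from the paper, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption, since the text does not state which variant is reported.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical F1 scores from six repetitions of one experiment.
f1_runs = [85.0, 86.0, 84.0, 85.5, 86.5, 85.0]
f1_mean, f1_std = summarize_runs(f1_runs)
summary = f"F1 = {f1_mean:.2f} +/- {f1_std:.2f}"
```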
Our model performs well, in particular, by learning distinct semantic representations of causation at the connective and segment levels. On the train dataset, compared with the best state-of-the-art feature engineering methods [26,29,31,57], MCKN improves precision by a maximum of 21.58% and a minimum of 5.42%, and the F1-score by a maximum of 33.27% and a minimum of 2.91%. The recall improvement is low, at 0.48%, because the model emphasizes the interchangeability of connectives, whereas parallel examples frequently contain the same connectives, which may be evaluated as false negatives.
Notably, on the bootstrapped train dataset, the proposed work improves precision by a maximum of 17.14% and a minimum of 12.67%, and the F1 score by a maximum of 14.01% and a minimum of 2.64%. Recall improves by a maximum of 17.82% and a minimum of 1.64%, because the bootstrapped train dataset has many more samples of the causal signal than the train dataset. The suggested model achieves the best precision, recall, and F1-score through a novel combination of KCs with "wf", RN, and FFNN, together with the particular combination of hyperparameters employed in the training stage.
In contrast to techniques such as K-CNN [26], DPCNN [57], BERT-base [31], and MCNN [29] with a pre-trained convolutional filter mechanism, the strength of the MCKN model lies in its use of novel KCs with "wf". To the best of our knowledge, this is the first attempt to mine implicit causality in the web corpus using all KC channels with the unique "wf". The "wf" can effectively target the causal relationships in the target sentence while decreasing the number of model parameters. The proposed model performs satisfactorily when applied to single-sentence texts, but it is challenging to apply to texts with multiple sentences. The results suggest that deep knowledge-oriented convolutional techniques are more effective than conventional rule-based, statistical, and convolutional techniques in this area. Unlike text classification, classifying causality is a challenging task that requires strong multi-level relational reasoning abilities. Figure 4 shows the relationship between the number of epochs and model performance on the train dataset, while Figure 5 shows the same relationship on the bootstrapped train dataset.

Effect of Multi-Column KNN
The validation metrics for the "AltLexes" dataset are shown in Table 3. Among the conventional KNN models, we found that the two-column KNN performs slightly better than the single-column KNN because it makes use of multiple convolutional window sizes, which can capture more information on causality from various n-grams. Similarly, by adding more information, the three-column KNN performs better than the two-column KNN. In contrast to K-CNN [26], we present three KCs together with RN. The experimental results show that the multi-column KNN with convolutional "wf" can extract causality more effectively. The performance advantage of the multi-column KNN over a multi-column conventional CNN can be attributed to the following:

• Compared with randomly initialized convolutional filters, the "wf" provide a more precise representation and pay more attention to the cue phrases, cue terms, and keywords of causality; this makes it possible for the model to extract linguistic cues that indicate causation in a sentence more effectively.

• The use of a multi-channel KNN keeps the model from losing key causality properties. By utilizing existing knowledge bases, the KCs are able to identify substantial language cues of causation at the connective and segment levels of the target sentences.

• In contrast to conventional CNNs, KCs have a significantly lower parameter count. This helps to mitigate over-fitting on a limited training dataset.

Strength of the MCKN
To better understand MCKN, we used both the Area under the Precision-Recall Curve (AUPRC) and the Area under the Receiver Operating Characteristic Curve (AUROC) to estimate the specificity and sensitivity of the model. We also evaluated the impact of different word embeddings on performance and the robustness of the model to them. In the past, most tasks were based on one-hot encoding and the word-piece algorithm, in contrast to the pre-trained word embeddings (GloVe-840B, Pre-trained Wiki, and Google News) used by our model. Table 4 shows the AUPRC and AUROC scores for each pre-trained word embedding, demonstrating the effectiveness of the proposed model.

Figure 6 illustrates the analysis of the pre-trained word embeddings more clearly. In Figure 6, the y-axis shows the AUPRC, AUROC, and F-score values used to estimate the specificity and sensitivity of the model.
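As a reminder of what AUROC measures, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen causal example receives a higher score than a randomly chosen non-causal one, with ties counted as one half. This is a generic illustration and not the evaluation code used for Table 4; the labels and scores are hypothetical, and AUPRC is not covered here.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = causal) and predicted causal probabilities.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.5, 0.2, 0.8]
area = auroc(labels, scores)
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a small illustration; a rank-based implementation would be preferable for a full test set.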

Ablation Study
To clarify the contribution of each component of MCKN, this section presents an ablation evaluation over the different training modules of the proposed model. Table 5 reports the results of the different modules of MCKN on the two datasets. On the training dataset, the single-column KNN + RN module reaches a precision (P) of 78.74, a recall (R) of 76.56, and an F1 score (F-1) of 73.85. The two-column KNN + RN improves P by 4.38, R by 2.77, and F-1 by 7.54; the three-column KNN + RN module further improves P by 7.43, R by 3.89, and F-1 by 5.42 compared to the two-column KNN + RN. Similarly, on the bootstrapped dataset, the single-column KNN + RN module reaches a P of 79.23, an R of 82.11, and an F-1 of 80.21. The two-column KNN + RN improves P by 7.96, R by 2.11, and F-1 by 5.6, and the three-column KNN + RN module further improves P by 5.94, R by 7.57, and F-1 by 4.34 compared to the two-column KNN + RN. Based on this analysis, the three-column KNN + RN (MCKN) clearly outperforms the single-column and two-column KNN + RN, because it uses the combined features and knowledge of all channels. This shows that the multi-column KNNs and RN substantially boost the overall performance of the model.
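The gains quoted above can be accumulated to see what they imply for each configuration. The sketch below assumes that every gain is an absolute difference in percentage points relative to the previous configuration; the resulting values are derived from the numbers in the text only and have not been checked against Table 5.

```python
# Single-column KNN + RN scores as quoted in the text.
single = {
    "train": {"P": 78.74, "R": 76.56, "F1": 73.85},
    "bootstrapped": {"P": 79.23, "R": 82.11, "F1": 80.21},
}
# Gains of the two-column module over the single-column module.
two_gain = {
    "train": {"P": 4.38, "R": 2.77, "F1": 7.54},
    "bootstrapped": {"P": 7.96, "R": 2.11, "F1": 5.6},
}
# Gains of the three-column module over the two-column module.
three_gain = {
    "train": {"P": 7.43, "R": 3.89, "F1": 5.42},
    "bootstrapped": {"P": 5.94, "R": 7.57, "F1": 4.34},
}


def implied_scores(dataset):
    """Map each metric to (single, two-column, three-column) scores."""
    rows = {}
    for metric, base in single[dataset].items():
        two = base + two_gain[dataset][metric]
        three = two + three_gain[dataset][metric]
        rows[metric] = (round(base, 2), round(two, 2), round(three, 2))
    return rows


train_rows = implied_scores("train")
boot_rows = implied_scores("bootstrapped")
```

Under this reading, the three-column module would reach a precision of about 90.55 on the training dataset and about 93.13 on the bootstrapped dataset.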

Conclusions
The novelty of this work lies in recognizing ambiguous and implicit causality in the informal "AltLexes" web corpus. Rather than online corpora, the majority of earlier works used more formal newspaper, historical-story, and book corpora that contain explicit causation. They frequently employ feature-driven supervised techniques to target explicit causality and overlook the implicit and ambiguous causation in the web corpus. In this work, a novel MCKN model is proposed that combines more than one KC and is integrated with RN for causality extraction in the unstructured web corpus. MCKN processes each sentence at the connective and segment levels for causal relational reasoning. The proposal employs a new convolutional word filter approach that drastically reduces the number of model parameters. Our model demonstrates the power of inferring complicated causation at the sentence level, in contrast to causality and document classification algorithms. However, detecting implicit and ambiguous causality and the corresponding event pairs across sentences (multi-sentence text) remains a demanding problem. For this task, future development should extend the model with more advanced features and a standardized dataset.