Numerical Markov Logic Network: A Scalable Probabilistic Framework for Hybrid Knowledge Inference

Abstract: In recent years, the Markov Logic Network (MLN) has emerged as a powerful tool for knowledge-based inference due to its ability to combine first-order logic inference and probabilistic reasoning. Unfortunately, current MLN solutions cannot efficiently support knowledge inference involving arithmetic expressions, which is required to model the interaction between logic relations and numerical values in many real applications. In this paper, we propose a probabilistic inference framework, called the Numerical Markov Logic Network (NMLN), to enable efficient inference of hybrid knowledge involving both logic and arithmetic expressions. We first introduce hybrid knowledge rules, then define an inference model, and finally present a parallel inference technique based on convex optimization. Built on a decomposable exp-loss function, the proposed inference model can process hybrid knowledge rules more effectively and efficiently than the existing MLN approaches, and it scales well in the presence of clause explosion. We empirically evaluate the performance of the proposed approach on real data. Our experiments show that, compared to the state-of-the-art MLN solution, it can achieve better prediction accuracy while significantly reducing inference time.


Introduction
In recent years, the Markov Logic Network (MLN) [1] has emerged as a powerful tool for knowledge-based inference due to its ability to combine first-order logic inference and probabilistic reasoning. It has been applied in a wide variety of applications, e.g., knowledge base construction [2][3][4][5] and entity resolution [6]. The state-of-the-art probabilistic knowledge-based systems (e.g., Tuffy [7], ProKB [8], and Deepdive [9]) tackle the problem of MLN inference in two steps: grounding and inference. The grounding step constructs a Markov network from the knowledge rules; it is followed by the inference step, which searches for the Maximum A Posteriori (MAP) probability or marginal probability of the variables.
In many real scenarios, for instance the inference on phone performance shown in Table 1, knowledge rules may involve both first-order logic and arithmetic expressions. However, the existing MLN inference techniques cannot effectively support these hybrid rules due to the following two new challenges:

1. Modeling the integration of logic formulas and arithmetic expressions. We note that the latest approach, Probabilistic Soft Logic (PSL) [10], enables MAP inference on continuous variables over a set of arithmetic rules such as "r2: Performance(p) ≥ 0.2" by treating them as constraints on the prior probability. However, an arithmetic expression (e.g., FastCPU(c) ≥ 0.9) is not a predefined continuous logic variable; thus, it cannot be easily integrated into the objective function defined by PSL. Specifically, even though an arithmetic inequality like "FastCPU(c) ≥ 0.9" in r3 can be regarded as a Boolean variable by PSL, computing the truth value of r3 by the max function used in PSL would render the corresponding objective function non-convex. Since PSL inference is built on convex optimization, applying it to r3 would lead to inaccurate results and convergence failure. Therefore, the existing MLN solutions cannot effectively support the integration of logic formulas and arithmetic expressions.

2. Scalability. Arithmetic expressions usually involve pair-wise numerical comparison. The existing MLN solutions generate the combinations of all the predicate variables in the grounding process. This results in an undesirable quadratic or even cubic explosion of grounded clauses, which can easily render the inference process unscalable. For instance, consider the rule r4 in Table 1: the existing inference solutions would produce n² clauses for n variables. It is worth pointing out that clause explosion not only results in inference inefficiency but also in meaningless inference results. Under clause explosion, the techniques based on Gibbs sampling [11,12] may fail because the sampler becomes trapped in a local state. As shown in our experimental study, the predictions of PSL may become inaccurate because it fails to converge.
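To make the scale of the problem concrete, the following sketch (with a hypothetical set of 100 phone instances) counts the clauses that a pair-wise rule such as r4 produces under naive grounding:

```python
from itertools import product

# Hypothetical set of phone instances for the predicate variables p1, p2 in r4.
phones = [f"p{i}" for i in range(100)]  # n = 100

# Naive grounding: one clause per (p1, p2) combination, i.e., n^2 clauses.
groundings = [(p1, p2) for p1, p2 in product(phones, repeat=2)]
assert len(groundings) == len(phones) ** 2
```

With only 100 instances, the single rule already yields 10,000 grounded clauses; at realistic dataset sizes, this quadratic growth dominates both grounding and inference cost.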

Table 1. Examples of hybrid knowledge rules (Size denotes the number of grounded clauses for n instances).

r2: Performance(p) ≥ 0.2   (Size: n)
r3: FastCPU(c) ≥ 0.9 ∧ HasCPU(p, c) ∧ Memory(p) ≥ 0.8 ⇒ Performance(p)   (Size: n)
r4: Performance(p1) ≥ Performance(p2) ∧ Similarprice(p1, p2) ⇒ Performancecost(p1) ≥ Performancecost(p2)   (Size: n²)

To address the aforementioned challenges, we propose a novel inference framework called the Numerical Markov Logic Network (NMLN). The framework defines the optimization objective of inference as a novel exp-loss function, which can seamlessly integrate logic and arithmetic expressions. We also present an inference approach based on exp-loss function decomposition and convex optimization, and we use the technique of ADMM (Alternating Direction Method of Multipliers) to parallelize the inference process for improved efficiency. The major contributions of this paper can be summarized as follows:
• We propose a novel probabilistic framework for hybrid knowledge inference. We define the hybrid knowledge rules and present the optimization model.
• We propose a scalable inference approach for the proposed framework based on the decomposition of the exp-loss function.
• We present a parallel solution for hybrid knowledge inference based on convex optimization.
• We empirically evaluate the performance of the proposed framework on real data.
Our extensive experiments show that compared to the existing MLN techniques, the proposed approach can achieve better prediction accuracy while significantly reducing inference time.

Related Work
Probabilistic Programming Languages (PPLs) [13] seek to separate model specification from inference and learning algorithms, thus making it easy for end users to construct probabilistic models in a simple style. Recent PPL platforms, including PyMC3 [14], Edward [15], and Pyro [16], require the user to define the model structure, such as a probabilistic graphical model (i.e., a representation of the joint probability distribution for the problem at hand).
The Markov Logic Network (MLN) [1] was originally proposed for combining first-order logic inference and probabilistic reasoning. Based on the original model, several variants and significant improvements have been proposed. For example, Tuffy [7] was the first system that implemented MLN inference with an RDBMS. ProKB [8] proposes a probabilistic knowledge base system allowing uncertain first-order relations, which can dramatically reduce the grounding time cost of Tuffy. Deepdive [9] is also an improvement over Tuffy and has been widely applied in different applications. It provides a powerful knowledge base construction tool and optimizes MLN inference by a combination of statistical inference and machine learning. Our previous work, POOLSIDE [17], proposed a ranking system for commercial products according to their attributes and user comments. Implemented using Deepdive, POOLSIDE provides a naive predefined function to specify the relations between attribute values. The recently proposed variant Quantified Markov Logic Networks (QMLNs) [18] extends the classical MLN with a statistical quantifier ∀*, which provides a form of quantification describing, for example, most, few, or at-least-k thresholds. More recently, Flash [19] exploited the MLN to express the Spatial Probabilistic Graphical Model (SPGM), which can perform SPGM predictions efficiently. The MLN has been widely applied to various areas, including activity recognition systems in smart homes [20], root cause analysis in IT infrastructure [21], and natural language understanding [22], to name a few. Note that these systems were all designed for inference on first-order logic rules and thus cannot effectively support inference on hybrid knowledge rules.
The latest research is mainly focused on applications. MLNClean [23] was proposed for data cleaning and is able to clean both schema-level and instance-level errors. The authors of SMLN [24] proposed a framework with native support for spatial data. The paper [25] proposed R-KG, a robot intelligent service that reasons about knowledge based on a Markov logic network.
On the issue of probabilistic reasoning, research on the MLN mainly focuses on two aspects: inference optimization and model learning. The traditional MLN-based inference techniques suffer from the issue of scalability due to their dependence on the generative model, which embeds all the data and targets in a single model. The lifted inference technique [26] was proposed to simplify the MLN network by exploiting symmetry in the model. The authors of [27] proposed a technique to enable large-scale parallel inference by making Gibbs sampling work on divided networks. The authors of [28] also proposed a query-driven technique that can leverage the local network for query prediction. Moreover, in our previous work POOLSIDE [17], we also proposed an improved query-driven inference algorithm, which exploits the information in the known neighbors to predict the query node. Ground Network Sampling (GNS) [29], proposed in 2016, offers a new instantiation perspective, which can ground from a set of sampled paths at inference time; thus, GNS offers better scalability compared to the MLN. Model learning for the MLN includes parameter learning and structure learning. Parameter learning aims to find the optimal weights for a set of rules. This is usually achieved by optimizing different metrics of the objective function [30][31][32]. Structure learning instead aims to learn both logic formulas and their weights, using a top-down [33] or bottom-up [34] search strategy to find formulas. The authors of [35] proposed a functional-gradient boosting algorithm that learns parameters and structure simultaneously. Since feature representation using neural networks has received much attention from researchers in various domains, neural Markov logic networks [36] have also been proposed to learn implicit representations of rules using neural networks instead of explicit rules specified by humans.
To represent fuzzy logic, the MLN models have been extended from the binary field to the continuous field. The hybrid MLN [37] defines and reasons about the soft equality and inequality constraints for first-order relations. Probabilistic Soft Logic (PSL) [10] extends binary variables in the MLN into the continuous range [0, 1]. PSL uses Lukasiewicz logic [38] to compute the truth values of logic clauses. Moreover, PSL allows users to define arithmetic rules, which can be interpreted as constraints on the variables, and transforms the MAP inference into a convex optimization problem. With the help of ADMM [39], the inference can be effectively parallelized and scaled up well to the data size.
However, PSL cannot effectively support inference on hybrid knowledge rules, and its inference techniques cannot address the clause explosion issue.

Hybrid Knowledge Rules
A first-order relation consists of a predicate and several predicate variables, e.g., "relation(y1, y2)", where "relation" is called a predicate, which represents the relationship between variables, while y1 and y2 are called predicate variables. If we replace the predicate variables of a relation with instance data, the relation is said to be grounded. In our inference system, each grounded relation is regarded as an inference variable or as evidence, which has a truth value in the interval [0, 1] indicating whether the relation holds (equal to one) or not (equal to zero).
A hybrid knowledge rule involves both arithmetic and logic expressions. Formally, we define a hybrid knowledge rule by extending the definition of the knowledge rule [10] as follows:

Definition 1.
Suppose that x denotes the set of first-order relation variables and ℓ(x) denotes a linear function, which consists of variables in x. A hybrid knowledge rule, r, can be represented in the disjunctive form r = t1 ∨ t2 ∨ · · · ∨ tk, where each ti denotes a term of one of the following three types: (1) ti is a linear arithmetic expression, i.e., an inequality of the form ℓ(xi) ≥ 0, where xi denotes its numerical variables; (2) ti is a logic expression, and xi denotes its variables; or (3) ti is the negation of a term of type (1) or (2).

Inference Framework
To introduce our inference framework, we first define the knowledge inference problem as follows:

Definition 2. Suppose that r denotes the set of knowledge rules, x denotes a set of variables (including the set of inference variables V and the set of evidence Λ), and Φj denotes a function defined over the variables x, which represents the constraint based on the rule rj ∈ r. The knowledge inference problem is to find a solution V* for the variables, such that:

V* = argmin_{V ∈ [0,1]^n} ∑_{rj ∈ r} wj · Φj(x),    (2)

where wj denotes the weight of rule rj.

In order to define Φj, we use Lukasiewicz logic [38], which extends binary variables to the continuous field [0, 1], to represent the logic formula. Lukasiewicz logic transforms the logic operators in the following manner:

x1 ∧ x2 = max{x1 + x2 − 1, 0},    (3)
x1 ∨ x2 = min{x1 + x2, 1},    (4)
¬x1 = 1 − x1.    (5)

Note that the latest approach, PSL, can handle clauses containing only logic formulas. Based on Lukasiewicz logic, PSL transforms a logic formula into a linear inequality:

ℓ(x) ≤ 0,    (6)

where ℓ(x) is a linear function, which defines the distance of a constraint from being satisfied. Given a logic formula (rule) r in disjunctive form, let I− ⊆ x and I+ ⊆ x denote the sets of variables with and without the negation prefix "¬", respectively. Formally, the linear function ℓ(x) can be represented by:

ℓ(x) = 1 − ∑_{xi ∈ I+} xi − ∑_{xi ∈ I−} (1 − xi).    (7)

Based on this transformation, PSL defines a Hinge-Loss Markov Random Field (HL-MRF), which extends the MLN to the continuous field. The loss function for each clause can be formally represented by:

φ(x) = (max{ℓ(x), 0})^p,    (8)

where x denotes the vector of variables, p denotes a user-defined parameter, and ℓ(x) denotes the linear function shown in Equation (7). Unfortunately, the loss function defined in Equation (8) cannot handle a hybrid knowledge rule involving both a logic formula and arithmetic inequalities. It can be observed that directly modeling the inference of hybrid knowledge rules by Equation (8) would render the corresponding objective function non-convex.
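The Lukasiewicz operators and the PSL distance to satisfaction described above can be sketched in a few lines; the function and variable names below are our own illustration, not PSL's API:

```python
def l_and(x1, x2):
    # Lukasiewicz conjunction over truth values in [0, 1].
    return max(x1 + x2 - 1.0, 0.0)

def l_or(x1, x2):
    # Lukasiewicz disjunction.
    return min(x1 + x2, 1.0)

def l_not(x):
    # Lukasiewicz negation.
    return 1.0 - x

def distance_to_satisfaction(pos, neg):
    """PSL-style distance of a disjunctive clause from satisfaction:
    ell(x) = 1 - sum(x_i in I+) - sum(1 - x_i for x_i in I-); the clause is
    satisfied when ell(x) <= 0, so max(ell, 0) measures its violation."""
    ell = 1.0 - sum(pos) - sum(1.0 - x for x in neg)
    return max(ell, 0.0)
```

For example, a clause with one fully true positive literal has distance zero, while a clause with a weak positive literal (0.2) and a strongly negated literal (0.9) is violated by 0.7.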
To integrate all terms of a hybrid rule into one function, we consider the truth value of each arithmetic expression (inequality) as a continuous logic variable in the interval [0, 1], which is consistent with its semantics and with logic propositions. Formally, we define the truth value for a linear inequality ℓ(x) ≤ 0, where ℓ(x) = ∑ βxi · xi + c, as follows:

t(ℓ(x) ≤ 0) = min{ max{ 1 − ℓ(x)/sup(ℓ), 0 }, 1 },    (9)

where sup(ℓ) = ∑_{βxi > 0} βxi + c denotes the sum of all positive variables' coefficients βxi and the constant c. Note that a linear inequality ℓ(x) ≥ 0 can be equivalently transformed into −ℓ(x) ≤ 0. Figure 1a demonstrates the functional relation between a linear function's value and its truth value. As shown in the figure, with a linear inequality normalized by its supremum, its truth value is equal to the maximal value of one when the inequality is satisfied, and it decreases to zero as the violation reaches the maximum level. It is noteworthy that the truth value defined in Equation (9) is consistent with the PSL transformation with regard to Equations (3) and (4). For the negation operator, we define:

t(¬(ℓ(x) ≤ 0)) = 1 − t(ℓ(x) ≤ 0).    (10)

Our inference framework then defines a linear function for a hybrid rule as follows:

ℓr(x) = 1 − ∑_{xi ∈ I+} xi − ∑_{xi ∈ I−} (1 − xi) − ∑_{ℓ ∈ L} t(ℓ),    (12)

where L denotes the set of linear inequalities in the rule. Note that the hybrid rules can be directly converted to the PSL loss function formulated in Equation (8) by replacing the linear function in Equation (7) with Equation (12), such that hybrid-rule inference can be solved by PSL, as we did in our empirical evaluation study. However, such an inference approach suffers from the clause explosion problem discussed in the Introduction. To solve the problem of clause explosion, we instead define an exp-loss function to measure the violation of a rule. Let ℓr(x) denote the linear function in Equation (12) and α > 1 denote the base argument, which can be e or another constant. The exp-loss function is defined by:

φr(x) = α^{ℓr(x)}.    (13)

Lemma 1. The exp-loss function defined in Equation (13) is convex.

Proof.
A twice-differentiable function is convex iff its Hessian matrix is positive semi-definite. Take φ(x) = α^{ℓ(x)}, where ℓ(x) = ∑ βi · xi + c is linear with coefficient vector β. Computing the Hessian gives:

∇²φ(x) = (ln α)² · α^{ℓ(x)} · β βᵀ,

and we see that it is positive semi-definite, because (ln α)² · α^{ℓ(x)} > 0 and, for any vector λ,

λᵀ β βᵀ λ = (βᵀ λ)² ≥ 0.

It is worth pointing out that we chose the exp-loss function to measure the violation of a rule for the following reasons:

1.
The exp-loss is a natural extension to the hinge-loss function defined in Equation (8).
The exponential power ℓ(x) guarantees a greater loss when a violation of the rule occurs. On the other hand, it can be observed that even though the function is not zero when the rule is satisfied (e.g., if α = e, the loss is e^−1 if the rule is satisfied), the value of the exp-loss and its gradient become very small in the negative interval, so the exp-loss can be considered a soft version of the max() constraint.

2.
As shown in the following section, the exp-loss function enables scalable inference based on function decomposition. It can effectively address the challenge of the explosion of grounded clauses.
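As a minimal illustration of the truth value of a linear inequality and the exp-loss of a rule exponent, consider the following sketch. The coefficient encoding is our own assumption (an inequality a ≥ b is encoded as ℓ(x) = b − a ≤ 0), and sup(ℓ) is assumed to be positive:

```python
import math

def truth_value(coeffs, c, x):
    """Truth value of the linear inequality sum(b_i * x_i) + c <= 0 over
    variables x in [0, 1], normalized by sup(ell) = (sum of the positive
    coefficients) + c, assumed positive. Returns 1 when the inequality is
    satisfied and decays linearly to 0 at maximal violation."""
    ell = sum(b * xi for b, xi in zip(coeffs, x)) + c
    sup = sum(b for b in coeffs if b > 0) + c
    return min(1.0, max(0.0, 1.0 - ell / sup))

def exp_loss(ell, alpha=math.e):
    # Exp-loss of a rule with linear exponent ell; alpha > 1 is the base.
    return alpha ** ell

# "Fr(y1) >= Fr(y2)" encoded as ell = Fr(y2) - Fr(y1) <= 0, i.e.,
# coefficients (-1, +1) over x = (Fr(y1), Fr(y2)) and c = 0.
satisfied = truth_value([-1.0, 1.0], 0.0, [0.9, 0.3])  # inequality holds
violated = truth_value([-1.0, 1.0], 0.0, [0.3, 0.9])   # violated by 0.6
```

In the satisfied case the truth value clips to one; in the violated case it drops to 0.4, reflecting the degree of violation.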
Let V denote the set of unknown variables for inference. Given a set of hybrid knowledge rules r and the weight wj with respect to rj ∈ r, the inference target is to minimize the sum of all weighted loss functions generated by all clauses as follows:

V* = argmin_{V ∈ [0,1]^n} ∑_{rj ∈ r} wj ∑_{xi ∈ g(rj)} α^{ℓj(xi)},    (15)

where g(·) denotes the operation of grounding and xi denotes the set of variables in the i-th clause. According to Equation (15), each rule rj contributes |g(rj)| clauses to its loss, which are generated by replacing the predicate variables in the first-order relations with the possible instances in the data. This process is known as grounding in the existing MLN solutions and is usually implemented by a series of database join operations. Our framework performs grounding for inference optimization, while the MLN performs grounding to generate a factor graph.

Decomposition of Exponential Loss Function
In the scenario of hybrid knowledge inference, grounding the rules that involve numerical comparison between two predicate variables, such as the term "Performance(p1) ≥ Performance(p2)", can easily result in clause explosion. To address this issue, our solution first decomposes the rule relations into groups and then grounds them separately. We illustrate the process with a simple example as follows:

Example 1. Given the rule Frequency(y1) ≥ Frequency(y2) ⇒ FastCPU(y1) ≥ FastCPU(y2), its loss function (according to Equations (12) and (13)) can be represented by:

φ(y1, y2) = α^{ℓ1(Fr(y1), Fc(y1)) + ℓ2(Fr(y2), Fc(y2)) + c},    (16)

where Fr and Fc denote the predicates Frequency and FastCPU, respectively, and ℓ1 and ℓ2 collect the linear exponent contributions of the groundings of y1 and y2. The total loss of the rule is estimated by the sum of all the grounded loss functions as follows:

loss = ∑_{i=1}^{n} ∑_{j=1}^{n} α^{ℓ1(Fr(yi), Fc(yi)) + ℓ2(Fr(yj), Fc(yj)) + c}.    (17)

It is noteworthy that the total sum of the loss can be decomposed into:

loss = α^{c} · ( ∑_{i=1}^{n} α^{ℓ1(Fr(yi), Fc(yi))} ) · ( ∑_{j=1}^{n} α^{ℓ2(Fr(yj), Fc(yj))} ).    (18)

Suppose that yi has n instances. Compared to the original form in Equation (16), which requires a computational time of O(n²), computing the loss function in Equation (18) only requires O(2n).
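The decomposition in Example 1 rests on the identity ∑_{i,j} α^{f_i + g_j} = (∑_i α^{f_i}) · (∑_j α^{g_j}). A quick numerical check of this identity, with arbitrary made-up exponent values:

```python
import math

alpha = math.e
n = 200
# Made-up linear exponent contributions for the groundings of y1 and y2.
f = [0.01 * i for i in range(n)]
g = [0.02 * i for i in range(n)]

# Naive grounding: O(n^2) terms, one per (y1, y2) pair.
naive = sum(alpha ** (fi + gj) for fi in f for gj in g)

# Decomposed form: product of two O(n) sums.
decomposed = sum(alpha ** fi for fi in f) * sum(alpha ** gj for gj in g)
```

Both expressions agree to floating-point precision, but the decomposed form evaluates 2n exponentials instead of n².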
In the general case, where the hybrid rules may contain facts and share common variables, the decomposition may be more complicated. Formally, we define the irreducible groups as follows.

Definition 4.
Suppose that a rule r contains the relations R = {R1, · · · , Rm} and yi = {yi1, · · · , yik} denotes the variables in Ri. We call Ri irreducible if ∀Rj ≠ Ri, yi ⊄ yj; otherwise, there exists a relation Rj with yi ⊆ yj, and Ri can be reduced to Rj. An irreducible group consists of an irreducible relation Ri and all the relations reducible to Ri, and we denote it by R̂i. The set of predicate variables shared by two or more irreducible relations is called a joint variable set, denoted by S.
For the decomposition of the exp-loss function, we first split a hybrid rule into multiple irreducible groups. We sketch the procedure for identifying all the irreducible groups and their joint variables in Algorithm 1. For each relation Ri, we can find its irreducible group R̂i if the relation exists in the groups. Note that a relation might be reducible to more than one irreducible group; however, it can only be assigned to one group. The algorithm simply assigns it to the first irreducible group it meets. An illustrative example of how to split a set of relations into irreducible groups is also shown in Figure 2. In the example, the relations R(y1) and R(y3) can be reduced to the relations R(y1, y2) and R(y2, y3), respectively. The splitting operation results in, in total, three irreducible groups. It can be observed that the relations R(y1, y2) and R(y2, y3) share the variable y2, and R(y4) is disjoint from both R(y1, y2) and R(y2, y3).

Algorithm 1: Find irreducible groups and joint variables.
Input: relation set R = {R1, · · · , Rm} and predicate variable sets yi = {yi1, · · · , yik} with respect to Ri. Output: irreducible groups R̂ and joint variable set S.

Now, we are ready to describe how to leverage the irreducible groups R̂ for decomposition optimization. In the proposed inference framework, each first-order relation is represented by a linear function in the exponent of a loss function. Suppose that R̂ has k irreducible groups, denoted by R̂j. Then, the linear function ℓ(x) can be split into k + 1 parts {ℓ1, · · · , ℓk, ℓc}, where ℓj contains the variables and their coefficients corresponding to the relations in R̂j, and ℓc is the constant part. Therefore, the loss function can be reformulated as follows:

loss = ∑_{g(r)} ∏_{j=1}^{k} α^{ℓj(xij)} · α^{ℓc},    (19)

where xij denotes the variables with respect to the i-th grounded relation in R̂j. To decompose the loss function, we first split all the clauses g(r) that share the same grounded relations on the set of joint variables S into partitions. In each partition, the grounded clauses are the combinations of all variables in the irreducible groups. As a result, the sum of the clauses in a partition can be represented by the product of the per-group sums. Without loss of generality, we assume that all irreducible relations have n instances and that the set of joint variables S has θ instances. The decomposed loss function can be stated as follows:

loss = ∑_{s=1}^{θ} α^{ℓc} ∏_{j=1}^{k} ∑_{i=1}^{n} α^{ℓj(xij)}.    (20)

Now, we estimate the complexity of the loss computation. The original form computes all combinations of clauses of the irreducible relations, which takes O(θn^k). As shown in Equation (20), our proposed technique of function decomposition reduces the computational complexity from O(θn^k) to O(θnk).
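The grouping step can be sketched as follows; this is our own illustrative implementation in the spirit of Algorithm 1, using Python sets for variable sets and assigning each reducible relation to the first irreducible group found, as the text describes:

```python
def find_irreducible_groups(relations):
    """relations: dict mapping a relation name to its set of predicate
    variables. A relation is treated as irreducible here when its variable
    set is not a proper subset of any other relation's variable set."""
    names = list(relations)
    irreducible = [r for r in names
                   if not any(relations[r] < relations[s] for s in names if s != r)]
    groups = {r: [r] for r in irreducible}
    for r in names:
        if r in irreducible:
            continue
        for s in irreducible:          # assign to the first matching group
            if relations[r] <= relations[s]:
                groups[s].append(r)
                break
    joint = set()                      # variables shared by >= 2 irreducible relations
    for i, r in enumerate(irreducible):
        for s in irreducible[i + 1:]:
            joint |= relations[r] & relations[s]
    return groups, joint

# Relations from the Figure 2 example: R(y1), R(y1,y2), R(y2,y3), R(y3), R(y4).
relations = {"R_y1": {"y1"}, "R_y1y2": {"y1", "y2"},
             "R_y2y3": {"y2", "y3"}, "R_y3": {"y3"}, "R_y4": {"y4"}}
groups, joint = find_irreducible_groups(relations)
```

On the Figure 2 example, this yields three irreducible groups headed by R(y1,y2), R(y2,y3), and R(y4), with the joint variable set {y2}.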
It is noteworthy that the grouped loss function is simply a reformulation of the original loss function: each rule in the form of Equation (19) can be converted to Equation (20). According to Equations (12), (13) and (15), the expansion of the loss function is a sum of exponential functions, each with a linear exponent. As a result, the loss function is convex, and our proposed method can effectively find the global optimum. We now present the entire process of hybrid knowledge rule inference in Algorithm 2. The algorithm first generates the variables that represent the first-order relations in the dataset and then grounds the clauses for each rule rj in the form of decomposed exp-loss functions. Finally, we use the ADMM algorithm introduced in the following subsection to optimize the sum of the losses over all knowledge rules.

Algorithm 2:
Inference of hybrid knowledge rules.
Input: set of hybrid knowledge rules r, relation set R = {R1, · · · , Rm}, predicate variable sets yi = {yi1, · · · , yik} with respect to Ri, and the instances of dataset D. Output: solution V* ∈ [0, 1]^n for the inference variables V.
Generate the set of variables x according to R and D;
for rj ∈ r do
    Find the irreducible groups and joint variables for rj by Algorithm 1;
    Generate the loss(rj) for rj (grounding) in the form of Equation (20);
end
Find the optimal solution V* = argmin_{V ∈ [0,1]^n} ∑_{rj ∈ r} loss(rj).

We provide an example comparison between our framework and PSL in Figure 3. In this example, we selected a hybrid knowledge rule used in our experimental study on entity linking to demonstrate the loss functions in three scenarios: the original PSL hinge-loss and the exp-loss with and without loss decomposition. As shown in the figure, the rule consists of two first-order relations, and each relation has three instances in the dataset. Since PSL does not support hybrid rule inference, we show its loss function when the linear inequality is directly regarded as a logical variable. It is easy to observe that the original PSL loss function cannot guarantee convexity. The decomposed exp-loss function reduces the number of clauses from 3² to 2 × 3.

Parallel Optimization
Our decomposition-based method can effectively compute the loss function proposed in Equation (19). In this subsection, we demonstrate how to incorporate our method into the optimization process. In order to achieve efficient inference, we use parallel optimization based on the ADMM algorithm. ADMM is a distributed optimization technique that focuses on solving large-scale convex optimization problems. It is generally applicable to loss functions of the form f(x) = ∑_h f_h(x_h), where each f_h is a convex function. The main idea of ADMM is to replace the variables in each term with independent local variables and to add constraints on these variables via the augmented Lagrange method. ADMM iteratively optimizes the local variables and updates the consensus global variables until they converge. More details about ADMM optimization are given in [39].
The total loss function of the inference is the sum of all clauses in the form of decomposed exp-loss functions. For simplicity, we define:

P_h(x_h) = α^{ℓc} ∏_{j=1}^{k} ∑_{i=1}^{n} α^{ℓj(x_ij)}    (21)

as a term of the loss function, such that the total loss is the sum of all terms, which can be formulated as follows:

loss = ∑_{h=1}^{H} P_h(x_h),    (22)

where H is the number of terms in the loss function. By reformulating the optimization problem with local variables and the related constraints via the augmented Lagrange function, ADMM transforms the MAP problem into:

min_{x ∈ [0,1]^n, z} ∑_{h=1}^{H} ( P_h(z_h) + γ_hᵀ(z_h − x_h) + (ρ/2) ‖z_h − x_h‖²₂ ),    (23)

where z_h denotes a copy of the variables in x_h, x_h denotes the variables in x that correspond to z_h, γ_h denotes the vector of Lagrange multipliers, and ρ > 0 denotes the step-size parameter. Each set of local variables in z is independent of the others, such that for any two sets of local variables z_h and z_h′, z_h ∩ z_h′ = ∅. The optimization process iteratively updates the following three blocks until it converges:

x_i^{t+1} = (1/K_i) ∑_{h: x_i ∈ x_h} ( z_{h,i}^{t} + γ_{h,i}^{t}/ρ ),    (24)
z_h^{t+1} = argmin_{z_h} P_h(z_h) + (γ_h^{t})ᵀ(z_h − x_h^{t+1}) + (ρ/2) ‖z_h − x_h^{t+1}‖²₂,    (25)
γ_h^{t+1} = γ_h^{t} + ρ ( z_h^{t+1} − x_h^{t+1} ).    (26)

The optimization process converges when the local variables converge to the global variables and the global variables converge between iterations. Specifically, the two convergence conditions can be represented by:

‖r^t‖₂ = ( ∑_{h} ‖z_h^t − x_h^t‖²₂ )^{1/2} ≤ ε_pri    (27)

and:

‖s^t‖₂ = ρ ( ∑_{i=1}^{m} K_i (x_i^t − x_i^{t−1})² )^{1/2} ≤ ε_dual,    (28)

where m denotes the total number of variables in x, ‖r^t‖₂ and ‖s^t‖₂ denote the primal residual and dual residual, respectively, ε_pri and ε_dual are feasibility tolerances for the primal and dual feasibility conditions, and K_i is the number of local copies of variable x_i.
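The consensus-ADMM update pattern can be illustrated on a toy problem. The sketch below deliberately replaces the exp-loss with quadratic local objectives (z − a_h)², so that the local update has a closed form; it illustrates the update pattern only, not the paper's actual solver:

```python
# Toy consensus-ADMM sketch. Each term h owns a local copy z_h of one shared
# variable x and a quadratic local objective (z - a_h)^2; the quadratic stands
# in for the exp-loss term so that the local update is in closed form.
a = [0.0, 1.0, 2.0]          # hypothetical per-term targets
rho = 1.0                    # step-size parameter
z = [0.0] * len(a)           # local variables (copies)
gamma = [0.0] * len(a)       # Lagrange multipliers
x = 0.0                      # consensus (global) variable

for _ in range(300):
    # Local update: argmin_z (z - a_h)^2 + gamma_h*(z - x) + (rho/2)(z - x)^2.
    z = [(2 * ah - gh + rho * x) / (2 + rho) for ah, gh in zip(a, gamma)]
    # Consensus update: average the local copies shifted by their multipliers.
    x = sum(zh + gh / rho for zh, gh in zip(z, gamma)) / len(z)
    # Multiplier update.
    gamma = [gh + rho * (zh - x) for gh, zh in zip(gamma, z)]

# x converges to the minimizer of sum_h (x - a_h)^2, i.e., the mean of a.
```

In a parallel implementation, the local updates are independent across terms and can be computed concurrently; only the consensus averaging requires synchronization.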
Our optimization takes the same steps as shown in Equations (24) and (26). For Equation (25), we follow the traditional ADMM practice of applying the optimization to each clause in parallel. However, for each clause, our method does not find the exact minimum at each iteration. Instead, it iteratively updates each local variable to its minimal value while fixing the values of the other variables. The gradient ∇loss(z) is the vector composed of the first derivative with respect to each element. Since the local variables are independent across terms, the per-term objective from Equation (23) is:

P̄_h(z_h, γ_h, x_h) = P_h(z_h) + γ_hᵀ(z_h − x_h) + (ρ/2) ‖z_h − x_h‖²₂,    (29)

such that we only need to demonstrate the gradient for a single term. Let z_ij denote a local variable that belongs to the irreducible group R̂_j. The gradient of P_h with respect to z_ij can be represented by:

∂P_h/∂z_ij = ( ∏_{j′ ≠ j} ∑_{i=1}^{n} α^{ℓ_{j′}(z_ij′)} ) · α^{ℓc} · ln α · β_ij · α^{ℓ_j(z_ij)},    (30)

where β_ij denotes the coefficient of z_ij in ℓ_j. For the computation of the first term ∏_{j=1}^{k} ∑_{i=1}^{n} α^{ℓ_j(z_ij)} · α^{ℓc}, it is obvious that all z_ij in the same group R̂_j share the same product over the other k − 1 groups. Let f_j denote ∑_{i=1}^{n} α^{ℓ_j(z_ij)}, such that:

P_h(z) = α^{ℓc} ∏_{j=1}^{k} f_j.    (31)

In order to compute the gradient, we first compute the product P_h(z) and then compute the gradient for each variable as follows:

∂P_h/∂z_ij = ( P_h(z) / f_j ) · ln α · β_ij · α^{ℓ_j(z_ij)}.    (32)

Equation (32) significantly reduces the computation in the optimization process by sharing the product P_h(z) across all variables.
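The shared-product trick can be checked numerically. The sketch below takes gradients with respect to the exponents ℓ_j(z_ij) directly (the coefficient factor β_ij from the chain rule is omitted for brevity), with made-up exponent values:

```python
import math

alpha = math.e
k, n = 3, 4
# Made-up exponent values ell[j][i] = ell_j(z_ij) for k groups of n variables.
ell = [[0.1 * i - 0.3 * j for i in range(n)] for j in range(k)]

f = [sum(alpha ** e for e in row) for row in ell]   # f_j = sum_i alpha^ell_j(z_ij)
P = math.prod(f)                                    # shared product P(z)

# Gradient w.r.t. each exponent ell_j(z_ij): reuse P / f_j instead of
# recomputing the product over the other k - 1 groups for every variable.
grad = [[(P / f[j]) * math.log(alpha) * (alpha ** ell[j][i])
         for i in range(n)] for j in range(k)]

# Naive check for one entry: d/d ell[0][1] of prod_j sum_i alpha^ell[j][i].
naive = math.log(alpha) * (alpha ** ell[0][1]) * f[1] * f[2]
```

Computing P once and dividing out f_j gives each gradient entry in constant time, instead of re-multiplying k − 1 group sums per variable.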

Experimental Study
In this section, we empirically evaluate the performance of the proposed solution in a comparative study. We compare the NMLN to PSL, the state-of-the-art technique for soft logic inference. PSL has been empirically shown to have the best performance on MLN inference among the existing solutions. More importantly, to the best of our knowledge, it is the only technique able to infer hybrid knowledge rules, even though it cannot solve the issue of clause explosion. To enable PSL inference on hybrid knowledge rules, we replace the linear function in Equation (7) with Equation (12), such that a rule can be converted to a linear function, which can then be solved by PSL inference. It is noteworthy that Gibbs-sampling-based methods such as Deepdive fail on hybrid rules due to the existence of extremely high-probability states: the sampler becomes trapped in a local state and requires an unacceptable amount of time to sample the correct distribution. We evaluated the performance of the different techniques on two real applications: mobile phone ranking and entity linking. We show the statistics of the datasets in Table 2.

Comparative Study
In the comparative study, we set the number of parallel threads to 6, ε_pri = 10⁻³, and ε_dual = 10⁻⁵ in all experiments. For the NMLN, we set the base of the exponential function to α = e and the step size to ρ = 0.5 by default.
Mobile phone ranking: In this experiment, we rank various mobile phones by performance for users. Since the performance evaluation of mobile phones is to some extent a subjective problem, it is difficult to obtain the ground truth. Therefore, we extracted the phone ranking list from a well-known benchmark website (available online: https://benchmarks.ul.com/, accessed on 3 June 2018), which also lists the specific details of the phones, such as the CPU, memory, and size. We considered the positions of the phones in the ranking list as annotations for evaluating the inference results. The test dataset contained 899 smart phones. We define an average-distance metric to evaluate the quality of the inference results, where r denotes the ranking produced by inference and r* denotes the annotated ranking list; the metric takes its maximal value of one when the inference results are exactly the same as the annotations. We defined six rules, presented in Appendix A, for performance inference. The detailed results are presented in Table 3. They evidently show that the NMLN achieves prediction accuracy similar to PSL while requiring significantly less inference time. The two methods have similar accuracy because the rules used in this task are simple; thus, PSL can also produce a good prediction. Entity linking: Our empirical study was conducted on three real benchmark datasets, described as follows.
• AIDA-CONLL: This dataset was constructed based on the source of CONLL2003 [40], which contains 1393 news articles. It consists of proper noun annotations, which indicate its corresponding entities in YAGO2 [41]. In our experiments, we evaluated all approaches on its testB dataset. • Wiki-Sports: This dataset contains the articles on the topic of sports extracted under the feature article page in Wikipedia. The mentions in the dataset are extracted from the anchor texts in the articles and annotated by the entities to which they link. We used the disambiguation page of Wikipedia to generate the candidates for each mention.
In order to avoid the leakage of label information, we eliminated the corresponding Wiki pages when extracting the link texts for the entities. • Wiki-FourDomains: This dataset contains articles extracted from four topics on Wikipedia: films, music, novels, and television episodes. We applied the same process to this dataset as to Wiki-Sports to generate mention annotations and candidate entities.
In the experiment, we linked each mention in the articles to the YAGO2 entity with the highest inference probability. We first extracted the following six features from the YAGO2 knowledge base: prior, semantic similarity, coherence, syntax similarity, edit distance, and word2vector similarity. Note that we also eliminated mention-entity pair candidates that obviously do not match from the inference process; otherwise, the large number of candidates may cause PSL to run out of memory.
To show the inference capability on a set of decision rules, we made use of the annotations from 300 documents to train a random forest. For each leaf node in the forest, we generated a decision rule, formulated as the logic implication "X → Y", where Y is the leaf node and X is the logic conjunction of all decision nodes on the path from the root to Y. We retained in total 38 rules whose impurity (measured by Gini) was less than 0.025. In addition, we added the rule link(m, e) 0.2 for every target pair, so that candidates unconstrained by any other rule take a small value. The rules are presented in Appendix B.
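The leaf-to-rule extraction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the features, synthetic labels, and forest hyperparameters are assumptions; only the "path from root to leaf, filtered by a Gini threshold of 0.025" logic mirrors the text.

```python
# Sketch: train a random forest on (synthetic) mention-entity features,
# then turn each low-impurity leaf into an implication "X -> Y".
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((500, 3))                         # e.g. prior, coherence, edit distance
y = (X[:, 0] + 0.3 * X[:, 1] > 0.8).astype(int)  # synthetic "link" label

features = ["prior", "coherence", "edit_dist"]
forest = RandomForestClassifier(n_estimators=5, max_depth=3,
                                random_state=0).fit(X, y)

rules = []
for est in forest.estimators_:
    t = est.tree_
    def walk(node, conds):
        if t.children_left[node] == -1:      # leaf node: emit a rule "X -> Y"
            if t.impurity[node] < 0.025:     # Gini threshold from the paper
                label = int(np.argmax(t.value[node]))
                rules.append((" AND ".join(conds) or "TRUE", f"link={label}"))
            return
        f, thr = features[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conds + [f"{f} <= {thr:.2f}"])
        walk(t.children_right[node], conds + [f"{f} > {thr:.2f}"])
    walk(0, [])

print(len(rules), "rules, e.g.:", rules[0])
```

Each retained rule is the conjunction of the split conditions along one root-to-leaf path, implying that leaf's majority label.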
The detailed evaluation results are presented in Table 4. It can be observed that the NMLN performs considerably better than PSL on prediction accuracy. The experiment showed that PSL cannot converge to consensus values and thus cannot perform well. On inference efficiency, the NMLN also performed considerably better than PSL: the NMLN finishes within half an hour, while PSL takes more than 14 h. We now provide an analysis of the experimental results. As mentioned in Equation (29), for each term P(x) in the loss function, ADMM transforms the term into P(z, γ, x) by replacing x with local variables z and adding constraints to ensure that the local variables converge to x. Assume that P(x) contains n variables and k irreducible groups. The number of local variables in P(z, γ, x) is then n × k. In contrast, the original formulation in PSL makes the ADMM method construct n^k local variables, which means that each global variable x_i has k copies in the NMLN but n^(k-1) copies in PSL. As a result, although the solution found by PSL is globally optimal for the dual problem in ADMM, its local variables do not actually converge to x; thus, the NMLN outperformed PSL on all datasets.
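The local-copy mechanism analyzed above can be illustrated with a minimal consensus-ADMM sketch: a shared global variable x appears in k = 2 terms, so it gets k local copies z_j (one per term, as in the NMLN), and the primal/dual residuals drive the copies to consensus. The quadratic terms are illustrative stand-ins for P(x), not the paper's actual loss.

```python
# Consensus ADMM on f(x) = (x-1)^2 + (x-3)^2, one local copy per term.
rho = 1.0
terms = [1.0, 3.0]                  # f_j(z) = (z - a_j)^2
z = [0.0, 0.0]                      # local copies z_j (k = 2 per global variable)
u = [0.0, 0.0]                      # scaled dual variables
x = 0.0                             # global (consensus) variable
for _ in range(100):
    x_old = x
    # Local updates: argmin_z (z - a)^2 + (rho/2)(z - x + u)^2
    z = [(2*a + rho*(x - ui)) / (2 + rho) for a, ui in zip(terms, u)]
    x = sum(zi + ui for zi, ui in zip(z, u)) / len(z)   # consensus (averaging)
    u = [ui + zi - x for ui, zi in zip(u, z)]           # dual ascent
primal = max(abs(zi - x) for zi in z)    # primal residual ||r_t||
dual = rho * abs(x - x_old)              # dual residual ||s_t||
print(round(x, 4), primal < 1e-6, dual < 1e-6)
```

With both residuals near zero, the local copies have reached consensus at the minimizer x = 2; PSL's failure mode discussed above corresponds to the primal residual never reaching zero.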
Evaluation of convergence: In this experiment, we compared the convergence of the two methods on the task of mobile phone ranking. The evaluation results are presented in Figure 4. According to Equations (27) and (28), the optimization process converges if the primal residual ||r_t||_2 and the dual residual ||s_t||_2 both approach zero. It can be observed that the NMLN converges quickly and stably under both criteria. In Figure 4a, the primal residual of PSL stops decreasing at the value of 64, so the method fails to converge.

Scalability
To evaluate the scalability of the NMLN, we generated synthetic data of various sizes for phone-ranking inference. Since the rules contain two kinds of unknown variables, FastCPU(c) and Performance(p), we generated the relations at a ratio of 0.2:0.8. The detailed evaluation results are presented in Figure 5. Our experiments show that PSL consumes a large amount of memory: its performance falls dramatically due to memory overflow when the number of variables exceeds 4500. Compared to PSL, the NMLN scales much better as the data size increases. As shown in Figure 5a, the NMLN finishes all inference tasks within two seconds. On small data, the NMLN spends most of its time in pre-processing, so the runtime does not increase significantly in (a). For large data sizes (more than 5000), we also provide the log-scale performance in Figure 5b. It can be observed that the runtime scales approximately linearly. In the figure, the speed slows down slightly when the data size exceeds 10,000, which is caused by the sequential operations in the pre-processing phase.
We also present the number of iterations required by both techniques to converge in Figure 5c,d. It can be observed that the NMLN takes 36 iterations on all tasks, with the number of variables varying from 100 to 10 M. The reason is that the average number of local variables per global variable is always fixed in the NMLN. In PSL, clause explosion causes a single variable to take more local copies as the size increases.

Sensitivity Evaluation
In this subsection, we evaluate the performance sensitivity of the NMLN w.r.t. the number of parallel threads, the base of the exponential function α, and the step size ρ. In our empirical study, all parameters other than the evaluated one were set to the same values. We ran the evaluation of parallel threads on the synthetic data from the scalability evaluation, since the number of variables has a significant impact on parallel methods. For the evaluations of α and ρ, we only present the results on the original mobile phone ranking data, because different numbers of variables appear to have no effect on the results.
The evaluation results on the number of parallel threads are presented in Figure 6, in which the x-axis denotes the number of variables and the y-axis denotes the runtime as a percentage of the non-parallelized method's (Threads = 1) runtime on the same data. It can be observed that the runtime of parallelized inference decreases significantly when the number of variables is large. Specifically, when the number of threads is set to six, the inference runtime decreases to 23% and 27% on 1000 K and 100 K variables, respectively. However, when there are fewer than 1 K variables, the runtime decreases only marginally as the number of threads increases. This should not be surprising, because small tasks are not suitable for parallelization.
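The parallel step being measured above exploits the fact that, in consensus ADMM, the local z_j-updates within an iteration are mutually independent and can be mapped across worker threads. The sketch below is illustrative only (the toy quadratic update and six workers are assumptions); real speedups require heavier per-term work than this.

```python
# Map independent ADMM local updates across a thread pool and check that
# the parallel result matches the sequential one.
from concurrent.futures import ThreadPoolExecutor

rho, x = 1.0, 0.5
terms = [(float(a), 0.0) for a in range(1000)]   # (a_j, u_j) pairs, illustrative

def local_update(au):
    a, u = au
    # argmin_z (z - a)^2 + (rho/2)(z - x + u)^2, in closed form
    return (2*a + rho*(x - u)) / (2 + rho)

with ThreadPoolExecutor(max_workers=6) as pool:
    z_par = list(pool.map(local_update, terms))

z_seq = [local_update(t) for t in terms]
print(z_par == z_seq)  # parallel and sequential updates agree
```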
The evaluation results on the base of the exponential function α are presented in Figure 7, in which the parameter varies from e to 100. It can be observed that the performance of the NMLN fluctuates only marginally over a wide range of α for both the primal and dual residuals. Therefore, NMLN inference is robust to different base values. The evaluation results on the step size ρ are presented in Figure 8, in which the parameter varies from 0.1 to 1.0. It can be observed that a larger value of ρ leads to faster convergence of the primal residual and slower convergence of the dual residual. Thus, the step size ρ should be set to a moderate value (0.5) to balance the two criteria.

Conclusions
Current MLN solutions cannot support knowledge inference involving arithmetic expressions. In this paper, we propose the Numerical Markov Logic Network (NMLN) to enable effective and efficient inference of hybrid knowledge involving both logic and arithmetic expressions. We define the exp-loss function as the metric to integrate arithmetic inequalities and logic formulas. By exploiting the decomposability of exp-loss functions, our method reduces the computational complexity from O(θn^k) to O(θnk), so that the inference scales well in the face of clause explosion. We also present a parallel solution for hybrid knowledge inference based on convex optimization. The proposed approach achieves better prediction accuracy than the state-of-the-art MLN solution while significantly reducing inference time.

Conflicts of Interest:
We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, the manuscript entitled "Numerical Markov Logic Network: A Scalable Probabilistic Framework for Hybrid Knowledge Inference", which has been approved by all authors. I would like to declare on behalf of my coauthors that the work described is original research that has not been submitted or published previously and is not under consideration for publication elsewhere, in whole or in part.

Appendix A. Knowledge Rules in the Phone Dataset
Table A1. Knowledge rules in the phone dataset.

Knowledge Rules | Weight

Appendix B. Knowledge Rules in the Aida Dataset
We show the meaning of all relations in the rules as follows: