Outlier Recognition via Linguistic Aggregation of Graph Databases

Abstract: Datasets frequently contain uncertain data that, if not interpreted with care, may affect information analysis negatively. Such rare, strange, or imperfect data, here called "outliers" or "exceptions", can be ignored in further processing or, on the other hand, handled by dedicated algorithms to decide if they contain valuable, though very rare, information. There are different definitions of and methods for handling outliers, and here we are interested, in particular, in those based on linguistic quantification and fuzzy logic. In this paper, for the first time, we apply definitions of outliers and methods for recognizing them based on fuzzy sets and linguistically quantified statements to find outliers in non-relational, here graph-oriented, databases. These methods are proposed and exemplified to identify objects being outliers (e.g., to exclude them from processing). The novelty of this paper lies in the definitions of and recognition algorithms for outliers using fuzzy logic and linguistic quantification when traditional quantitative and/or measurable information is inaccessible, which is frequently the case given the graph nature of the considered datasets.


Introduction
In collecting and processing information, there appear some uncertainties, mostly incomplete or imprecise data. The sources of uncertainty are usually measurements, probabilistic phenomena (stochastic uncertainty), lack of credibility, and linguistic descriptions (natural language uncertainty). To take care of such data, which, although strange, may contain valuable and exceptional information, one can consider them to be outliers. An "outlier" or "exception" (also: deviation, anomaly, aberration, etc.) means an observation that is rare, special, unique, unexampled, or infrequent. These terms mean that the properties of interest possessed by outlying objects are specific to the recipients considering/processing them. Outliers are especially noticeable as highlighted or unusual observations against a background of numerous phenomena/objects similar to one another, typical, or ordinary. Unrecognized outliers in data exploration and mining may decrease the reliability of analysis and increase data imprecision and noise. In other words, outlying objects may distort or blur the final gist or meaning of the collections analyzed. On the contrary, appropriately recognized outliers can bring unique information on changes of activities and congestion in networks or intrusions into them, illegal use of debit/credit cards, serious damage to production lines, rapid changes of patients' health status and of parameters of medical devices, etc.
The literature enumerates various definitions of outliers, mostly subjective, intuitive, and dependent on different numerical characteristics showing "how much" the considered objects are atypical for the analyzed databases or sets. For example, the definition of an outlier by Hawkins, the most frequently quoted, is [1]: "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism". Additionally, in [1], an outlier is defined as "any object x in the space X, which has some abnormal, unusual characteristics in comparison to other objects from X". Next, in [2], the authors say that outliers are "noise points outside the set which determine clusters" or, alternatively, "points outside the clusters but separated from the noise". Next, according to [3,4], "a point p in a data set is an outlier with respect to parameters k and λ, if no more than k points in the data set are at a distance λ or less from p", with k ∈ N, λ ∈ R. The λ parameter can be interpreted in terms of relations between objects in {x_1, . . . , x_N}, N ∈ N, not only as a metric or distance, but also as a semantic connection (e.g., similarity as a binary fuzzy relation), etc. Moreover, other definitions of outliers are worth mentioning from [5][6][7], as are the global and local outliers proposed in [8][9][10]. Interesting applications and techniques mixing outlier detection with clustering methods are proposed in [11,12]. Outliers in data streams (series or linear structures) are considered in [13]. The recognition of outliers is also the subject of consideration in [3,14,15] and many others.
The most important conclusion from the above literature review is that it does not provide one objective or axiomatic definition of an outlier at all. This is the specificity of this domain of research: outliers can be defined in different manners, with different characteristics and parameters, and in specific relation to particular problems, issues, datasets, etc. Thus, in this contribution, we focus on outliers defined in terms of linguistic quantification and fuzzy sets [16][17][18][19], since no publications on outliers handled this way can be found. Recognizing outliers via fuzzy quantification and linguistic information can be useful when numerical information and traditional quantitative terms are inaccessible for a given set of objects. In such situations, the only information available to detect anomalies is human experience and expert knowledge expressed linguistically (which is, in general, a common and obvious reason to apply fuzzy systems and techniques to various issues). Moreover, in our previous papers, we focused on outliers in relational databases, and now the main novelty of the paper is a successful attempt to use these methods for graph-oriented datasets. It is worth mentioning that graph datasets are frequently applied in circumstances in which relational structures are insufficient or unable to represent data and their meaning properly in the required context (e.g., in Customer Relationship Management, CRM, systems or in social media).
The paper is organized as follows: Section 2 is a list of preliminary definitions and operations in fuzzy logic and the linguistic quantification of statements. Section 3 recalls our definitions of outliers in terms of linguistic information and, based on these definitions, algorithms for outlier detection/recognition. The specificity of preprocessing graph datasets to use fuzzy methods of recognition is illustrated in Section 4. An implemented example of outlier recognition is given in Section 5 to show how the proposed methods work on real graph datasets (here: a database on consumer complaints in a CRM system in a bank [20]). Finally, in Section 6, we discuss future directions of work in the presented field.

Fuzzy Sets and Linguistic Quantification of Statements
In this section, we briefly review the basics of the linguistic quantification of statements in the sense of Zadeh [21]. A fuzzy set A in a finite non-empty universe of discourse X = {x_1, . . . , x_n} is characterized by its membership function µ_A : X → [0, 1]. The intersection of fuzzy sets A, B in X is a fuzzy set A ∩ B in X:

µ_A∩B(x) = t(µ_A(x), µ_B(x)), x ∈ X, (1)

where t is a triangular norm, e.g., min or product. The cardinality of A, the so-called Σcount(A) (sigma-count), is defined as [22]:

Σcount(A) = Σ_{x∈X} µ_A(x). (2)

A relative cardinality of A with respect to a fuzzy set B is proposed:

Σcount(A/B) = Σcount(A ∩ B)/Σcount(B). (3)

For a fuzzy set A in a continuous and uncountable universe of discourse Y, the following counterpart of (2) is proposed:

clm(A) = ∫_Y µ_A(y) dy, (4)

where "clm" comes from "cardinality-like measure". Finally, the support of A in X is a non-fuzzy set in X:

supp(A) = {x ∈ X : µ_A(x) > 0}. (5)

The generalized support of A is called the α-cut (alpha-cut) and replaces the right side of the inequality in Equation (5) with α ∈ [0, 1): A_α = {x ∈ X : µ_A(x) > α}. Alpha-cuts are non-fuzzy sets and they are necessary to define the convexity of a fuzzy set A in X: A is convex in X iff each of its alpha-cuts is a convex non-fuzzy set. A in X is normal iff sup_{x∈X} µ_A(x) = 1.
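The basic notions above can be illustrated with a short sketch, with fuzzy sets stored as plain dictionaries; the set "tall", its universe, and its membership values are our illustrative assumptions.

```python
def sigma_count(mu):
    """Sigma-count (2): the sum of membership degrees over a finite X."""
    return sum(mu.values())

def relative_sigma_count(mu_a, mu_b, t=min):
    """Relative cardinality (3) of A with respect to B, with the
    intersection (1) induced by a t-norm (here: min)."""
    inter = sum(t(mu_a[x], mu_b[x]) for x in mu_a)
    return inter / sigma_count(mu_b)

def alpha_cut(mu, alpha):
    """Elements with membership degree > alpha; alpha = 0 gives
    the support (5)."""
    return {x for x, m in mu.items() if m > alpha}

# A toy fuzzy set "tall" over four people:
tall = {"a": 0.0, "b": 0.25, "c": 0.75, "d": 1.0}
print(sigma_count(tall))             # → 2.0
print(sorted(alpha_cut(tall, 0.5)))  # → ['c', 'd']
```

Swapping `min` for multiplication in `relative_sigma_count` switches the t-norm, as allowed by (1).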
Assume S, W are linguistically expressed characteristics of objects, represented by fuzzy sets in a finite D = {d_1, d_2, . . . , d_N}, N ∈ N. Q is a relative fuzzy quantifier describing quantities of objects as ratios to a superset (usually, the universe of discourse of Q), e.g., "many of", "about 1/3", "less than half", "very few", and Q is modeled by a fuzzy set that is convex and normal in [0, 1] [23] (see Equations (26)-(28) as examples of relative fuzzy quantifiers). The first and the second form of linguistically quantified statements by Zadeh [21] are:

Q d's are S, (6)
Q d's being W are S, (7)

respectively. Their degrees of truth in terms of fuzzy logic are evaluated via Equation (3) as:

T(Q d's are S) = µ_Q(Σcount(S)/N), (8)
T(Q d's being W are S) = µ_Q(Σcount(S ∩ W)/Σcount(W)). (9)

It must be noticed that in the context of detecting outliers, only the so-called regular relative linguistic quantifier Q with a monotonically non-increasing µ_Q is taken into account, i.e.:

µ_Q(0) = 1, µ_Q(1) = 0, (10)
µ_Q(r_1) ≥ µ_Q(r_2) for 0 ≤ r_1 ≤ r_2 ≤ 1. (11)

Moreover, the quality of fuzzy quantified statements can be additionally evaluated with measures of the quality of the fuzzy quantifier Q: T_supp and T_clm [23,24]. They are based on the support (5) and on the clm measure (4), respectively, of the fuzzy set that represents the fuzzy quantifier Q:

T_supp(Q) = 1 − |supp(µ_Q)|, (12)
T_clm(Q) = 1 − clm(Q), (13)

where |supp(µ_Q)| denotes the measure (length) of the support of µ_Q in [0, 1]. Both measures depend on the presented characteristics of Q, and their meaning is: the closer to 1, the more precise the quantifier Q is.
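The degrees of truth of Zadeh's two forms, Equations (8) and (9), can be computed directly; the quantifier shape and the membership values below are illustrative assumptions of ours, not the quantifiers used later in the experiment.

```python
def truth_first_form(mu_q, mu_s):
    """Degree of truth of "Q d's are S", Equation (8)."""
    return mu_q(sum(mu_s) / len(mu_s))

def truth_second_form(mu_q, mu_s, mu_w, t=min):
    """Degree of truth of "Q d's being W are S", Equation (9)."""
    num = sum(t(s, w) for s, w in zip(mu_s, mu_w))
    return mu_q(num / sum(mu_w))

# A regular non-increasing relative quantifier: 1 at r = 0,
# falling linearly to 0 at r = 0.2 (an assumed shape).
very_few = lambda r: max(0.0, 1.0 - r / 0.2)

mu_s = [0.0, 0.1, 0.0, 0.4, 0.0]   # memberships in S over D, N = 5
mu_w = [1.0, 1.0, 0.0, 1.0, 0.0]   # memberships in W over D
print(truth_first_form(very_few, mu_s))        # → 0.5
print(truth_second_form(very_few, mu_s, mu_w))
```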
Naturally, the choice of representations for quantifier Q and characteristics (properties) S, W depends on the specificity of databases being analyzed, and, in particular, on linguistic information provided by experts. In Section 3, linguistically quantified statements (6) and (7) are essential for proposed definitions of outliers, and for detecting/recognizing outliers in graph-oriented datasets.

Outliers in Terms of Linguistically Aggregated Information
Now, we present the definitions of outliers based on fuzzy representations of linguistic information, i.e., by linguistically quantified statements with their degrees of truth.
Definition 1 (Outlier via the first form of linguistically quantified statement). Let D = {d_1, . . . , d_N}, N ∈ N, be a finite non-empty set of objects. Let S be a linguistic expression characterizing objects in D and represented by a fuzzy set in D. Let Q be a relative regular non-increasing linguistic quantifier (e.g., "almost none", "very few", "only several", or synonymous), represented by a fuzzy set in [0, 1], and let α ∈ [0, 1]. An object d ∈ D is an outlier iff:

µ_S(d) > α and the statement "Q d's are S" is true to a degree greater than α, (14)

where the degree of truth of the statement in (14) is evaluated via (8):

T(Q d's are S) = µ_Q(Σcount(S)/N). (15)

The definition of an outlier in terms of the second form of a linguistically quantified statement (7), i.e., taking into account two properties, S and W, possessed by objects in D, is introduced analogously:

Definition 2 (Outlier via the second form of linguistically quantified statement). Let D be defined as in Definition 1, and let S, W be linguistic expressions characterizing objects d ∈ D and represented by fuzzy sets in D. Let α ∈ [0, 1], and let Q be a relative regular non-increasing linguistic quantifier as in Definition 1. An object d ∈ D is an outlier iff:

µ_S∩W(d) > α and the statement "Q d's being W are S" is true to a degree greater than α, (16)

where the degree of truth of the statement in (16) is evaluated with Equation (9):

T(Q d's being W are S) = µ_Q(Σcount(S ∩ W)/Σcount(W)). (17)

Two algorithms for detecting outliers are now presented, related to Definitions 1 and 2, respectively. They are designed as tools for detecting outliers (anomalies, exceptional data, etc.) in circumstances when only imprecise and linguistically formulated knowledge of their specificity is available. In particular, objects d in an analyzed dataset D are considered to be outliers if they are intuitively characterized by expressions such as "small", "big", "hot", "very expensive", etc., represented by the fuzzy sets S, W, and the quantity of such objects is not determined precisely, but linguistically expressed with statements such as "very few" or "almost none", represented by Q.
Hence, the algorithms confirm that outliers exist in a dataset iff the statements "Q x's are S" or "Q x's being/having W are S" have a sufficiently large (larger than the threshold α) degree of truth. The common assumptions for Algorithms 1 and 2 are:
1. D = {d_1, . . . , d_N}, N ∈ N — the analyzed dataset;
2. S, W — linguistic labels for properties of d's in D, represented by fuzzy sets;
3. Q_1, Q_2, . . . , Q_K, K ∈ N — relative regular non-increasing linguistic quantifiers;
4. α ∈ [0, 1] — an arbitrarily chosen threshold for the degrees of truth of (14), (16).
To detect outliers in D with respect to the S property using Algorithm 1, we need the entry query in the form of:

How many d's are S? (18)

Algorithm 1 Detecting outliers via the first form of linguistically quantified statement.
1: for all k = 1, 2, . . . , K do
2:   T_k ← µ_Q_k(Σcount(S)/N)
3: if not T_1 > α and not T_2 > α and . . . and not T_K > α then return "NO OUTLIERS IN D"
4: else return "THERE EXIST OUTLIERS IN D"

The K linguistically quantified statements "Q_k d's are S", k = 1, 2, . . . , K, with their degrees of truth T_k, (19), are side effects of the algorithm; they are important in Algorithm 3, which recognizes outlying d's (if detected) in D, see Section 3.2. Now, Algorithm 2, referring to Definition 2, is presented: outlying objects in D are here detected on the basis of two, possibly overlapping, linguistic characteristics S, W. To detect outliers in D using Algorithm 2, the entry query is necessary:

How many d's being W are S? (20)

Algorithm 2 Detecting outliers via the second form of linguistically quantified statement.
1: rn ← 0, rd ← 0
2: for all n = 1, 2, . . . , N do
3:   rn ← rn + t(µ_S(d_n), µ_W(d_n))
4:   rd ← rd + µ_W(d_n)
5: for all k = 1, 2, . . . , K do
6:   T_k ← µ_Q_k(rn/rd)
7: if not T_1 > α and not T_2 > α and . . . and not T_K > α then return "NO OUTLIERS IN D"
8: else return "THERE EXIST OUTLIERS IN D"

As with Algorithm 1, the side effects of Algorithm 2 are K linguistically quantified statements "Q_k d's being W are S" with their degrees of truth T_k, (21).
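Both detection algorithms admit a compact rendering in Python; this is our illustrative sketch (the quantifier shapes and membership values are assumed), not the paper's reference implementation.

```python
def detect_outliers_first_form(mu_s, quantifiers, alpha):
    """A sketch of Algorithm 1: evaluate "Q_k d's are S" for every
    quantifier and report whether any degree of truth exceeds alpha.
    Returns (detected, truths); the truths are the side effect
    used later by the recognition step."""
    n = len(mu_s)
    truths = [mu_q(sum(mu_s) / n) for mu_q in quantifiers]
    return any(t > alpha for t in truths), truths

def detect_outliers_second_form(mu_s, mu_w, quantifiers, alpha, t=min):
    """A sketch of Algorithm 2: the same test for "Q_k d's being W are S"."""
    rn = sum(t(s, w) for s, w in zip(mu_s, mu_w))  # Sigma-count(S AND W)
    rd = sum(mu_w)                                 # Sigma-count(W)
    truths = [mu_q(rn / rd) for mu_q in quantifiers]
    return any(tr > alpha for tr in truths), truths

# Assumed linear quantifier shapes:
almost_none = lambda r: max(0.0, 1.0 - r / 0.1)
very_few = lambda r: max(0.0, 1.0 - r / 0.25)

mu_s = [0.0, 0.0, 0.95, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
detected, truths = detect_outliers_first_form(
    mu_s, [almost_none, very_few], alpha=0.5)
print(detected)  # → True
```

Here only one of ten objects is (strongly) S, so "very few d's are S" is sufficiently true and outliers are reported.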

Recognizing Outliers via Linguistic Information
Here, the next two algorithms, now for recognizing and enumerating outlying observations d ∈ D possessing the properties S, W, are introduced. The outlier detection tools, Algorithms 1 and 2, presented in Section 3.1, confirm only that some outliers do exist in the dataset D (true) or do not exist (false). However, the subsets of outliers D_out ⊂ D remain unspecified. Hence, we now deal with algorithms accomplishing the following task: recognizing and enumerating the particular objects in D that are outliers with respect to the S, W characteristics. In other words, via the algorithms presented here, the subsets of outliers D_out ⊂ D with respect to S, W are determined.
First, we take into account outliers according to Definition 1; the assumptions and symbols are as for Algorithm 1. Hence, for a given dataset D = {d_1, d_2, . . . , d_N}, N ∈ N, properties S, W, linguistic quantifiers Q_1, Q_2, . . . , Q_K, K ∈ N, and parameter α ∈ [0, 1], Algorithm 3 is based on the query (18). Of course, Algorithm 3 is fired iff Algorithm 1 did detect anomalies in D (otherwise, there is no point in seeking objects in D_out = ∅). The result of firing Algorithm 3 is the collection of outlying observations in D selected with the S characteristic and the query in the form of (18).
Algorithm 3 Recognizing outliers via the first form of linguistically quantified statement.
1: declare D_out = ∅
2: for all n = 1, 2, . . . , N do
3:   if µ_S(d_n) > α then D_out ← D_out ∪ {d_n}
4: return D_out

Analogously, subsets containing anomalies in D can be determined via Definition 2, with the same assumptions as for Algorithm 2. Hence, Algorithm 4 is proposed, with the query on its input given by (20). Its result is returned as an array of indices of the found anomalies d_n ∈ D, n ∈ {1, 2, . . . , N}.
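The two recognition steps admit an equally short sketch; the data are ours, and the Algorithm 4 condition (a t-norm of the two memberships exceeding α) reflects our reading of Definition 2.

```python
def recognize_outliers(d_ids, mu_s, alpha):
    """A sketch of Algorithm 3: collect the objects whose membership
    in S exceeds the threshold alpha (run only after Algorithm 1 has
    confirmed that outliers exist in D)."""
    return [d for d, m in zip(d_ids, mu_s) if m > alpha]

def recognize_outliers_two_properties(d_ids, mu_s, mu_w, alpha, t=min):
    """A sketch of Algorithm 4: membership in the intersection
    S AND W (via a t-norm, here min) must exceed alpha."""
    return [d for d, s, w in zip(d_ids, mu_s, mu_w) if t(s, w) > alpha]

ids = [101, 102, 103, 104]
mu_s = [0.2, 0.95, 0.1, 0.97]
mu_w = [1.0, 0.3, 1.0, 0.99]
print(recognize_outliers(ids, mu_s, alpha=0.9))                       # → [102, 104]
print(recognize_outliers_two_properties(ids, mu_s, mu_w, alpha=0.9))  # → [104]
```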

Preprocessing Graph Databases to Use Linguistically Aggregated Information
Now, we briefly explain how filtering a graph database is done. The aim is to select/filter objects of given types (e.g., clients, transactions, transfers) and represent them as a one-dimensional set (e.g., a sequence, collection, or list). It is important to notice that objects in a resulting sequence need not be counterparts of every single vertex of a given type; e.g., a resulting object is not necessarily related to one "complaint" vertex, but rather to a set of data combined from several vertices/edges/properties describing a particular fact, here: a complaint. Next, the selected objects are inputs for the method of outlier detection via linguistically quantified statements (described in Section 3). We assume that the graph database D is represented by a directed labeled graph G in which V (or V(G)) is the set of vertices of G, E (or E(G)) is the set of edges of G, and L (or L(G)) is the set of labels of G. It is important that both vertices and edges can have labels assigned, so G can be vertex-labeled or edge-labeled, respectively. In this case, L(G) can be divided into two subsets, VL(G) (vertices' labels) and EL(G) (edges' labels). A label of a vertex determines its type, and a label of an edge determines the relation between two vertices. Finally, P(V, E) is a set of properties that can be possessed by vertices or edges, e.g., the amount of a transfer, the date of a complaint, etc. (in fact, properties are counterparts of attributes in relational data models).
The graph database of the Customer Relationship Management (CRM) system taken into account in the experiment handles complaints submitted by clients of a bank [20]. The vertices of graph G represent data on customers' complaints submitted to the CRM. The sample structure of graph G is illustrated in Figure 1. The key point of transforming selected nodes into a sequence is to construct and execute the query. Notice that it is not a simple serialization or graph search, since objects in the resulting sequence may be described by properties of several different vertices and/or edges. The structure of a single object in the resulting sequence is determined by a specific query R in the general form of:

R(VL_R, EL_R, P_R), (22)

where VL_R ⊆ VL(G) is the subset of vertices' labels, EL_R ⊆ EL(G) is the subset of edges' labels, and P_R ⊆ P(V, E) is the subset of vertices'/edges' properties. The Neo4j database management system is used in the experiment [25], and one of the queries executed to obtain the selected vertices on complaints as the sequence D is based on the MATCH clause (a counterpart of SELECT in SQL), returning, among others, complaint.year, complaint.day, complaint.month, r.name, company.name, TO.disputed, TO.timely, p.name, t.name, and s.name, where VL_R = {COMPLAINT, PRODUCT, RESPONSE, COMPANY, SUBMITTED, TAGS}, EL_R = {WITH, ABOUT, VIA, TO, AGAINST}, and P_R = P(V, E), see (22).
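To make the transformation concrete without a running Neo4j instance, the following sketch mimics the effect of a query R(VL_R, EL_R, P_R) on a toy in-memory graph. The labels, properties, and merging strategy are our illustrative assumptions, not the paper's exact schema or the Neo4j API.

```python
# A toy in-memory labeled graph standing in for the Neo4j database:
vertices = {
    1: ("COMPLAINT", {"day": 3, "month": 4, "year": 2016}),
    2: ("COMPANY",   {"name": "Acme Bank"}),
    3: ("PRODUCT",   {"name": "Mortgage"}),
}
edges = [
    (1, 2, "AGAINST", {}),
    (1, 3, "ABOUT", {}),
]

def filter_graph(vertices, edges, vl_r, el_r):
    """A sketch of executing a query R(VL_R, EL_R, P_R), see (22):
    for every vertex of the 'root' label, merge its own properties
    with the properties of vertices reachable over edges whose labels
    are in EL_R, yielding one flat object per complaint."""
    root = "COMPLAINT"
    sequence = []
    for vid, (label, props) in vertices.items():
        if label != root or root not in vl_r:
            continue
        record = dict(props)
        for src, dst, elabel, _eprops in edges:
            if src == vid and elabel in el_r:
                dlabel, dprops = vertices[dst]
                if dlabel in vl_r:
                    # prefix foreign properties with the vertex label
                    for key, value in dprops.items():
                        record[f"{dlabel.lower()}.{key}"] = value
        sequence.append(record)
    return sequence

seq = filter_graph(vertices, edges,
                   vl_r={"COMPLAINT", "COMPANY", "PRODUCT"},
                   el_r={"AGAINST", "ABOUT"})
print(seq[0]["company.name"])  # → Acme Bank
```

The resulting flat records play the role of the rows of Table 1: one object per complaint, with properties gathered from several vertices.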
As the result of executing this query on the given graph dataset, the sequence D = {d_1, . . . , d_N}, N = 40,083, of objects representing complaints is selected. Sample records of the sequence are illustrated in Table 1.
Table 1. Sample records of dataset D obtained as a sequence from the graph database processed, see Section 4 and Figure 1.

Application Example
The proposed algorithms for outlier recognition via linguistic information are now run on the obtained sequence (see Table 1). Because of the specificity of the data and the possible connections/relations between them, a graph representation has been chosen for the set of submitted complaints. The general schema of the experiment is to filter (select) the vertices representing the complaints themselves and represent them as a sequence of objects whose parameters are inputs for the algorithms detecting outliers (see Section 3).
Next, the properties of interest of these objects are fuzzified, which means their crisp values are assigned to labels and corresponding fuzzy sets: "date received" to {early spring, middle spring, summer, autumn, early winter, winter} in X_1, "county per capita income" to {poor, middle, rich} in X_2, and "time of sending complaint" to {short, average, long} in X_3; see the sample linguistic values of chosen properties of objects in D, (23)-(25). S_1 represents the label "early spring" in X_1 = {1, 2, . . . , 366} (days in a year), with µ_S_1(x) given by (23) and 0 otherwise. S_2 represents the label "rich county" in X_2 = {0, 1, . . . , 70} (per capita income, in USD thousands, in the county the submission comes from), with µ_S_2(x) given by (24) and 0 otherwise. S_3 represents the label "average time" in X_3 = {0, 1, . . . , 30} (numbers of days between receiving the complaint and sending it to a company by the CFPB, Consumer Financial Protection Bureau), with µ_S_3(x) given by (25) and 0 otherwise. S_4 is a non-fuzzy set representing one of the labels {Older American, Servicemember, Older American and Servicemember, none}. The relative linguistic quantifiers proposed to be applied in Algorithm 2, according to Definition 2, are Q_1 = "very few", Q_2 = "close to 0", Q_3 = "almost none". They are illustrated in Figure 2, and their membership functions for r ∈ [0, 1] ⊂ R are given by (26)-(28).
Figure 2. Membership functions of the linguistic quantifiers Q_1 = "very few", Q_2 = "close to 0", Q_3 = "almost none", i.e., (26)-(28), respectively, used in the linguistic summaries (see Table 2).
Now we use Definitions 1 and 2 to detect outliers in D. We use S and W as non-empty combinations of S_1, S_2, S_3, e.g., S_1 AND S_2, S_1 AND S_3, etc. Hence, queries R in the form of (18) and (20) are needed to fire Algorithms 1 and 2, e.g.: How many (Q) complaints submitted in spring (W) come from rich county (S)?
(29)
Finally, 49 queries are formulated, and since we operate on 3 fuzzy quantifiers Q_1, Q_2, Q_3 substituting Q in (29), 3 × 49 = 147 linguistically quantified statements are generated, see Table 2. The threshold α = 0.9 is arbitrarily chosen to distinguish the statements with the largest degrees of truth (see Definition 2): statements 145 and 147 are found to be sufficiently true to detect some outliers (lines bolded in Table 2). In both cases, Algorithm 4 is used to determine the particular outlying objects in D, because the statements are in the form of (7). Two sets of outliers are finally recognized, D_out1 and D_out2. Objects with IDs in D_out1 = {801,691; 801,371; 375,975} are outliers detected by statement 145, and objects with IDs in D_out2 = {663,648; 210,516; 253,242; 673,669; 305,167; 716,577} are outliers detected by statement 147. These choices were checked and confirmed by experts as outlying objects. Table 2. Linguistically quantified statements 1.-147. generated with their degrees of truth T and the T_supp, T_clm measures (see (12) and (13)). The statements 145. and 147. that detected outliers are bolded.
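The fuzzification step described above can be sketched with parameterized trapezoidal membership functions; the breakpoints below are our assumptions for illustration, not the exact parameters of (23)-(25).

```python
def trapezoid(a, b, c, d):
    """A trapezoidal membership function with support (a, d) and
    core [b, c]; the breakpoints are illustrative assumptions."""
    def mu(x):
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)
    return mu

# "rich county" on per-capita income (USD thousands), X2 = {0,...,70}:
rich = trapezoid(35, 50, 70, 71)
# "average time" on days between receiving and forwarding, X3 = {0,...,30}:
average_time = trapezoid(2, 5, 10, 15)

print(rich(60))          # → 1.0
print(average_time(12))  # → 0.6
```

Applying such functions to every record of the sequence D yields the membership degrees consumed by Algorithms 1-4.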

Moreover, one interesting observation must be noticed here: in Table 2, linguistic expressions 145 and 146 have the same S ("come from rich county AND are sent by CFPB in an average time") and W ("submitted in early spring") properties, but different linguistic quantifiers ("almost none" and "close to 0", respectively) are used. As a result, the former is qualified as detecting possible outliers and the latter is not. Obviously, this depends on the membership functions of the quantifiers, so one may conclude that testing different fuzzy representations of expert knowledge is crucial for the final results of detection.

A Comparison to the LOF Algorithm
The sequence D containing 40,083 objects is now the input for the LOF (Local Outlier Factor) algorithm for detecting outliers [26]. The Python libraries scikit-learn [27] and pandas [28] are applied in the computations. The sets of parameters for LOF and the numbers of outliers detected are given in Table 3, which illustrates the different parameters of LOF taken into account to analyze the given sequence of data. It must be underlined that only raw numerical data are analyzed, since there is no possibility to feed LOF with linguistically expressed knowledge. As can be seen, the number of outliers found by LOF varies from 0 to over 3000. Moreover, only the correlation metric combined with a very small contamination provides numbers of outliers similar to those of the proposed fuzzy algorithms. However, the outliers found by the LOF algorithm (which does not use fuzzy sets) are different from the outliers detected by our algorithms, and the most probable explanation is that traditional algorithms do not use linguistically expressed knowledge. The conclusion is that applying both methods jointly, the traditional one and the one proposed, is worth considering in order to recognize all outlying objects.
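For reference, a minimal sketch of the LOF setup with scikit-learn's LocalOutlierFactor; the toy data replace the 40,083-object sequence, and the parameter values are assumptions, not those of Table 3.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# A small numeric stand-in for the feature vectors of the complaints:
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 3)),  # a dense cluster
    [[8.0, 8.0, 8.0]],                   # one planted anomaly
])

# fit_predict labels outliers with -1 and inliers with +1;
# metric="correlation" (as in Table 3) is also accepted by scikit-learn.
lof = LocalOutlierFactor(n_neighbors=10, metric="euclidean",
                         contamination=0.02)
labels = lof.fit_predict(X)
print(labels[-1])                 # → -1 (the planted anomaly is flagged)
print(int((labels == -1).sum()))  # number of flagged objects
```

The `contamination` parameter plays the same role as in Table 3: it sets the expected fraction of outliers and thus directly drives how many objects are flagged.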

Conclusions
In this paper, we introduce a novel method of outlier detection and recognition in graph datasets for situations when only linguistic and/or imprecise knowledge is available to distinguish suspected objects against the background of regular, typical data. The method is applicable when no quantitative or measurable information is accessible (and, thus, outlier definitions by Aggarwal, Knorr, etc. would not work), but when it is possible to create fuzzy models, i.e., fuzzy representations of expert knowledge, based on raw numerical data (which is a common practice in fuzzy computations). Specific processing of graph databases is taken into account to make them readable for fuzzy methods. An illustrative implementation example is provided to show how graph data can be processed by fuzzy representations of linguistic information and, finally, to point at particular objects as recognized outliers. In other words, we show how the question "which objects are outliers in D?" can be answered, and not only "are there outliers in D or not?". Algorithms 1 or 2 can confirm that outliers are present in D, and the subsets of outlying observations D_out in the analyzed D are determined by Algorithms 3 or 4, taking into account the degrees of truth of the linguistically quantified statements generated by Algorithms 1 or 2 as their side effects, see (19) and (21). Finally, we would like to underline that the proposed approach to detecting and recognizing outliers in datasets, and especially its novel use of linguistically quantified statements interpreted in terms of fuzzy sets, has not been applied to graph datasets before.
Currently, our further research on recognizing outliers is in progress, mostly using multi-subject linguistic summaries, cf. [29], and analyzing other non-relational databases.
Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: [20].