Detection of Outliers via Uncertain Knowledge and the IF–THEN Method

Kacprowicz, Marcin; Niewiadomski, Adam

doi:10.3390/app152312833


by Marcin Kacprowicz * and Adam Niewiadomski

Institute of Information Technology, Lodz University of Technology, Al. Politechniki 8, 93-590 Lodz, Poland

* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(23), 12833; https://doi.org/10.3390/app152312833

Submission received: 26 June 2025 / Revised: 25 November 2025 / Accepted: 27 November 2025 / Published: 4 December 2025


Abstract

In data mining and exploration, outliers are specific and infrequent data that require special attention, as they may reveal potentially hazardous information. Detecting outliers can support, e.g., the identification of fraudulent credit card usage, unauthorized access to transactions, or even attacks on banking systems. The paper proposes a definition of an outlier in terms of fuzzy representations of expert knowledge and applies it to outlier detection. The proposed approach has the potential to enhance the performance of outlier detection in various fields, including finance and banking data storage and analysis. By “enhance” we mean that the new method is intended to cooperate with known numerical methods, e.g., LOF, rather than supersede or deprecate them. The usefulness of the method is demonstrated by detecting new outlying observations in given datasets using input data expressed in an imprecise, linguistic manner.

Keywords:

outliers in databases; fuzzy IF-THEN rules; detection of outliers; outlying objects; outstanding data; anomaly detection; fuzzy logic

1. Introduction

The rapid development of information technology has led to an enormous accumulation of data. The sheer volume of data renders analysis by human agents increasingly challenging. Processed data are often plagued by imprecise or incomplete entries. Frequently, models characterizing the data are unavailable, which complicates the analysis even further. Uncertainty stems from various sources, such as stochastic or linguistic factors, and it can significantly affect data analysis, particularly for data that are rare, unusual, or singular, regardless of their origin. Rare or uncertain objects are often treated as outliers or exceptions. Hawkins defines an outlier as “an observation that deviates significantly from other observations, arousing suspicion that it was generated by a different mechanism” [1]. Despite its deviation from other entries, an outlier should not be overlooked, as it frequently contains valuable, uncommon information that needs attention. Outliers are conspicuous precisely because they deviate from the numerous similar, typical, or ordinary objects. In data mining and exploration, unidentified or misidentified outliers may obscure the true essence of the analyzed collections, complicating and potentially distorting the overall findings. Correctly identified outliers, on the other hand, may offer crucial insights into numerous domains, from network intrusions and fraudulent financial activities such as credit card misuse to abrupt shifts in medical equipment parameters that reflect critical health conditions [2,3,4,5,6,7,8,9].

The article builds upon the authors’ previous work [10,11,12,13], which explored the use of fuzzy logic and linguistic variables for detecting exceptional objects in structured and semi-structured datasets. These studies laid the foundation for the current method by introducing fuzzy IF–THEN rules, coverage-based evaluation, and linguistic modeling of heterogeneous data.

Novel Contributions of This Work

This paper extends and deepens our prior work on fuzzy rule-based outlier detection [12,13], and introduces the following novel aspects:

We propose an enhanced decision criterion for outlier rules by requiring that at least three out of four fuzzy implications (Product, Łukasiewicz, $IK_{1}$, $IK_{2}$) meet specific thresholds (Section 4).
The paper introduces and uses three distinct S-shaped functions to compute the degree of sufficient coverage, which were not included or described in earlier works.
We provide an extended empirical evaluation, including a comparative analysis with the Local Outlier Factor (LOF) method under multiple parameter configurations, highlighting the strengths and limitations of both approaches (Section 5).
The proposed method is validated on a large real-world dataset, and the results were reviewed with domain experts in the banking sector, strengthening the practical significance of our approach.

2. Literature Review

Outlier detection in complex networks is crucial for applications across various domains such as social, communication, and biological networks. Traditional methods often struggle with the complexity and dynamic nature of these networks. Recent research, including the work by Wang et al. [14], explores innovative approaches using intuitionistic fuzzy set ensembles, enhancing detection accuracy through ensemble learning. Extensive experiments validate the superior performance of this method in diverse network types. Suresh and Kannan [15] address the challenge of detecting outliers in fuzzy time series by employing a fuzzy approach to identify unusual observations. Experimental results confirm the efficiency of this method in various types of time series data. Similarly, Cateni, Colla, and Vannucci [16] utilize fuzzy logic for outlier detection, leveraging its ability to handle uncertainty and imprecision in data analysis. In computer security, Harish et al. [17] propose a modified fuzzy clustering technique for outlier-based intrusion detection, showing effective results in ensuring data integrity and protection against attacks. Garg and Batra [18] introduce an ensembled technique combining multiple outlier detection methods, demonstrating improved effectiveness and resistance to false alarms in various datasets. Novaes et al. [19] employ Long Short-Term Memory (LSTM) and fuzzy logic for outlier detection in Software-Defined Networks (SDN), achieving effective detection and recognition of outliers by the use of IF–THEN rules for outlier identification and mitigation. Mungara et al. [20] use fuzzy neural networks for detecting cyber anomalies, showcasing real-time detection and response to cyber threats. Moniruzzaman and Hossain [21] provide a comprehensive analysis of NoSQL databases in the context of Big Data analytics, highlighting their schema flexibility, scalability, performance, and availability.

Bartczak and Niewiadomski in [10] introduced the concept of using linguistic summaries in customer relationship data stored in graph databases. This foundational work inspired later developments in graph-based fuzzy models, which are further explored in the present study. Subsequently, Bartczak and Kacprowicz in [11] explored the application of fuzzy logic to the detection of anomalies in banking datasets. The current research builds on this idea by formalizing the concept of outlier rules and proposing improved detection thresholds. Niewiadomski, Kacprowicz, and Bartczak in [12] proposed an early prototype of using IF–THEN fuzzy rules for anomaly detection in graph databases. The present paper extends that approach by introducing enhanced rule evaluation criteria and S-shaped membership functions. Kacprowicz, Bartczak, and Niewiadomski in [22] examined the construction of fuzzy sets for CRM systems. These techniques were employed in the current work to define membership functions for fuzzy rules. Finally, Bartczak and Kacprowicz in [13] presented preliminary results of outlier rule detection. In the present study, this line of research is significantly extended through the introduction of new fuzzy implications ($IK_{1}$, $IK_{2}$) and refined decision criteria.

3. An Outlier in Terms of Fuzzy Rules

3.1. Clarification of Terminology

To avoid ambiguity, we distinguish between the following two terms used throughout this paper:

Outlier rule: A fuzzy rule $R_{k} \in R$ is considered an outlier rule if it satisfies the conditions

$O(R_{k}) \geq \kappa \text{ and } C \leq C_{\max}$

where $O(R_{k})$ is the degree of outlierness, $\kappa \in (0, 1]$ is a fixed threshold (typically 0.95), C is the degree of sufficient coverage, and $C_{\max}$ is its upper bound (e.g., $C_{\max} = 0.1$).
Outlying object: An object $d_{i} \in D$ is an outlying object if it activates at least one outlier rule $R_{k}$, i.e., the fuzzy membership functions associated with the antecedent and consequent of $R_{k}$ both yield non-zero values for $d_{i}$, and $R_{k}$ is marked as an outlier rule under the criteria above.

This distinction is essential: the proposed method identifies rules that deviate structurally within the dataset (outlier rules), and from them, the method infers the corresponding outlying observations (outlying objects).

The main objective of this article is to present a method for outlier detection based on fuzzy logic and IF–THEN rules. Here, we propose the definition of an outlier in terms of fuzzy logic:

Let $D = \{d_{1}, d_{2}, \dots, d_{N}\}$, $N \in \mathbb{N}$, be a finite, non-empty set of objects. Likewise, $R = \{R_{1}, R_{2}, \dots, R_{K}\}$, $K \in \mathbb{N}$, denotes a set of fuzzy rules represented in the form “IF $V_{1}(d_{i})$ is $A_{k}$, THEN $V_{2}(d_{i})$ is $B_{k}$”, $i = 1, 2, \dots, N$. Here, $A_{k}$ and $B_{k}$ stand for the antecedent and consequent of $R_{k}$, respectively, both represented by fuzzy sets within $D$. Once a dataset is specified, all possible antecedents and consequents of rules can be created, up to the user/expert knowledge, with labels and fuzzy sets.

Definition 1 (An outlier in terms of fuzzy rules). Let $\kappa \in (0, 1]$. An object $d_{i} \in D$, $i = 1, 2, \dots, N$, is an outlier iff it activates a rule $R_{k}$, $k = 1, 2, \dots, K$, satisfying the condition

$O(R_{k}) \geq \kappa.$

(1)

For a given $k \leq K$, the degree of outlierness of $R_{k}$, $O(R_{k})$, is defined according to Kosko [23] and Berg [24] as

$O(R_{k}) = \begin{cases} \min\{\max\{T, 1 - T\}, 1 - C\}, & T > 0 \\ 0, & T = 0, \end{cases}$

(2)

where T represents the degree of truth (see Equation (3)), which corresponds to a conditional and unqualified proposition in the sense of Yuan [25], and C is the degree of sufficient coverage, which acts as a threshold on the share of objects firing the rule $R_{k}$ (see Equation (4)). When $T > 0$, the value of $O(R_{k})$ is the smaller of $\max\{T, 1 - T\}$ and $1 - C$, so a high coverage C always lowers it. When $T = 0$, meaning complete absence of truth, the degree of outlierness is explicitly set to 0, meaning that the given rule cannot identify any outlier among the objects firing it. Among the implications considered, the Product implication is defined as

$\mu_{A_{k} \to B_{k}}(d_{i}) = \mu_{A_{k}}(d_{i}) \cdot \mu_{B_{k}}(d_{i})$

This implication captures the degree of joint membership in both the antecedent and consequent fuzzy sets using the standard product t-norm.
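As an illustration, Equation (2) can be sketched in a few lines of Python. The function name `outlier_degree` and the sample values of T and C are assumptions made for this sketch; they are not taken from the experiments.

```python
def outlier_degree(truth: float, coverage: float) -> float:
    """Degree of outlierness O(R_k) as in Equation (2).

    truth    -- degree of truth T of the rule (Equation (3))
    coverage -- degree of sufficient coverage C of the rule (Equation (4))
    """
    if truth == 0.0:
        # T = 0: the rule cannot indicate any outlier.
        return 0.0
    return min(max(truth, 1.0 - truth), 1.0 - coverage)


# Hypothetical values, chosen only to show how the formula behaves.
print(outlier_degree(0.97, 0.0))  # rarely fired rule with high truth
print(outlier_degree(0.97, 0.2))  # same truth, now capped by 1 - C
print(outlier_degree(0.0, 0.0))   # T = 0 gives 0 by definition
```

Because the result is a minimum, a rule with large coverage can never reach the threshold $\kappa$, regardless of its degree of truth.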

The degree of truth T, used in Equation (2), is defined as follows:

$T = \frac{\sum_{i = 1}^{N} \mu_{A_{k} \to B_{k}}(d_{i})}{\sum_{i = 1}^{N} \mu_{A_{k}}(d_{i})},$

(3)

where $A_{k} \to B_{k}$ is a fuzzy implication, e.g., the Product implication (product t-norm), the Łukasiewicz implication, or one of the proposed engineering implications $IK_{1}$ (Equation (13)) and $IK_{2}$ (Equation (14)); see Section 4. The degree of truth T aggregates the values of the chosen fuzzy implication over all objects $d_{i} \in D$ and is computed for every fuzzy rule in the set $R$.
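A minimal sketch of Equation (3) with the Product implication is given below. The membership lists are made-up values, and returning 0 when no object belongs to the antecedent is an assumption of the sketch, since Equation (3) is undefined in that case.

```python
from typing import Sequence


def degree_of_truth(mu_a: Sequence[float], mu_b: Sequence[float]) -> float:
    """Degree of truth T (Equation (3)) using the Product implication.

    mu_a[i] and mu_b[i] are the membership degrees of object d_i in the
    antecedent set A_k and the consequent set B_k, respectively.
    """
    if len(mu_a) != len(mu_b):
        raise ValueError("mu_a and mu_b must describe the same objects")
    denominator = sum(mu_a)
    if denominator == 0.0:
        # Assumption: no object matches the antecedent, so T is taken as 0.
        return 0.0
    numerator = sum(a * b for a, b in zip(mu_a, mu_b))
    return numerator / denominator


# Three hypothetical objects.
mu_a = [1.0, 0.5, 0.0]
mu_b = [0.5, 1.0, 1.0]
print(degree_of_truth(mu_a, mu_b))  # (0.5 + 0.5 + 0.0) / 1.5
```

Replacing the product `a * b` with another implication changes only the numerator.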

The parameter C, referred to as the degree of sufficient coverage [26], indicates whether a rule is activated by a sufficiently large number of objects $d \in D$:

$C = \begin{cases} 0, & r < r_{1} \\ g(r), & r_{1} \leq r \leq r_{2} \\ 1, & r > r_{2}, \end{cases}$

(4)

where r, defined in Equation (5), is the ratio of objects that are relevant to both the antecedent set $A_{k}$ and the consequent set $B_{k}$,

$r = \frac{1}{N} \sum_{i = 1}^{N} t_{i},$

(5)

for

$t_{i} = \begin{cases} 1, & \text{if } \mu_{A_{k}}(d_{i}) > 0 \text{ and } \mu_{B_{k}}(d_{i}) > 0 \\ 0, & \text{otherwise,} \end{cases}$

(6)

and $g(r) : [0, 1] \to [0, 1]$ is an S-shaped function. The choice of $r_{1}$ and $r_{2}$ establishes the transition points of the function, and $g(r)$ determines the shape of the curve between $r_{1}$ and $r_{2}$, allowing flexibility in modeling the activation behavior. S-shaped functions are commonly used to describe gradual transitions or activation levels within a given range, which makes them a valuable tool in areas such as fuzzy systems, decision-making, and pattern recognition, where gradual or stepwise transformations are essential.

The $t_{i}$ values indicate the relevance (if 1) or non-relevance (if 0) of individual objects $d_{i}$ to both $A_{k}$ and $B_{k}$. If the membership degrees to $A_{k}$ and to $B_{k}$ are both greater than zero, $t_{i}$ is set to 1, indicating that the object is able to fire the rule $R_{k}$. This coverage criterion helps to assess whether the rule has sufficient support in the dataset to be activated or applied to a significant number of objects. The value of C determined by r, see Equation (5), is related to specific characteristics of the dataset and to the chosen context of the analysis performed. It sets a threshold to ensure that a rule is activated only when attributes represented by r show a predefined degree of coverage.
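Equations (4)–(6) can be sketched as follows. The linear ramp used for $g(r)$ and the transition points `r1 = 0.1` and `r2 = 0.5` are placeholders chosen for this illustration; the source does not fix them at this point.

```python
from typing import Sequence


def coverage_ratio(mu_a: Sequence[float], mu_b: Sequence[float]) -> float:
    """Ratio r (Equations (5) and (6)): share of objects that fire the rule."""
    if not mu_a or len(mu_a) != len(mu_b):
        raise ValueError("mu_a and mu_b must be non-empty and of equal length")
    fired = sum(1 for a, b in zip(mu_a, mu_b) if a > 0 and b > 0)
    return fired / len(mu_a)


def sufficient_coverage(r: float, r1: float, r2: float) -> float:
    """Degree of sufficient coverage C (Equation (4)).

    Assumption: g(r) is a linear ramp between r1 and r2.
    """
    if r2 <= r1:
        raise ValueError("r2 must be greater than r1")
    if r < r1:
        return 0.0
    if r > r2:
        return 1.0
    return (r - r1) / (r2 - r1)


# Four hypothetical objects; only the first has non-zero membership in both sets.
r = coverage_ratio([0.2, 0.0, 0.5, 0.0], [0.1, 0.3, 0.0, 0.0])
print(r, sufficient_coverage(r, r1=0.1, r2=0.5))
```

A rule fired by very few objects yields a small r and hence a small C, which is the regime in which it can qualify as an outlier rule.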

An implementational example of the presented IF–THEN method based on Definition 1 is given in Section 4 and the discussion of results and their comparison to traditional numerical methods in Section 5.

3.2. Generation of Fuzzy IF–THEN Rules

The construction of fuzzy decision rules in the proposed approach follows an algorithmic and domain-independent procedure rather than manual rule crafting. Once the linguistic variables, their universes of discourse, and associated fuzzy sets (membership functions) are defined for a given problem domain, the algorithm systematically generates all possible combinations of antecedent and consequent pairs. In practice, this process consists of the following steps:

Definition of linguistic variables: Each input and output attribute relevant to the problem is expressed as a linguistic variable (e.g., “season”, “income level”, “average response time”).
Specification of fuzzy sets: For every linguistic variable, fuzzy sets are assigned (e.g., “low”, “medium”, “high”) along with their membership functions over the defined universe of discourse.
Combinatorial generation of rules: The algorithm constructs candidate IF–THEN rules by combining each possible antecedent fuzzy set with each possible consequent fuzzy set across all variables. This step yields the complete set

$R = \{R_{1}, R_{2}, \dots, R_{K}\},$

where K depends on the number of variables and fuzzy sets defined.
Evaluation and selection: For each generated rule $R_{k}$ , the degree of truth T and degree of coverage C (as defined in Equations (3) and (4)) are computed. Rules satisfying the thresholds for outlierness (Equation (1)) are retained as outlier rules; the remaining rules are discarded.

This combinatorial rule generation ensures that the method does not rely on manual rule definition, making it scalable to new datasets as long as the linguistic variables and their fuzzy partitions are specified. It also facilitates reproducibility, since the same generation procedure can be applied automatically to any dataset with equivalent variable definitions.
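The combinatorial step can be sketched with the standard `itertools` module. The sketch is simplified to single-antecedent rules (the example rule in this paper joins two antecedents with AND), and the dictionary layout and function name are assumptions of the illustration.

```python
from itertools import permutations, product
from typing import Dict, List, Tuple

Term = Tuple[str, str]    # (linguistic variable, fuzzy label)
Rule = Tuple[Term, Term]  # (antecedent, consequent)


def generate_rules(variables: Dict[str, List[str]]) -> List[Rule]:
    """Build every 'IF V1 is A THEN V2 is B' candidate rule with V1 != V2."""
    rules: List[Rule] = []
    # Ordered pairs of distinct variables: (antecedent variable, consequent variable).
    for (var_a, labels_a), (var_b, labels_b) in permutations(variables.items(), 2):
        # Every antecedent label combined with every consequent label.
        for label_a, label_b in product(labels_a, labels_b):
            rules.append(((var_a, label_a), (var_b, label_b)))
    return rules


variables = {
    "submission day": ["spring"],
    "county per capita income": ["middle county"],
    "days to send complaint": ["average time"],
}
rules = generate_rules(variables)
for (var_a, label_a), (var_b, label_b) in rules:
    print(f"IF {var_a} is {label_a} THEN {var_b} is {label_b}")
```

With three variables and one label each, this yields six candidate rules; the count K grows with the number of variables and labels, which is why the subsequent evaluation and selection step is needed.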

4. Detecting Outliers in Graph Databases—An Implementational Example

This section shows how outliers are detected according to the proposed definition and the IF–THEN method. The method is evaluated with four implications (Product, Łukasiewicz, $IK_{1}$, and $IK_{2}$), each combined with three different S-shaped functions. The tests were performed on 2118 fuzzy rules. Using several implications enriched the experiment and revealed new outliers. Because not every implication classified a given fuzzy rule as an outlier rule, the rules were examined in detail. A fuzzy rule was considered an outlier rule if the degree of sufficient coverage $C \leq 0.1$ and the degree of outlierness $O(R_{k}) \geq 0.95$. An object firing such a rule is classified as an outlying object.

The threshold of “3 out of 4 implications” was chosen to strike a balance between robustness and sensitivity. Requiring full consensus (4/4) eliminates too many borderline cases, while 2/4 introduces noise. The selected setting yielded the most coherent results in empirical validation with experts.
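The voting criterion can be sketched as below. The dictionary keys and the scores are hypothetical; only the thresholds 0.95, 0.1, and the three-vote minimum come from the text.

```python
from typing import Dict


def is_outlier_rule(
    outlierness_by_implication: Dict[str, float],
    coverage: float,
    kappa: float = 0.95,
    c_max: float = 0.1,
    min_votes: int = 3,
) -> bool:
    """Return True if the rule passes the coverage bound and enough implications agree."""
    if coverage > c_max:
        return False
    votes = sum(1 for score in outlierness_by_implication.values() if score >= kappa)
    return votes >= min_votes


# Hypothetical O(R_k) values of one rule under the four implications.
scores = {"product": 0.93, "lukasiewicz": 0.96, "ik1": 0.97, "ik2": 0.96}
print(is_outlier_rule(scores, coverage=0.0))  # three of four votes
```

Setting `min_votes=4` or `min_votes=2` reproduces the stricter and looser alternatives discussed above.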

Similarly, the thresholds $O(R_{k}) \geq 0.95$ and $C \leq 0.1$ were selected based on preliminary experiments and were confirmed in the sensitivity analysis in Section 5.3. Analysis and interpretation of the results showed that rules detected with some implications, e.g., $IK_{2}$, had a higher degree of outlierness $O(R_{k})$ (e.g., 0.96) than with other implications, e.g., Product.

An example of a fuzzy rule is as follows:

$\begin{matrix} \text{IF the complaint is submitted in the middle of spring} \\ \text{AND the submitter comes from a rich county (median household)} \\ \text{THEN in an average time} \\ \text{Consumer Financial Protection Bureau (CFPB) sends a complaint} \end{matrix}$

(7)

In our experiments, we tested three S-shaped functions for modeling the transition of coverage degree as follows:

S1: Logistic-like function

$g_{1}(r) = \frac{1}{1 + e^{-a(r - r_{0})}}$

where $a > 0$ is the steepness, and $r_{0}$ is the midpoint.
S2: Piecewise linear sigmoid

$g_{2}(r) = \begin{cases} 0 & \text{if } r < r_{1} \\ \frac{r - r_{1}}{r_{2} - r_{1}} & \text{if } r_{1} \leq r \leq r_{2} \\ 1 & \text{if } r > r_{2} \end{cases}$
S3: Quadratic S-shape

$g_{3}(r) = \begin{cases} 2(r - r_{1})^{2} & \text{if } r_{1} \leq r \leq \frac{r_{1} + r_{2}}{2} \\ 1 - 2(r_{2} - r)^{2} & \text{if } \frac{r_{1} + r_{2}}{2} < r \leq r_{2} \\ 0 & \text{otherwise} \end{cases}$

Naturally, it is likely that different functions, e.g., S-shaped, sigmoid, etc., will influence the final results. Currently, the authors are investigating and testing alternative options.
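The three functions can be sketched as follows. Two points are assumptions of the sketch rather than the printed formulas: `g3` normalizes its argument by $r_{2} - r_{1}$ (the usual Zadeh S-function, which coincides with the printed S3 when $r_{2} - r_{1} = 1$), and it returns 1 above $r_{2}$ so that it agrees with Equation (4). The default values of `a` and `r0` are placeholders.

```python
import math


def g1(r: float, a: float = 10.0, r0: float = 0.5) -> float:
    """S1: logistic-like function with steepness a and midpoint r0."""
    return 1.0 / (1.0 + math.exp(-a * (r - r0)))


def g2(r: float, r1: float, r2: float) -> float:
    """S2: piecewise linear ramp from 0 at r1 to 1 at r2."""
    if r < r1:
        return 0.0
    if r > r2:
        return 1.0
    return (r - r1) / (r2 - r1)


def g3(r: float, r1: float, r2: float) -> float:
    """S3: quadratic S-shape on [r1, r2], written with a normalized argument."""
    if r <= r1:
        return 0.0
    if r >= r2:
        return 1.0
    u = (r - r1) / (r2 - r1)  # position of r inside [r1, r2], between 0 and 1
    if u <= 0.5:
        return 2.0 * u * u
    return 1.0 - 2.0 * (1.0 - u) ** 2


for r in (0.20, 0.25, 0.30, 0.35, 0.40):
    print(r, round(g1(r, a=40.0, r0=0.3), 3), round(g2(r, 0.2, 0.4), 3), round(g3(r, 0.2, 0.4), 3))
```

All three functions pass through 0.5 at the midpoint of the transition interval and differ only in how quickly they approach 0 and 1.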

Example

In the experiment presented in this section, we used a total of 6 fuzzy variables, each associated with a specific property from the dataset. These variables, along with their universes of discourse and membership function configurations, are as follows:

Number of days in a year: $X = \{1, 2, \dots, 366\}$, represented by a triangular fuzzy set spring, supported over the interval [64, 218]. The universe of discourse runs from 1 to 366 so that all days are covered, including leap years. Calendar spring starts on 21 March and ends on 22 June; in terms of fuzzy logic, however, “spring” is understood here as “close to the spring season”, and the shorter label is used for brevity.
County per capita income: $Y = [5, 70]$ (in thousands), represented by the fuzzy label middle county, supported on [19, 55].
Median household income: $V = [5, 70]$ (in thousands), represented by the fuzzy label rich county, defined on [37, 65].
Number of days to send complaint: $Z = [0, 30]$, represented by the fuzzy label average time, defined on [2, 10].
GDP of the state: $W = [28, 16209]$ (in millions), represented by the fuzzy label very small amount, defined on [28, 126].
Group category: F is a non-fuzzy attribute with 4 linguistic values: Older American, Servicemember, Older American and Servicemember, and none.

Each fuzzy variable uses one triangular membership function in this specific example; however, the methodology supports multiple overlapping sets per variable. The fuzzy rule base was generated by expert-guided construction, selecting combinations of antecedents and consequents from the available attributes. In total, 2118 fuzzy rules were generated and tested. Rules were classified as outlier rules based on the thresholds described in Definition 1 and Equation (2).

The presented fuzzy rules and implications are as follows:

Let A be a fuzzy set representing the linguistic label spring in the set $X = \{1, 2, \dots, 366\}$, indicating the number of days in a year when a complaint is submitted. The membership function $\mu_{A}$ is given by

$\mu_{A}(x) = \begin{cases} \frac{x - 64}{77}, & x \in [64, 141] \\ \frac{-x + 218}{77}, & x \in [141, 218] \\ 0, & \text{otherwise} \end{cases}$

(8)

Let B represent the label middle county in the range $Y = [5, 70]$, depicting the per capita income in a county (in thousands) from which the complainant originates. Its membership function $\mu_{B}(y)$ is defined as

$\mu_{B}(y) = \begin{cases} \frac{y - 19}{18}, & y \in [19, 37] \\ \frac{-y + 55}{18}, & y \in [37, 55] \\ 0, & \text{otherwise} \end{cases}$

(9)

Similarly, let C represent the label average time in the interval $Z = [0, 30]$, indicating the number of days between receiving and sending the complaint to the company by Consumer Financial Protection Bureau (CFPB). Its membership function $\mu_{C}(z)$ is given by

$\mu_{C}(z) = \begin{cases} \frac{z - 2}{4}, & z \in [2, 6] \\ \frac{-z + 10}{4}, & z \in [6, 10] \\ 0, & \text{otherwise} \end{cases}$

(10)

Furthermore, let D represent the label rich county within the range $V = [5, 70]$, signifying the median household income (in thousands) of the complainant’s origin. The membership function $\mu_{D}(v)$ is defined by

$\mu_{D}(v) = \begin{cases} \frac{v - 37}{14}, & v \in [37, 51] \\ \frac{-v + 65}{14}, & v \in [51, 65] \\ 0, & \text{otherwise} \end{cases}$

(11)

Explanation: The central parameter $b = 51$ in the triangular membership function was determined based on the dataset including median household income values for all 50 U.S. states, ranging from USD 5k to USD 70k. The empirical median of these state-level values is USD 51k. Therefore, we set $b = 51$ as the peak of the membership function, meaning that $\mu_{D}(51) = 1$, and this value is considered the most typical case of a “rich county” in our model.

Similarly, let E denote the label very small amount in the range $W = [28, 16209]$, representing the average gross domestic product (GDP) from 2010 to 2014 for the state’s industry total (in millions of current dollars) from where the complainant originates. Its membership function $\mu_{E}(w)$ is given by

$\mu_{E}(w) = \begin{cases} \frac{w - 28}{49}, & w \in [28, 77] \\ \frac{-w + 126}{49}, & w \in [77, 126] \\ 0, & \text{otherwise} \end{cases}$

(12)

Finally, F is a non-fuzzy set representing one of the labels: Older American, Servicemember, Older American and Servicemember, or none.
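The five membership functions in Equations (8)–(12) share one triangular form and can be expressed with a single helper. The helper name `triangular` and the dictionary of parameters are conveniences of this sketch; the numeric triples (left foot, peak, right foot) are read off the equations.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function with support [a, c] and peak at b."""
    if not a < b < c:
        raise ValueError("expected a < b < c")
    if x < a or x > c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)  # rising edge
    return (c - x) / (c - b)      # falling edge


# (a, b, c) taken from Equations (8)-(12).
FUZZY_SETS = {
    "spring": (64, 141, 218),
    "middle county": (19, 37, 55),
    "average time": (2, 6, 10),
    "rich county": (37, 51, 65),
    "very small amount": (28, 77, 126),
}

print(triangular(141, *FUZZY_SETS["spring"]))      # peak of "spring"
print(triangular(51, *FUZZY_SETS["rich county"]))  # peak of "rich county"
print(triangular(300, *FUZZY_SETS["spring"]))      # outside the support
```

Defining each set by three numbers also makes it straightforward to add further overlapping labels per variable.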

The example fuzzy rule considered in the remainder of this section is the one given in Equation (7).

The fuzzy implications used in the experiment, apart from the Product and the Łukasiewicz implications, also include the newly proposed engineering implications

$IK_{1}(x, y) = \begin{cases} \frac{\sqrt{x y}}{x + y - x y}, & \text{if } x \neq 0 \lor y \neq 0 \\ 0, & \text{otherwise} \end{cases}$

(13)

and

$IK_{2}(x, y) = \begin{cases} \frac{x y}{x + y}, & \text{if } x \neq 0 \lor y \neq 0 \\ 0, & \text{otherwise} \end{cases}$

(14)

where $x, y \in [0, 1]$.
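Equations (13) and (14) translate directly into code. In the sketch below, the function names `ik1` and `ik2` are chosen for illustration, and the arguments are assumed to lie in [0, 1], as stated above.

```python
import math


def ik1(x: float, y: float) -> float:
    """Engineering implication IK_1 (Equation (13)) for x, y in [0, 1]."""
    if x == 0 and y == 0:
        return 0.0
    # For x, y in [0, 1], not both zero, x + y - x*y = 1 - (1 - x)(1 - y) > 0.
    return math.sqrt(x * y) / (x + y - x * y)


def ik2(x: float, y: float) -> float:
    """Engineering implication IK_2 (Equation (14)) for x, y in [0, 1]."""
    if x == 0 and y == 0:
        return 0.0
    return (x * y) / (x + y)


for x, y in [(1.0, 1.0), (0.5, 0.5), (0.0, 0.5)]:
    print(x, y, ik1(x, y), ik2(x, y))
```

Either function can replace the product in the numerator of Equation (3) when the degree of truth is computed under a different implication.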

Figure 1 illustrates the S-shape function used to compute the coverage degree C as defined in Equation (4). This function plays a crucial role in determining whether a fuzzy rule is considered an outlier rule, by setting a soft threshold on how many data objects must be covered by both the antecedent and consequent sets. The choice of the S-shape function enables gradual transitions rather than hard cutoffs, improving robustness in detecting borderline outliers. In our experiment, this function is applied uniformly to all fuzzy rules to compute their corresponding C values, which are then used in conjunction with the outlierness threshold $O(R_{k}) \geq 0.95$ to classify rules as outlier rules.

To detect outliers, the degree of truth (Equation (3)) must be computed; here, the Product implication is used. To evaluate the degree of sufficient coverage, the S-shape function shown in Figure 1 is used.

To illustrate how the proposed method works in practice, we present a simplified scenario derived from real data. In this example, we consider a complaint submitted on 31 March (day 90 of the year), where the complainant resides in a county with a per capita income of USD 54k. These attribute values are used to compute the membership degrees for the fuzzy sets “spring” (submission date, Equation (8)) and “middle county” (per capita income, Equation (9)). The objective is to demonstrate how these degrees influence the rule activation via fuzzy implication, and ultimately how the rule may be classified as an outlier rule based on the thresholds defined in Equations (2)–(4). Using the defined membership functions, the degrees are computed as

$\mu_{A}(90) = \frac{90 - 64}{77} \approx 0.338 \text{ and } \mu_{B}(54) = \frac{-54 + 55}{18} \approx 0.056$

Applying the Product implication, the rule value is approximately $0.338 \cdot 0.056 \approx 0.02$. This computation illustrates how fuzzy rules evaluate the relevance of input data, laying the foundation for identifying outliers based on defined thresholds.
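The arithmetic of this example can be reproduced directly. The sketch evaluates the rising edge of Equation (8) for day 90 and the falling edge of Equation (9) for a per capita income of 54 (thousand), then applies the Product implication; the variable names are illustrative.

```python
# Membership of day 90 in "spring": rising edge of Equation (8).
mu_spring = (90 - 64) / 77

# Membership of a per capita income of 54 (thousand) in "middle county":
# falling edge of Equation (9).
mu_middle_county = (55 - 54) / 18

# Product implication for this single object.
rule_value = mu_spring * mu_middle_county

print(round(mu_spring, 3), round(mu_middle_county, 3), round(rule_value, 3))
# prints: 0.338 0.056 0.019
```

Rounded to two decimals, the rule value is 0.02, the figure quoted in the text.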

Then, based on Equations (1)–(6), we have

$C = 0, \quad O(R_{k}) = 0.97$
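The step from $C = 0$ to the reported degree of outlierness follows from Equation (2). The derivation below assumes $0 < T \leq 1$; the value of T itself is not reported for this rule, so only the two values consistent with the result are shown.

```latex
\begin{aligned}
O(R_{k}) &= \min\{\max\{T,\, 1 - T\},\, 1 - C\} \\
         &= \min\{\max\{T,\, 1 - T\},\, 1\} && (C = 0) \\
         &= \max\{T,\, 1 - T\} = 0.97 \\
\Rightarrow\quad T &= 0.97 \ \text{ or } \ T = 0.03 .
\end{aligned}
```

In either case $O(R_{k}) = 0.97 \geq 0.95$ and $C = 0 \leq 0.1$, so both conditions for an outlier rule are met.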

Computations based on Equations (1)–(6) are carried out for each fuzzy rule. The criterion for identifying an outlier was set experimentally: if the degree of sufficient coverage $C \leq 0.1$ and the degree of outlierness $O(R_{k}) \geq 0.95$, then the rule is labeled as an outlier rule and every object firing it is considered an outlying object. For example, the rule given in Equation (7) is fired by outlying objects.
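The final classification step can be sketched as follows: once a rule meets both thresholds, the objects with non-zero membership in both its antecedent and its consequent are reported. The object IDs and membership values below are invented for the illustration.

```python
from typing import List, Sequence


def outlying_objects(
    object_ids: Sequence[int],
    mu_a: Sequence[float],
    mu_b: Sequence[float],
    outlierness: float,
    coverage: float,
    kappa: float = 0.95,
    c_max: float = 0.1,
) -> List[int]:
    """Return the IDs of objects firing the rule, provided it is an outlier rule."""
    if outlierness < kappa or coverage > c_max:
        return []  # not an outlier rule, so it marks no outlying objects
    return [
        object_id
        for object_id, a, b in zip(object_ids, mu_a, mu_b)
        if a > 0 and b > 0
    ]


# Hypothetical rule with O(R_k) = 0.97 and C = 0.
ids = [101, 102, 103]
print(outlying_objects(ids, [0.3, 0.0, 0.8], [0.1, 0.5, 0.0], outlierness=0.97, coverage=0.0))
```

Only object 101 has non-zero membership in both sets, so it is the only one returned.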

Figure 2a–e contain examples of membership functions utilized in the aforementioned fuzzy rules. These visual representations assist in understanding the behavior and distribution of membership values across different sets and their relationship to the rules defined in Equations (1)–(6).

As Table 1 shows, rule 85 is found to represent exceptional objects firing it, since it achieves $O(R_{k}) \geq 0.95$ for three out of four implications. Rules 121, 137, and 1649 achieve $O(R_{k}) \geq 0.95$ for two out of four implications and are also considered as representing the exceptional objects firing them. Tests were performed with the membership functions and fuzzy sets given in Section 4. Based on the given thresholds, fuzzy rules are considered as determining outliers if the thresholds are satisfied for at least three fuzzy implications. It is worth mentioning that the division applied is regular: a ratio of 1/2 for each range of data for each property. When tests were carried out on differently defined fuzzy sets (with irregular splits), a fuzzy rule was considered exceptional by, for example, only one implication (Equation (12)). Thus, it was concluded that properly constructed fuzzy sets can affect the accuracy of exception detection by the IF–THEN method.

As a result, we obtained the following four unique fuzzy rules $D_{out1}$, $D_{out2}$, $D_{out3}$, and $D_{out4}$:

85. IF the complaint is submitted in the middle of spring AND the submitter comes from a rich county (median household), THEN in an average time CFPB sends a complaint.
121. IF the complaint is submitted in early winter AND the submitter comes from a rich county (median household) THEN in an average time CFPB sends a complaint.
137. IF the complaint is submitted in the middle of spring AND the submitter comes from a rich county (median household), THEN in an average time CFPB sends a complaint.
1649. IF a complaint is submitted by an older American or a service member AND concerned about a state that has a very small amount (GDP), THEN the submitter comes from a rich county (median household).

These rules are associated with 32 objects. The IDs of the objects are

$D_{out} = \{28939, 41365, 41683, 43196, 44358, 364520, 372521, 375975, 377404, 383137, 389866, 395693, 550401, 630491, 659478, 744965, 755635, 755712, 760146, 763847, 773246, 788230, 792773, 801371, 801691, 804591, 805340, 805828, 819496, 833603, 948708, 1115287\}$.

Definition 1 allows the detection and, more importantly, the recognition of specific objects (outliers). We identified 4 outlier rules, which were triggered by 32 distinct outlying objects. The method has been tested on real-world data from the Consumer Complaint Database maintained by the U.S. Consumer Financial Protection Bureau (CFPB). This dataset includes over 40,000 anonymized consumer complaints along with metadata such as complaint type, submission date, consumer status (e.g., Older American, Servicemember), and economic indicators (e.g., county income, GDP). Outliers in this dataset may indicate incorrect data input, inconsistencies, or genuinely anomalous patterns related to consumer behavior or reporting.

5. Results and Discussion

The dataset was also analyzed using the widely adopted Local Outlier Factor (LOF) method for outlier detection. However, as LOF relies solely on numerical data, it could not incorporate all object properties, particularly those represented linguistically. For example, linguistic attributes such as “Older American” or “Average Time” were excluded from the analysis. This limitation highlights the advantage of the proposed fuzzy logic approach (see Table 2) in handling datasets with heterogeneous attributes, where linguistic and numerical features coexist. In this analysis, only the following object properties were used: Id, zipcode, date received, county per capita income, and time of sending complaint. The dataset used in our study is derived from the Consumer Complaint Database [27] maintained by the U.S. Consumer Financial Protection Bureau. It contains anonymized records of consumer complaints submitted over several years, and includes both categorical and numerical fields. Key variables include complaint category, submission date, consumer demographics (e.g., whether the complainant is an older American or service member), product or issue type, ZIP code, geographic and economic indicators (such as county per capita income and median household income), and the number of days between submission and resolution. A full description of all parameters is available in [27]. A detailed description of the input parameters for the LOF algorithm (using the scikit-learn library) is available at [28].

Using the default LOF settings (in the scikit-learn library), the algorithm classified 3206 objects as outliers from a total of 40,083 objects in the dataset, representing 8% of the data. Subsequently, it was decided to evaluate the performance of the LOF algorithm under various parameter values. These results are shown in Table 3.

Note: Although the dataset contains 40,083 entries, the object IDs are non-contiguous due to anonymization and prior filtering. Thus, some IDs may appear to exceed the stated dataset size.

To construct Table 3, we tested the LOF method using a range of parameter values. Some configurations use default settings from the scikit-learn implementation, while others were manually adjusted to analyze the method’s sensitivity. Specifically, we varied the contamination parameter to reflect different assumptions about the expected proportion of outliers and tested several distance metrics (e.g., Minkowski, correlation) to observe their impact on detection results. These choices help illustrate the instability of LOF under small parameter shifts, as shown in the resulting number of detected outliers.

It is worth noting that, in Row 3 of Table 3, setting the contamination parameter to 0.0012 results in a relatively low F1 score. In addition, changing the distance metric from Minkowski to correlation results in a noticeable decrease in the number of objects proposed as outliers by LOF. This illustrates that LOF is highly sensitive not only to contamination, but also to other parameters, such as the choice of metric or number of neighbors. Such sensitivity makes the method difficult to tune consistently across datasets or application domains, especially when no ground truth is available for validation.

As shown in Table 4, the IF–THEN method is able to find 32 new true-positive outliers. The LOF method is very sensitive to the input parameters of the classification. The most satisfactory results were obtained for the parameters displayed in the last row of Table 3. Verification of the detected outliers was performed empirically. By empirical verification, we mean that the detected outliers were manually reviewed with regard to their feature values and combinations. In the case of the LOF method, several detected outliers showed minor deviations in numerical values, but without clear semantic anomalies. In contrast, the fuzzy IF–THEN method identified records that contained inconsistent or extreme attribute combinations (e.g., very short resolution time combined with low GDP and special demographic status). Furthermore, we consulted these results with domain experts from the banking sector, who confirmed that a subset of the 32 outliers identified by the fuzzy method were genuinely interesting from a risk analysis and anomaly detection perspective. This reinforces the practical relevance of our approach. Among the various parameter configurations, the contamination parameter, which controls the proportion of outliers in the dataset, was modified. For these specific parameters, the LOF method identified 6, 9, and 11 objects as outliers, which were considered exceptional. The record IDs of these outliers are listed above.

5.1. Comparison with Other Fuzzy-Based Outlier Detection Methods

In addition to the comparison with the numerical Local Outlier Factor (LOF) approach, we evaluated our method against selected fuzzy-based anomaly detection methods from the literature as follows:

Modified Fuzzy Clustering for Intrusion Detection [17]—based on clustering density deviations.
Fuzzy Neural Networks for Cyber Anomaly Detection [20]—a hybrid method combining learning-based detection with fuzzy rule layers.
Fuzzy Logic for Time Series Outliers [15]—a rule-less fuzzy method for detecting irregular points in temporal data.
Explainable Unsupervised Anomaly Detection with Random Forest [29]—an unsupervised tree-based method that distinguishes real data from synthetically generated samples and provides local interpretability of outlier decisions.

Each method was tested on a subset of our data (converted to match format and features when needed). The following Table 5 compares the results.

The method, in its current version, does not follow the traditional training–testing paradigm. It requires no training data: it relies entirely on expert knowledge, which covers both the construction of fuzzy rules and the manual assignment of the reference labels used for evaluation. No automated learning procedure is involved. Our method demonstrates comparable or better performance in terms of detection, while providing higher interpretability and the ability to incorporate domain expert knowledge directly via linguistic labels and fuzzy rules.

We tested combinations of rule coverage thresholds $C \in \{0.0, 0.05, 0.1, 0.2\}$ with $C_{\max} \in \{0.0, 0.1\}$, ensuring that $C \leq C_{\max}$. The value of $C_{\max}$ was chosen based on experimental observations to exclude overly general rules from anomaly detection: specifically, if a rule covers more than 10% of the dataset, the matched records are unlikely to represent true exceptions.
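The effect of a coverage cap can be sketched in a few lines of Python. The example assumes, for illustration only, that the coverage of a rule is the fraction of records matched by its antecedent, and it uses crisp predicates instead of fuzzy membership degrees; the records, the rule names, and the helpers `coverage` and `keep_specific_rules` are hypothetical and do not come from the paper.

```python
def coverage(rule, records):
    """Fraction of records matched by the rule's antecedent (assumed crisp definition)."""
    return sum(1 for record in records if rule(record)) / len(records)

def keep_specific_rules(rules, records, c_max):
    """Return {rule name: coverage} for rules whose coverage does not exceed c_max."""
    kept = {}
    for name, rule in rules.items():
        c = coverage(rule, records)
        if c <= c_max:
            kept[name] = c
    return kept

# Hypothetical toy records: 20 complaints with a season and a county income (in thousands).
records = [{"season": "spring" if i % 2 == 0 else "winter", "income": 30 + i}
           for i in range(20)]

# Hypothetical rule antecedents, written as crisp predicates for brevity.
rules = {
    "spring": lambda r: r["season"] == "spring",    # matches 10 of 20 records
    "rich_county": lambda r: r["income"] >= 49,     # matches 1 of 20 records
    "spring_and_upper_income":
        lambda r: r["season"] == "spring" and r["income"] >= 46,  # matches 2 of 20 records
}

kept = keep_specific_rules(rules, records, c_max=0.1)
print(kept)
```

With a cap of 0.1, the general rule "spring" (coverage 0.5) is discarded, while the two narrow rules survive; this is the sense in which a coverage threshold removes overly general rules from consideration.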

Instead of conducting formal hypothesis testing, we evaluated the impact of coverage thresholds through comparative analysis of rule activation and outlier detection behavior. The results showed a clear trend: lower C values produce fewer but more specific rules that highlight semantically atypical data records, while higher values generate broader, less focused patterns.

This finding aligns with our analysis of the LOF algorithm, where parameter tuning (e.g., contamination) similarly influences sensitivity and selectivity. However, the fuzzy system provides clearer interpretability, as its thresholds reflect semantic reasoning about the generality of rules.

A comparison of these results with the outcomes of the IF–THEN method in Table 6 shows that, although only four outlier rules are explicitly listed, they cover a broader range of records, several of which were not identified by the LOF method under various configurations. This suggests that the fuzzy IF–THEN approach can reveal anomalies based on semantic combinations of attributes, rather than solely on distance-based numerical deviation. This demonstrates the main advantage of the proposed approach: it allows us to reveal outliers (validated by experts) that cannot be identified without the specific linguistic knowledge and fuzzy representations provided.

To evaluate the performance of the IF–THEN method for detecting and recognizing outliers based on fuzzy rules, we compared it with an existing method from the literature, LOF. Both methods were run on the same database, and the results of this outlier detection experiment are shown in Table 6.

In Table 6, we compare a subset of results produced by the LOF algorithm and the IF–THEN rule-based method. While we used standard contamination values in our comparisons (e.g., 0.0002 and 0.0003), we acknowledge that exploring smaller thresholds with expert-based validation may help fine-tune the method’s sensitivity. This is planned as part of future work. The contamination parameter for LOF was set to 0.0002, which yielded six correctly detected outliers. In contrast, the IF–THEN method correctly detected another 32 outliers, which are different from those detected by LOF (see Table 6, column Ids objects). This difference in the number of outliers is intentional and reflects the distinct mechanisms used by each approach. While LOF detects local density deviations, the fuzzy method identifies records that violate semantic consistency among input variables. In addition, we manually inspected representative outliers from both methods. The LOF method tended to flag records with slight numeric anomalies, whereas the fuzzy method often identified more semantically inconsistent cases, for example, complaints with extremely short resolution times from high-GDP counties involving special-status groups. These results suggest that the fuzzy rule-based method complements numerical approaches by offering interpretability and deeper contextual insights.
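The claim that the two result sets are disjoint can be checked directly from the identifiers in Table 6. The short Python sketch below assumes that the commas inside each identifier in the table are thousands separators (so that, e.g., "689,889" denotes object 689889); the variable names are illustrative.

```python
# Object IDs from Table 6, read with thousands separators removed (assumption).
lof_ids = {689889, 725545, 725546, 744072, 1038302, 1087276}

if_then_ids = {
    28939, 41365, 41683, 43196, 44358, 364520, 372521, 375975,
    377404, 383137, 389866, 395693, 550401, 630491, 659478, 744965,
    755635, 755712, 760146, 763847, 773246, 788230, 792773, 801371,
    801691, 804591, 805340, 805828, 819496, 833603, 948708, 1115287,
}

overlap = lof_ids & if_then_ids    # objects flagged by both methods
combined = lof_ids | if_then_ids   # objects flagged by at least one method

print(len(lof_ids), len(if_then_ids), len(overlap), len(combined))  # 6 32 0 38
```

Under this reading the intersection is empty, so using the two methods together yields 38 distinct candidate objects rather than 32, which is consistent with the complementary role described above.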

5.2. Comparison with Machine Learning Techniques

In addition to statistical approaches (e.g., LOF, Random Forest) and other fuzzy-based methods, it is worth discussing how the proposed fuzzy IF–THEN technique compares with machine learning models, particularly deep neural networks with regularization mechanisms such as dropout layers. Neural networks have demonstrated high accuracy in various anomaly detection and fraud detection tasks; however, they typically require (i) extensive labeled datasets for training, (ii) significant computational resources, and (iii) careful hyperparameter tuning. Moreover, their decision-making process is often considered opaque, making it challenging to interpret why a particular object is flagged as an outlier—a key limitation in sensitive domains like finance and banking.

By contrast, the fuzzy rule-based approach presented in this study does not rely on labeled data or prior distributional assumptions and can directly integrate linguistic expert knowledge (e.g., attributes such as “Older American” or “Average Time”). This integration improves interpretability and facilitates expert validation of results. While advanced neural architectures may outperform fuzzy rules on purely numerical datasets, the proposed method provides complementary value by effectively handling heterogeneous datasets (combining numerical and linguistic attributes) and offering transparent decision criteria through explicit fuzzy rules.

As seen in Table 3, the LOF method appears unstable: it returns different sets of outliers under very small changes in its input parameters. Moreover, the sizes of these sets differ markedly from one another and from the set of outlying objects proposed by the IF–THEN method. Thus, we draw the conclusions below.

Sensitivity of LOF to its input parameters.

As illustrated in Table 3, the LOF method is sensitive to its input parameters, e.g., contamination or metric. The numbers of detected outliers (6, 9, 11, 33) change markedly with very small changes in input values.

Empirical evaluation of results. The validation of detected outliers was supported by two domain experts: one data analyst with over ten years of experience in financial supervision and one risk management officer from a commercial bank specializing in credit risk assessment. They independently reviewed selected records flagged by the fuzzy IF–THEN method and confirmed that many of them revealed atypical or inconsistent attribute combinations, such as implausibly short resolution times for high-income regions involving vulnerable consumer groups. This external evaluation reinforced the validity and interpretability of the fuzzy-based anomaly detection method. Moreover, the use of linguistic input variables significantly enhanced interpretability. Since the fuzzy rules are constructed using human-readable terms (e.g., “high income”, “very short resolution time”, “vulnerable consumer”), domain experts could intuitively understand the logic behind each flagged observation. This transparency allowed them to validate not only the anomaly itself but also the rationale for its detection. Importantly, some flagged cases were not statistical outliers in any single attribute but were considered exceptional due to the combination of conditions—something that emerged naturally from the linguistic, rule-based representation. This supports the method’s usefulness as claimed in the abstract, particularly in contexts involving imprecise or human-centric data representations.

The comparison in Table 6 shows that the IF–THEN method detected 32 outlying objects based on fuzzy rules. The six objects detected by LOF do not overlap with these 32, highlighting the distinct nature of the outliers identified by each approach.

Specificity of data and limitations. The processed dataset contains both numerical and linguistic values; the LOF method cannot process the linguistic values, whereas the IF–THEN method can.

Based on expert knowledge stored in fuzzy sets and rules, the IF–THEN method detected further exceptions that the LOF method would not have found. Examples of properties that the numerical methods did not take into account are “older American, service member” and “average time”.

The obtained results were verified empirically. In addition, the outliers found by the IF–THEN method were consulted with banking experts.

5.3. Sensitivity to Parameters

To evaluate the robustness of the method, we performed a parameter sensitivity analysis by varying the following:

Fixed Threshold $κ$ (see Definition 1): tested values in {0.90, 0.92, 0.95, 0.98}.
Coverage threshold $C_{\max}$: tested values in {0.0, 0.05, 0.1, 0.2}.
Required Implications: two out of four, three out of four, four out of four implications.

The results show (Table 7) that the number of detected outlying objects is highly sensitive to parameter changes. This underlines the importance of expert-driven calibration or adaptive optimization techniques in future work.
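A grid sweep of this kind can be organised as in the Python sketch below. It rests on an assumed decision rule that is not taken from Definition 1: a rule is accepted when at least `required` of its four outlierness values (one per implication operator) reach the threshold κ. The outlierness values and rule names are hypothetical toy data, so the printed counts are not expected to reproduce Table 7.

```python
from itertools import product

# Hypothetical outlierness values per rule, one per implication operator.
outlierness = {
    "rule_a": (0.99, 0.98, 0.97, 0.99),
    "rule_b": (0.96, 0.97, 0.91, 0.95),
    "rule_c": (0.93, 0.96, 0.90, 0.92),
    "rule_d": (0.91, 0.89, 0.93, 0.90),
}

def count_accepted(kappa, required):
    """Count rules for which at least `required` values reach kappa (assumed vote)."""
    return sum(
        1 for values in outlierness.values()
        if sum(v >= kappa for v in values) >= required
    )

# Sweep the threshold and the number of required implications.
for kappa, required in product((0.90, 0.92, 0.95, 0.98), (2, 3, 4)):
    print(f"kappa={kappa:.2f}, required={required}/4 -> {count_accepted(kappa, required)} rules")
```

Even on this small example, the number of accepted rules drops from four to one as κ moves from 0.90 to 0.98 with three required implications, which illustrates why calibration of these parameters matters.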

It should be noted that both the fuzzy IF–THEN approach and the LOF algorithm were evaluated on the entire dataset, consisting of over 40,000 consumer complaint records. No data subsetting was performed, and both methods operated on the same preprocessed version of the full dataset. This ensured a fair and consistent comparison of detection results across all experiments.

6. Conclusions

Effectiveness of Fuzzy Logic Application: The proposed IF–THEN method for detecting outliers based on fuzzy logic and IF–THEN rules proved effective in identifying anomalies in data, particularly those represented linguistically. This enables the detection of outliers that traditional numerical methods, such as LOF (Local Outlier Factor), fail to identify.
Universality of the Approach: The new definition of outliers, grounded in fuzzy rules, demonstrated its universal applicability to both relational and non-relational datasets. This increases its potential utility in various fields, including finance, medicine, and security analysis.
Importance of Expert Knowledge: The integration of expert knowledge through linguistic descriptions of data allowed for more precise identification of the unique characteristics of outlier objects. This approach emphasizes the importance of interpretability and expert validation, enhancing the credibility of the obtained results.
Advantages Compared to LOF: A comparison between the results of the LOF method and the fuzzy logic-based approach showed that the IF–THEN method identifies outliers that would not be detected by analyses solely based on numerical data. This is particularly evident in the case of properties expressed in linguistic terms, such as “Older American” or “Average Time.”
Potential for Integration with Other Methods: The fuzzy rule-based method can act as a complement to existing algorithms, such as LOF, enriching results by identifying additional outliers. This approach is particularly useful in the analysis of multifaceted datasets, where the diversity of features requires the application of different analytical techniques.
Significance of Parameters and Their Interpretation: The introduction of parameters, such as the degree of coverage (C) and the degree of outlierness ( $O (R_{k})$ ), enables precise definition of the conditions for classifying objects as outliers. The selection of appropriate functions, such as S-shaped functions, significantly impacts the accuracy of outlier detection, as demonstrated in the experiments. We note that the fuzzy inference model employed S-shaped membership functions to represent continuous variables such as income, population, and resolution time. This type of function was selected for its smooth transition between linguistic categories, making it suitable for modeling gradual phenomena. As mentioned earlier in Section 3, preliminary experiments confirmed that S-shaped functions led to stable rule activation patterns and consistent outlier detection. Although a formal comparison with alternative shapes was not included, the results support their use as semantically meaningful and operationally effective. A comparative evaluation of different membership function types remains a valuable direction for future work.
Practical Application and Validation: The outlier detection results were empirically verified and consulted with banking experts, confirming the method’s practical utility in real-world business scenarios.
Impact of Dataset Structure: Experimental observations indicate that the proper definition and construction of fuzzy sets influence the effectiveness of outlier detection. Future research may focus on optimizing these parameters for specific applications.
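The S-shaped membership functions referred to in the conclusions above can be illustrated with a short Python sketch. It assumes the standard Zadeh S-function with the crossover point midway between the two breakpoints; the exact parametrisation used in the paper (Figure 1) may differ, and the breakpoints 30 and 60 (income in thousands) for a label such as "rich county" are hypothetical.

```python
def s_function(x, a, c):
    """Zadeh's S-shaped membership function rising smoothly from 0 at a to 1 at c."""
    if a >= c:
        raise ValueError("expected a < c")
    b = (a + c) / 2.0  # crossover point, where the membership degree is 0.5
    if x <= a:
        return 0.0
    if x <= b:
        return 2.0 * ((x - a) / (c - a)) ** 2
    if x < c:
        return 1.0 - 2.0 * ((x - c) / (c - a)) ** 2
    return 1.0

# Hypothetical breakpoints for the label "rich county" (income in thousands).
for income in (30.0, 37.5, 45.0, 52.5, 60.0):
    print(income, s_function(income, a=30.0, c=60.0))
```

The membership degree rises gradually (0, 0.125, 0.5, 0.875, 1 for the incomes above) rather than jumping at a single cut-off, which is the smooth transition between linguistic categories that motivates this choice of function.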

7. Future Work

The proposed fuzzy logic-based IF–THEN method for outlier detection demonstrates significant potential for further development and adaptation. The following areas have been identified for future research and improvement:

Dynamic Optimization of Parameters: Future research could explore adaptive methods for selecting the key parameters of the fuzzy rules, such as the degree of coverage (C) and the degree of outlierness ( $O (R_{k})$ ). This could involve machine learning techniques to dynamically optimize these parameters based on the characteristics of the dataset, ensuring better performance across different domains and applications.
Application to Real-Time Systems: The integration of the proposed method into real-time data processing systems presents a promising direction. Real-time anomaly detection is particularly relevant for domains such as cybersecurity, where timely identification of outliers (e.g., network intrusions) is critical. Implementing the method in a streaming data environment would require further optimization of computational efficiency.
Extending to Heterogeneous Data: While the current study focuses on specific datasets, future work could evaluate the method’s applicability to more diverse and complex datasets, such as those combining numerical, categorical, textual, and temporal data. This would validate the method’s robustness in handling heterogeneous data structures.
Integration with Other Outlier Detection Methods: To further enhance the detection of anomalies, future research could investigate the hybridization of the fuzzy logic-based approach with traditional numerical methods, such as Local Outlier Factor (LOF) or clustering algorithms. This integration could provide a more comprehensive outlier detection framework, leveraging the strengths of both numerical and linguistic representations.
Domain-Specific Applications: The IF–THEN method could be customized and tested in domain-specific scenarios, such as the following:
- Healthcare: Detecting outliers in patient data to identify rare symptoms or unusual disease progression.
- Finance: Identifying fraudulent transactions or irregular patterns in banking datasets.
- IoT Systems: Recognizing anomalous behavior in sensor networks or smart devices.
Exploration of Explainability: The linguistic nature of the fuzzy logic-based method inherently enhances interpretability. Future work could focus on improving the explainability of the results by developing tools or frameworks that visualize the fuzzy rules and their contributions to the identification of outliers, aiding decision-making by domain experts.
Scalability and Performance: Expanding the IF–THEN method to accommodate large-scale datasets is another area for future work. This includes optimizing the computational complexity of the fuzzy rule evaluation process and investigating distributed or parallel processing techniques to improve scalability.

By addressing these areas, the proposed fuzzy logic-based outlier detection IF–THEN method could be significantly enhanced, enabling its adoption in a wider range of applications and making it a valuable tool for modern data analysis challenges.

Author Contributions

Conceptualization, M.K.; methodology, M.K.; software, M.K.; validation, M.K. and A.N.; formal analysis, M.K.; investigation, M.K.; resources, A.N.; data curation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, M.K. and A.N.; visualization, M.K.; supervision, A.N.; project administration, A.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Hawkins, D.M. Identification of Outliers; Springer: Berlin/Heidelberg, Germany, 1980; Volume 11.
Aggarwal, C.C. Outlier Detection in Categorical, Text, and Mixed Attribute Data. In Outlier Analysis; Springer: Berlin/Heidelberg, Germany, 2017; pp. 249–272.
Campos, G.O.; Zimek, A.; Sander, J. On the Evaluation of Outlier Rankings and Outlier Scores. In Proceedings of the KDD ’17 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017.
Aggarwal, C.C.; Yu, P.S. Outlier detection for high dimensional data. In Proceedings of the ACM Sigmod Record, Santa Barbara, CA, USA, 21–24 May 2001; ACM: New York, NY, USA, 2001; Volume 30, pp. 37–46.
Breunig, M.M.; Kriegel, H.P.; Ng, R.T.; Sander, J. LOF: Identifying density-based local outliers. In Proceedings of the ACM Sigmod Record, Dallas, TX, USA, 15–18 May 2000; ACM: New York, NY, USA, 2000; Volume 29, pp. 93–104.
Tang, J.; Chen, Z.; Fu, A.W.C.; Cheung, D.W. Enhancing Effectiveness of Outlier Detections for Low Density Patterns. In Knowledge and Information Systems; Springer: Berlin/Heidelberg, Germany, 2002.
Knorr, E.M.; Ng, R.T.; Tucakov, V. Distance-based outliers: Algorithms and applications. VLDB J.—Int. J. Very Large Data Bases 2000, 8, 237–253.
Breunig, M.M.; Kriegel, H.P.; Sander, J. Density-Based Clustering of Spatial Data with Noise. In Proceedings of the KDD ’01 Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 26–29 August 2001.
Song, X.; Wu, Q.J.; Jermaine, C. Conditional Anomaly Detection. In Proceedings of the SIGMOD ’07 2007 ACM SIGMOD International Conference on Management of Data, Beijing, China, 11–14 June 2007.
Bartczak, M.; Niewiadomski, A. Linguistic summaries of graph databases in customer relationship management (CRM). J. Appl. Comput. Sci. 2019, 27, 7–26.
Bartczak, M.; Kacprowicz, M. Podniesienie Poziomu Bezpieczeństwa Danych Bankowych Poprzez Wykrywanie Wyjątków; Zeszyty Naukowe Zbliżenia Cywilizacyjne, State Vocational University in Wloclawek: Włocławek, Poland, 2021. (In Polish)
Niewiadomski, A.; Kacprowicz, M.; Bartczak, M. Outliers Detection In Graph-Represented Databases Using Fuzzy Rules. In Proceedings of the Pacific Asia Conference on Information Systems, PACIS 2021, Dubai, United Arab Emirates, 12–14 July 2021.
Kacprowicz, M.; Bartczak, M.; Niewiadomski, A. Detection and recognition of outliers by the use of IF-THEN rules. In Proceedings of the 3rd Polish Conference on Artificial Intelligence, PP-RAI’2022, Gdynia, Poland, 25–27 April 2022.
Wang, J.F.; Liu, X.; Zhao, H.; Chen, X.C. Anomaly Detection of Complex Networks Based on Intuitionistic Fuzzy Set Ensemble. Chin. Phys. Lett. 2018, 35, 058901.
Suresh, S.; Kannan, K. Identifying outliers in fuzzy time series. J. Mod. Appl. Stat. Methods 2011, 10, 30.
Cateni, S.; Colla, V.; Vannucci, M. A fuzzy logic-based method for outliers detection. In Proceedings of the 25th Multi-Conference on Applied Informatics, Innsbruck, Austria, 12–14 February 2007.
Harish, B.S.; Kumar, S.V.A. Anomaly based Intrusion Detection using Modified Fuzzy Clustering. Int. J. Interact. Multimed. Artif. Intell. 2017, 4.
Garg, S.; Batra, S. A novel ensembled technique for anomaly detection. Int. J. Commun. Syst. 2019, 30, e3248.
Novaes, M.P.; Carvalho, L.F.; Lloret, J.; Proença, M.L.J. Long Short-Term Memory and Fuzzy Logic for Anomaly Detection and Mitigation in Software-Defined Network Environment. IEEE Access 2020, 8, 83765–83781.
Mungara, K.K.; Gopi, V.; Kumar, M.K. Detection of Cyber Anomaly Using Fuzzy Neural Networks. J. Eng. Sci. 2020, 11, 48–53.
Moniruzzaman, A.B.M.; Hossain, S.A. NoSQL Database: New Era of Databases for Big data Analytics—Classification, Characteristics and Comparison. Int. J. Database Theory Appl. 2013, 6. Available online: https://www.researchgate.net/publication/243963821_NoSQL_Database_New_Era_of_Databases_for_Big_data_Analytics_-_Classification_Characteristics_and_Comparison (accessed on 26 November 2025).
Bartczak, M.; Kacprowicz, M. Detekcja wyjątków metodami agregacji rozmytej w grafowych systemach CRM. In Wyzwania Gospodarcze, Polityczne i Społeczne w Globalnej Gospodarce; State Academy of Applied Sciences in Włocławek: Włocławek, Poland, 2022. (In Polish)
Kosko, B. Fuzziness vs. probability. Int. J. Gen. Syst. 1990, 17, 211–240.
van den Berg, J.; Kaymak, U.; van den Bergh, W.M. Fuzzy classification using probability-based rule weighting. In Proceedings of the 2002 IEEE International Conference on Fuzzy Systems, Honolulu, HI, USA, 12–17 May 2002; pp. 991–996.
Klir, G.J.; Yuan, B. Fuzzy Sets and Fuzzy Logic: Theory and Applications; Prentice-Hall: Upper Saddle River, NJ, USA, 1995.
Wu, D.; Mendel, J.; Joo, J. Linguistic summarization using IF-THEN rules. In Proceedings of the IEEE International Conference on Fuzzy Systems, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
Consumer Complaint Database. Available online: https://catalog.data.gov/dataset/consumer-complaint-database (accessed on 30 June 2020).
Documentation for the Scikit-Learn Library. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html (accessed on 30 June 2020).
Xu, C.; Wang, J.; Li, H.; Wang, W. Explainable Unsupervised Anomaly Detection with Random Forest. Future Internet 2023, 15, 103.

Figure 1. S-shape function used in implementational example.

Figure 2. Examples of membership functions. (a) Season of the year. (b) Per capita income (in thousands) in the county of the person submitting the complaint. (c) Median household income (in thousands) in the county of the person submitting the complaint. (d) Period between the CFPB (Consumer Financial Protection Bureau) receiving the complaint and sending it to the company. (e) Average gross domestic product (GDP) in 2010–2014, all industry total (in millions of current dollars), for the state of the person submitting the complaint.

Table 1. Four outlier rules with their respective values of $O (R_{k})$ and $C$, indicating the objects classified as outlying.

Rule No.	Fuzzy Rules	$PROD$: $O (R_{k})$	$PROD$: $C$	$LUK$: $O (R_{k})$	$LUK$: $C$	${IK}_{1}$: $O (R_{k})$	${IK}_{1}$: $C$	${IK}_{2}$: $O (R_{k})$	${IK}_{2}$: $C$
85.	IF the complaint is submitted in the middle of spring AND the submitter comes from a rich county (median household) THEN in an average time CFPB sends a complaint.	0.94	0	0.97	0	0.96	0	0.96	0
121.	IF the complaint is submitted in early winter AND the submitter comes from a rich county (median household) THEN in an average time CFPB sends a complaint.	0.94	0	0.97	0	0.86	0	0.95	0
137.	IF the complaint is submitted in the middle of spring AND the submitter comes from a rich county (median household) THEN in an average time CFPB sends a complaint.	0.90	0	0.97	0	0.86	0	0.96	0
1649.	IF a complaint is submitted by OlderAmerican, Service-member AND concerned about a state which has a very small amount (GDP) THEN the submitter comes from a rich county (median household).	0.90	0	0.92	0	0.96	0	0.94	0

Table 2. Comparison of the IF–THEN method and LOF method.

Feature	IF–THEN Method	LOF Method
Handles linguistic attributes	Yes	No
Sensitivity to data structure	Low to Medium	High
Expert knowledge integration	Yes	No
Truly detected outliers	32	0 to 6

Table 3. Results of the LOF method.

n_neighbors	algorithm	leaf_size	metric	p	metric_params	contamination	Accuracy	F1	Number of Objects Proposed as Outliers
20	auto	30	minkowski	2	None	auto	91.88%	0.0012	3206
50	auto	100	jaccard	2	None	auto	99.87%	--	0
20	auto	30	correlation	4	None	0.0012	99.81%	0.0714	33
20	auto	30	correlation	4	None	0.0004	99.85%	0.0322	11
20	auto	30	correlation	2	None	0.0004	99.85%	0.0322	11
20	auto	30	correlation	2	None	0.0003	99.86%	0.0333	9
20	auto	30	correlation	2	None	0.0002	99.89%	0.2105	6

Table 4. Confusion matrices for (a) LOF method (see Table 3) and (b) IF–THEN method.

(a) LOF method
	Predicted condition
Actual condition	Positive (PP)	Negative (PN)
Positive (P)	6	45
Negative (N)	0	40,032
accuracy = $99.89 %$ F1 = $0.2105$
(b) IF–THEN method
	Predicted condition
Actual condition	Positive (PP)	Negative (PN)
Positive (P)	32	19
Negative (N)	0	40,032
accuracy = $99.95 %$ F1 = $0.7710$
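
The accuracy and F1 values reported with the confusion matrices in Table 4 follow from the standard definitions, accuracy = (TP + TN) / (TP + FN + FP + TN) and F1 = 2TP / (2TP + FP + FN). The Python sketch below recomputes them from the counts in the table; the function name `accuracy_and_f1` is illustrative.

```python
def accuracy_and_f1(tp, fn, fp, tn):
    """Accuracy and F1 score from the four cells of a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return accuracy, f1

# Counts taken from Table 4: (a) LOF method, (b) IF-THEN method.
lof = accuracy_and_f1(tp=6, fn=45, fp=0, tn=40032)
if_then = accuracy_and_f1(tp=32, fn=19, fp=0, tn=40032)

print(f"LOF:     accuracy = {lof[0]:.2%}, F1 = {lof[1]:.4f}")
print(f"IF-THEN: accuracy = {if_then[0]:.2%}, F1 = {if_then[1]:.4f}")
```

Because true negatives dominate both matrices (40,032 of 40,083 objects), accuracy is above 99.8% for either method and barely separates them; the F1 score, 12/57 ≈ 0.2105 for LOF and 64/83 ≈ 0.7711 for IF–THEN (0.7710 in the table, up to rounding), is the more informative figure here.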

Table 5. Comparison of IF–THEN method with other fuzzy-based techniques.

Method	Linguistic Input Support	Interpretability	Outlying Objects Detected	Accuracy	F1
IF–THEN (ours)	Yes	High (explicit rules)	32	99.95%	0.7710
Fuzzy clustering [17]	No	Low (centroid-based)	19	99.92%	0.5142
Fuzzy neural network [20]	Partially	Medium (learned rules)	27	99.91%	0.5128
Fuzzy time-series [15]	No	Medium	22	99.88%	0.3896
Random Forest [29]	No	Low	26	99.89%	0.4155

Table 6. Comparison of LOF and IF–THEN methods.

Method	Parameters	Number of Detected Outlying Objects	Ids Objects
LOF	n_neighbors = 20; algorithm = auto; leaf_size = 30; metric = correlation; p = 2; metric_params = None; contamination = 0.0002; novelty = False; n_jobs = None	6	689889, 725545, 725546, 744072, 1038302, 1087276
IF–THEN method	$κ = 0.95$; $C_{\max} = 0.1$; Required Implications = 3 out of 4	32	28939, 41365, 41683, 43196, 44358, 364520, 372521, 375975, 377404, 383137, 389866, 395693, 550401, 630491, 659478, 744965, 755635, 755712, 760146, 763847, 773246, 788230, 792773, 801371, 801691, 804591, 805340, 805828, 819496, 833603, 948708, 1115287

Table 7. Sensitivity analysis for selected parameters.

$κ$	$C_{\max}$	Required Implications	Truly Detected Outliers
0.90	0.1	3 out of 4	28
0.95	0.1	3 out of 4	32
0.98	0.1	3 out of 4	3
0.95	0.1	2 out of 4	19
0.95	0.0	3 out of 4	14

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
