Article

GrImp: Granular Imputation of Missing Data for Interpretable Fuzzy Models

Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, 44-100 Gliwice, Poland
*
Author to whom correspondence should be addressed.
Axioms 2025, 14(12), 887; https://doi.org/10.3390/axioms14120887
Submission received: 6 November 2025 / Revised: 26 November 2025 / Accepted: 28 November 2025 / Published: 30 November 2025
(This article belongs to the Special Issue Advances in Fuzzy Logic and Fuzzy Implications)

Abstract

Data incompleteness is a common problem in real-life datasets. This is caused by acquisition problems, sensor failures, human errors, and so on. Missing values and their subsequent imputation can significantly affect the performance of data-driven models and can also distort the interpretability of explainable artificial intelligence (XAI) models, such as fuzzy models. This paper presents a novel imputation algorithm based on granular computing. This method benefits from the local structure of the dataset, explored using the granular approach. The method elaborates a set of granules that are then used to impute missing values in the dataset. The method is evaluated on several datasets and compared with several state-of-the-art imputation methods, both directly and indirectly. The direct evaluation compares the imputed values with the original data. The indirect evaluation compares the performance of fuzzy models built with TSK and ANNBFIS neuro-fuzzy systems. This enables not only the evaluation of the quality of numerically imputed values but also their impact on the interpretability of the constructed fuzzy models. This paper is accompanied by numerical experiments. The implementation of the method is available in a public GitHub repository.

1. Introduction

Data incompleteness is a common problem in real-life data. It can be caused by acquisition errors, random noise, impossible values, sensor failures, human error, or the merging of data from various sources. The authors of [1] presented an interesting medical example in which only one patient in 55 had all blood tests done. However, incomplete data may still hold important information. Therefore, this problem remains an active research topic.
A practical classification of missing ratios defines four intervals [2]:
  • [0%, 1%]: trivial;
  • [1%, 5%]: manageable;
  • [5%, 15%]: requires sophisticated methods;
  • more than 15% of missing values “severely impact any kind of interpretation” [2].
Generally, three approaches are used to handle the problem of missing values:
  • Marginalisation of incomplete data items [3] or incomplete data attributes [4];
  • Imputation of missing values [1,5];
  • Application of rough sets to model incompleteness [6,7,8,9].
Marginalisation is the simplest approach: it retains only complete data, removing incomplete data items or, less commonly, incomplete attributes. It is fast and easy, but it shrinks the dataset or the attribute set and may discard important information. In practice, this method is effective for low missing ratios (below 3–5%). For datasets with high missing ratios, it is not reliable: the retained dataset may be very small and may represent the underlying data distribution with bias, and in the extreme case where every tuple lacks some value, marginalisation returns an empty dataset.
The second approach is imputation, which fills in missing values. It is usually more complex and more commonly applied [10]. Many imputation techniques have been proposed (imputation with constants, averages, medians, k-nearest neighbours [11], feed-forward neural networks [12], inversion of neuro-fuzzy systems [5,13], generative adversarial imputation nets (GAINs) [14], and variational auto-encoders (VAEs) [15,16]). A comparison of imputation techniques for clustering of missing data is provided in [17]. The main problem with imputation is that it introduces non-existing values into the dataset. Some techniques may even impute values without physical meaning (e.g., a fractional number of children). Quite often, the imputed values cannot be distinguished from the original data. This distorts the data and weakens conclusions drawn from the analysis of imputed datasets. Applying the rough approach with simultaneous marginalisation and imputation retains the information in marginalised items as well as the distinction between the original and imputed data [8].
Granular computing (GrC) is a paradigm introduced by Lotfi Zadeh [18]. After many years of hibernation, GrC has re-emerged as a growing research area [19]. Today, we have at our disposal multiple techniques, algorithms, paradigms, and models that fall under the umbrella term “granular computing”. Yao proposed the triarchic theory of granular computing [20]. This theory embraces three perspectives: philosophical, which focuses on structured thinking using the meronym–holonym approach; methodological, which focuses on structured problem-solving; and computational, which focuses on information processing. Lotfi Zadeh envisioned “computation with words”, aiming to make computers operate in a human-like fashion [21,22,23].
A data granule is a crucial term in granular computing. It is a collection of entities in terms of indiscernibility, similarity, adjacency, proximity, and so on, that can be labelled with semantically rich labels [24,25,26,27,28,29,30,31,32]. The definition of an information granule is rather general and generic. This leads to diverse forms of granule representation. The form of granules is closely linked to the problem being solved. Common representations include intervals, classical sets, fuzzy sets, rough sets [33,34], interval type-2 fuzzy sets [35], soft sets [36], and if–then rules [37].
The granular approach is a starting point for the three-way decision paradigm [38,39,40]. In this paper, we focus not only on the precision of the imputed values but also on the impact of imputation on the interpretability of fuzzy models built on the imputed datasets. We use neuro-fuzzy systems because they are widely used in data modelling and are an example of explainable artificial intelligence (XAI) [27,37,38,41,42,43,44]. They build models from fuzzy rules that can be easily understood by humans [45]. Explainability and interpretability of artificial intelligence models are crucial issues nowadays. However, there is no universally accepted definition of interpretability in AI research [41,46,47,48,49,50]. The focus is instead on various aspects of interpretability, such as input domain coverage and partitioning, the number of rules, the number of fuzzy sets, the number of attributes, overlapping of fuzzy sets in premises, and distinguishability of linguistic variables [51,52,53,54,55]. Because the criteria for interpretability are numerous, some of them are mutually contradictory. The authors of [38] presented an example of such a contradiction: the overlap of fuzzy sets in the premise part of fuzzy rules should be reduced to ensure distinguishability, yet preserved to retain the interaction of rules [56].
Imputation of missing data is rarely used on its own. It is commonly employed as a preprocessing step in data analysis or data modelling. Therefore, it is crucial to consider how the imputation procedure influences subsequent modelling tasks, particularly when using interpretable models such as neuro-fuzzy systems. To the best of our knowledge, this issue has not been examined in depth in the literature. Although several granular algorithms for missing-data imputation have been proposed [57,58], they do not address the interpretability of models built on imputed data. In this article, we seek to fill this gap by proposing a granular imputation algorithm that addresses not only imputation accuracy but also the interpretability of the resulting neuro-fuzzy models.
The contribution of this paper is a novel granular algorithm for the imputation of missing values (Section 2). The granular approach is used to model the internal structure of the data. Fuzzy granules model the imprecision and uncertainty in data. Furthermore, fuzzy granules are complementary to fuzzy sets used in neuro-fuzzy systems. Thus, the fuzzy granule approach can be used for modelling missing data while preserving the interpretability of neuro-fuzzy systems built on imputed datasets.
In this paper, we use the following general rules for symbols:
  • Lowercase italics ( a ) —scalars and set elements;
  • Lowercase bold ( a ) —vectors;
  • Uppercase bold ( A ) —matrices;
  • Blackboard bold uppercase characters ( A ) —sets;
  • Uppercase italics ( A ) —cardinality of sets.

2. Materials and Methods

One of the challenges of dealing with incomplete data is preserving the underlying structure of the dataset while restoring missing values in a meaningful and interpretable way. Traditional imputation methods frequently disregard contextual dependencies and local uncertainty, leading to distorted estimates or oversimplified representations of the original data.
To address this, we propose a novel data imputation method based on granular computing. The GrImp algorithm (granular imputer), with the pseudocode shown in Algorithm 1, performs missing data imputation by analysing the structure of the complete part of the dataset, represented in the form of fuzzy information granules.
In the first step (line 2), the input dataset X is split into two disjoint subsets: the set of incomplete tuples X̄, i.e., the tuples with at least one missing attribute value, and the set of complete tuples X̲.
Next (line 4), the complete set X ̲ is used to generate a collection of fuzzy granules G using the Fuzzy C-Means (FCM) algorithm with the number of granules set to G. Each granule g is internally represented by its centre vector v g = v g 1 , , v g A and attribute-wise fuzziness vector s g = s g 1 , , s g A . If needed, other granulation techniques may be used. In this research, we use the FCM algorithm due to its linear complexity with respect to the number of data tuples, attributes, and granules.
For each incomplete tuple x ∈ X̄ (line 5), the algorithm iterates through all granules g ∈ G (line 6). In each granule, the missing attribute values in x are temporarily filled with the corresponding values from the granule’s centre v_g (line 8). This yields one temporary imputed version x̄_g per granule, so each incomplete data tuple produces exactly G imputed tuples rather than the exponential number (up to G^m for m missing attributes) that would arise if every missing attribute could be filled from any granule independently. This idea is illustrated in Figure 1.
Algorithm 1 The GrImp procedure: granular imputation of missing values.
Require: X – dataset with missing values
Require: G – number of granules
 1: procedure GrImp(X, G)                            ▹ complexity: O(X·G·A)
 2:   {X̄, X̲} ← marginalisation(X)                   ▹ X̄: incomplete data; X̲: complete data
 3:   X_imp ← X̲                                     ▹ initialise the result set
 4:   G ← FCM(X̲, G)                                 ▹ G: set of granules; FCM complexity O(X·G·A);
                                                     ▹ each granule g has centre v_g and fuzziness s_g
 5:   for all incomplete tuples x ∈ X̄ do            ▹ O(X)
 6:     for all granules g ∈ G do                    ▹ O(G)
 7:       for all missing attribute values x_a in x do        ▹ O(A)
 8:         x_a ← v_{g,a}          ▹ fill the missing value with the corresponding granule centre attribute
 9:       end for                  ▹ x̄_g is the tuple completed with values from granule g
10:       u_g(x̄_g) ← ⋆_{a=1..A} exp(−(x̄_a − v_{g,a})² / (2 s_{g,a}²))   ▹ membership of x̄_g in granule g, O(A)
11:     end for
12:     x̄ ← (Σ_{g=1..G} x̄_g · u_g(x̄_g)) / (Σ_{g=1..G} u_g(x̄_g))        ▹ aggregate over all granules, O(G·A)
13:     X_imp ← X_imp ∪ {x̄}       ▹ add the completed tuple to the result set
14:   end for
15:   return X_imp                 ▹ the final imputed dataset
16: end procedure
For granulation of the complete data, we use the Fuzzy C-Means (FCM) algorithm (line 4). FCM produces Gaussian fuzzy clusters (granules) represented by their centres and fuzziness parameters. In the next step (line 10), the membership degree of each imputed tuple x̄_g in the corresponding granule g is computed using the Gaussian similarity function. The membership degree is calculated for each attribute separately as the Gaussian similarity between the attribute value in x̄_g and the corresponding granule centre attribute v_{g,a}, scaled by the granule’s fuzziness for that attribute s_{g,a}. The per-attribute membership degrees are then aggregated over all A attributes with a t-norm ⋆ (e.g., the product) to obtain the overall membership degree u_g(x̄_g) of the imputed tuple in granule g.
For each incomplete tuple, G imputed versions x ¯ g are obtained, each associated with a membership degree u g ( x ¯ g ) reflecting how well the imputed tuple aligns with the structure of granule g. In the final step (line 12), these multiple imputed versions are aggregated into a single final imputed tuple x ¯ as a weighted average of all x ¯ g , where the weights are given by the corresponding membership degrees u g ( x ¯ g ) .
This weighted combination ensures that greater influence is given to granules that are structurally more consistent with the incomplete instance.
Such a granular approach enables flexible and context-sensitive imputation by leveraging the local structure of the complete data through fuzzy granules. Instead of relying on global statistics, GrImp uses soft (fuzzy) associations between incomplete tuples and representative granules to estimate missing values in a data-driven manner. The computed membership degrees reflect how well a tuple aligns with the structure of each granule and are used to weight the contributions of candidate imputations, providing an interpretable footprint (or trace) of the imputation process in terms of granule influence.
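To make the procedure concrete, the whole pipeline can be sketched in a few dozen lines of NumPy. This is an illustrative reconstruction, not the authors’ C++ implementation: the `fcm` and `grimp` names are ours, the attribute-wise fuzziness is estimated here as a membership-weighted standard deviation around each centre, and the product is used as the t-norm.

```python
import numpy as np

def fcm(X, G, m=2.0, iters=100, seed=0):
    """Minimal Fuzzy C-Means: returns granule centres V (G x A) and
    attribute-wise fuzziness S (G x A), estimated as the
    membership-weighted standard deviation around each centre."""
    rng = np.random.default_rng(seed)
    N, A = X.shape
    U = rng.random((N, G))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]            # update centres
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U = d2 ** (-1.0 / (m - 1.0))                      # update memberships
        U /= U.sum(axis=1, keepdims=True)
    W = U ** m
    diff2 = (X[:, None, :] - V[None, :, :]) ** 2          # (N, G, A)
    S = np.sqrt(np.einsum('ng,nga->ga', W, diff2) / W.sum(axis=0)[:, None])
    return V, np.maximum(S, 1e-6)

def grimp(X, G=3):
    """Granular imputation sketch of Algorithm 1: NaNs mark missing values."""
    miss = np.isnan(X).any(axis=1)
    V, S = fcm(X[~miss], G)                               # granulate complete tuples
    out = X.copy()
    for i in np.where(miss)[0]:
        cand = np.where(np.isnan(X[i]), V, X[i])          # one candidate tuple per granule
        u = np.exp(-(cand - V) ** 2 / (2.0 * S ** 2)).prod(axis=1)  # product t-norm
        out[i] = (u[:, None] * cand).sum(axis=0) / (u.sum() + 1e-300)
    return out
```

After imputation, the granule centres V and fuzziness S remain available, so the contribution of each granule to every completed tuple can be inspected, which is the interpretable footprint mentioned above.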
The free C++ implementation is available in a public GitHub repository www.github.com/ksiminski/neuro-fuzzy-library (accessed on 5 November 2025).

Computational Complexity

The computational complexity of the GrImp procedure can be analysed by examining its key operations. Let X = | X | denote the number of tuples in the dataset X , A the number of attributes, and G the number of granules:
  • Marginalisation (line 2) requires scanning the entire dataset to separate complete and incomplete tuples. Its complexity is O ( X · A ) .
  • Fuzzy granulation (line 4), performed using Fuzzy C-Means (FCM) clustering on the complete data X ̲ , has complexity O ( X · G · A ) .
  • The main loop (lines 5–14) iterates over all incomplete tuples X ¯ , each processed independently:
    For each tuple x ∈ X ¯ , the algorithm iterates over all granules g ∈ G .
    For each granule, missing attribute values are substituted (line 8), which requires O ( A ) operations.
    The membership degree u_g ( x̄_g ) (line 10) is computed as the t-norm (e.g., product) of Gaussian similarity values over all attributes, again O ( A ) .
    The final aggregation (line 12) involves a weighted sum over G granules, each involving a vector of length A, which yields O ( G · A ) .
    Overall, the per-tuple imputation cost is O ( G · A ) , and the full loop over X ¯ (incomplete tuples) has complexity O ( | X ¯ | · G · A ) ⊆ O ( X · G · A ) .
By summing all contributions, the total computational complexity of the GrImp algorithm is O ( X · G · A ) . This makes GrImp linearly scalable with respect to dataset size, assuming G is treated as a fixed hyperparameter. Compared to the kNN-based methods with quadratic complexity in X, this provides a significant computational advantage for large-scale datasets.

3. Experiments

The experimental evaluation aimed to assess the accuracy and robustness of the proposed granular imputation method GrImp under varying missing ratios. The method was evaluated and compared using direct (Section 3.2.3) and indirect (Section 3.2.4) evaluation with the root mean square error E RMSE on multiple datasets (Section 3.1). Statistical significance was assessed using the Friedman test and Nemenyi post hoc analysis (Section 3.2.6).

3.1. Datasets

To ensure a robust and comprehensive evaluation of the proposed imputation method, experiments were conducted on diverse datasets with varying properties. The selected datasets differ significantly in size, dimensionality, domain of origin, and statistical characteristics. This diversity enables testing the behaviour of the algorithm under a wide spectrum of practical conditions. Essential information (number of items and number of attributes) about the datasets is presented in Table 1. The real-world datasets used in the experiments were as follows:
  • The ‘beijing’ dataset is a large-scale dataset of hourly air pollution measurements recorded by environmental monitoring stations in Beijing, China. It contains various pollutant indicators, including PM2.5 and NO2 levels [59].
  • The ‘bias-min’ dataset contains numerical weather prediction meteorological forecast data and two in situ observations over Seoul, South Korea, in the summer [60].
  • The ‘box’ is a classical dataset that describes the concentration of carbon dioxide in a gas furnace [61].
  • The ‘carbon’ dataset contains initial and calculated atomic coordinates of carbon nanotubes [62].
  • The ‘concrete’ dataset is a well-known regression dataset that describes the compressive strength of concrete samples based on their composition [63].
  • The ‘CO2’ dataset contains real-world measurements of some air parameters in a pump deep shaft in a Polish coal mine [64].
  • The ‘methane’ dataset contains real-world measurements of air parameters in a coal mine in Upper Silesia, Poland [65].
  • The ‘power’ dataset contains data points collected from a Combined-Cycle Power Plant over 6 years [66,67].
  • The ‘wankara’ dataset contains the weather information of Ankara with the goal of predicting the mean temperature [68].

3.2. Methodology

The goal of the experimental procedure was to assess the effectiveness of the proposed granular imputation method GrImp in comparison to existing imputation techniques and marginalisation. The evaluation was based on a controlled data degradation and reconstruction process, enabling direct measurement of reconstruction error under various missing ratios.
Each of the datasets described in Section 3.1 was subjected to a systematic imputation workflow, consisting of three main phases: artificial data degradation, imputation of missing values using different techniques, and quantitative evaluation of reconstruction accuracy. In addition, statistical analyses were performed to assess the significance of observed differences.

3.2.1. Missing Data Simulation

To simulate the missing ratio, each dataset was artificially degraded by randomly masking individual attribute values following the missing completely at random (MCAR) paradigm. Each attribute value had an equal probability of being removed, regardless of its magnitude or position in the dataset. The exact missing ratios used in the direct and indirect evaluations are specified in Section 3.2.3 and Section 3.2.4.
To mitigate the influence of randomness and ensure reproducibility, each missing-ratio scenario was repeated 10 times using different random seeds. The final reported results were averaged over these 10 runs. Moreover, the same incomplete versions of the datasets across all compared imputation methods were used to guarantee fairness in evaluation.
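The degradation step described above can be sketched as follows. This is an illustrative NumPy fragment (the `degrade_mcar` name is ours, not from the published implementation); reusing the same seed reproduces the same mask for every compared method, which is how fairness across imputers is guaranteed.

```python
import numpy as np

def degrade_mcar(X, missing_ratio, seed):
    """Mask attribute values completely at random (MCAR):
    every cell has the same probability of being removed,
    regardless of its magnitude or position."""
    rng = np.random.default_rng(seed)
    Xd = X.astype(float).copy()
    mask = rng.random(X.shape) < missing_ratio   # True = value removed
    Xd[mask] = np.nan
    return Xd, mask
```

Calling this once per seed and feeding the identical degraded copy to all imputers keeps the comparison paired, so the later rank-based statistics operate on matched configurations.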

3.2.2. Imputation Strategies

The following approaches were compared:
1. Mean imputation—missing values were replaced with the mean of the corresponding attribute, computed over the complete data.
2. Median imputation—similar to the mean strategy, but using the median instead. This guarantees that only existing values are used for imputation.
3. kNN-average—imputation was based on the average attribute values among the k = 5 nearest neighbours of the incomplete instance, using Euclidean distance over the available attributes.
4. kNN-median—same as above, but using the median instead of the average across neighbours.
5. GrImp (granular imputation)—the method described in Section 2, based on fuzzy granules constructed from complete data and used to estimate missing values in an interpretable and structurally consistent manner.
For the GrImp method, we analysed the influence of the number of granules (G) used in the granulation step (line 4 in Algorithm 1). Specifically, values of G ∈ {2, 3, 5, 10, 15, 25} were tested for each dataset and missing ratio. This enabled investigating how the resolution of data modelling—controlled by the number of fuzzy granules—affects the quality of missing-value estimation.
Additionally, marginalisation was included as a baseline. Incomplete records were removed from the dataset, i.e., only complete tuples were retained. While this is not an imputation method per se, it serves as a reference for how much information is lost when simply discarding missing data.
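The non-granular baselines above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under stated assumptions, not the code used in the experiments: the function names are ours, neighbours are restricted to fully complete tuples, and distances are computed only over the attributes observed in the incomplete instance.

```python
import numpy as np

def impute_mean(X):
    """Replace each NaN with the mean of the observed values in its column."""
    out = X.copy()
    col = np.nanmean(X, axis=0)
    idx = np.where(np.isnan(out))
    out[idx] = np.take(col, idx[1])
    return out

def impute_knn(X, k=5, stat=np.mean):
    """kNN imputation: Euclidean distance over the attributes observed in the
    incomplete row; missing entries filled with `stat` (mean or median)
    of the k nearest complete rows."""
    out = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        obs = ~np.isnan(X[i])
        d = np.sqrt(((complete[:, obs] - X[i, obs]) ** 2).sum(axis=1))
        nn = complete[np.argsort(d)[:k]]          # k nearest complete tuples
        out[i, ~obs] = stat(nn[:, ~obs], axis=0)  # fill only the missing cells
    return out
```

Passing `stat=np.median` gives the kNN-median variant; median imputation is `impute_mean` with `np.nanmedian` swapped in.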

3.2.3. Methodology of the Direct Evaluation

To evaluate imputation accuracy, each dataset was first artificially degraded by randomly removing attribute values (MCAR mechanism) with specified missing ratios: 1%, 2%, 3%, 4%, 5%, 10%, 15%, 25%, 30%, 35%, 40%, 45%, and 50% for the direct evaluation. This resulted in a set of incomplete datasets with known ground-truth values for the missing entries.
The complete dataset is represented as the matrix $\mathbf{A} = [a_{rc}]_{R \times C}$, with R rows (each row represents one data item) and C columns (each column represents an attribute). It was artificially degraded to obtain an incomplete dataset, which was then imputed. The imputed dataset is represented as the matrix $\mathbf{B} = [b_{rc}]_{R \times C}$. If the matrices $\mathbf{A}$ and $\mathbf{B}$ are equal, the imputation is perfect. The quality of imputation was assessed with the Frobenius norm $\|\mathbf{A} - \mathbf{B}\|_F$ of the difference between the original dataset $\mathbf{A}$ and the imputed dataset $\mathbf{B}$:
$$ F(\mathbf{A}, \mathbf{B}) = \|\mathbf{A} - \mathbf{B}\|_F = \sqrt{\sum_{r=1}^{R} \sum_{c=1}^{C} \left( a_{rc} - b_{rc} \right)^2}. $$
If the imputation is perfect and the missing values are correctly restored, then $\mathbf{A} = \mathbf{B}$ and $F(\mathbf{A}, \mathbf{B}) = 0$. However, the value of F depends on the size of the dataset (the number of rows R and columns C). This is why we introduce the second measure $F_{\mathrm{RMSE}}$, which normalises F by the number of elements RC in the dataset:
$$ F_{\mathrm{RMSE}}(\mathbf{A}, \mathbf{B}) = \frac{1}{\sqrt{RC}} \|\mathbf{A} - \mathbf{B}\|_F = \sqrt{\frac{1}{RC} \sum_{r=1}^{R} \sum_{c=1}^{C} \left( a_{rc} - b_{rc} \right)^2}. $$
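As an illustration, both measures can be computed directly from the two matrices; a minimal NumPy sketch (the function names are ours):

```python
import numpy as np

def frobenius_error(A, B):
    """F(A, B): Frobenius norm of the difference between the original
    matrix A and the imputed matrix B."""
    return float(np.sqrt(((A - B) ** 2).sum()))

def frobenius_rmse(A, B):
    """F_RMSE(A, B): F normalised by the square root of the number
    of matrix elements R*C."""
    return frobenius_error(A, B) / np.sqrt(A.size)
```

Perfect imputation gives F = F_RMSE = 0; the normalised variant allows comparison across datasets of different sizes.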

3.2.4. Methodology of the Indirect Evaluation

Data imputation is seldom used in isolation. It is not a goal in itself but frequently a preparatory step for further data analysis or modelling. Therefore, it is crucial to evaluate how well the imputed data support subsequent tasks, e.g., classification or regression. For the indirect evaluation, the datasets were degraded using a predefined set of missing ratios: 1%, 2%, 3%, 4%, 5%, 10%, 15%, 20%, 25%, and 30%. In our experiments, we assessed the impact of imputation on the performance of fuzzy systems for the regression task. The performance was evaluated using the E RMSE metric on a separate test set, which was not used during the imputation or training phases. Thus, imputation quality was evaluated indirectly by measuring how well the imputed data supports the learning of fuzzy systems. This approach enables assessing the practical utility of the imputation method in real-world scenarios, where the ultimate goal is often to build predictive models rather than simply fill in missing values. The motivation for using neuro-fuzzy systems is the interpretability of the resulting fuzzy models. We discuss this issue in Section 4.2. We use two very popular neuro-fuzzy system architectures: TSK [69,70] and ANNBFIS [71]. Both neuro-fuzzy systems use Gaussian fuzzy sets in the premises of the fuzzy rules.
In this approach, each dataset is split into training and test subsets. The training dataset is artificially degraded and then processed (either marginalised or imputed with all strategies mentioned above). The processed dataset is used to train a neuro-fuzzy system.
The complete (untouched) test dataset is used to evaluate the generalisation ability of the resulting fuzzy model. The task is regression modelling, and the expected output is a continuous value. The quality of the fuzzy model depends on the difference between the expected output y i and the output y i ^ produced by the model. To quantify the quality of the developed fuzzy model, the commonly used root mean square error E RMSE is calculated as
$$ E_{\mathrm{RMSE}} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}, $$
where y i is the expected value and y ^ i is the value produced by the fuzzy model.

3.2.5. Reproducibility and Fairness

To ensure reproducibility and fair comparison, all imputation methods were applied to identically degraded datasets (i.e., using the same random seeds for missing-value removal across all methods). The entire experimental pipeline was fully automated to ensure consistency and reduce the possibility of human-induced variation.

3.2.6. Statistical Verification

In order to assess the statistical significance of the observed differences between imputation methods, we employed a two-stage statistical procedure inspired by best practices in comparative machine learning analysis:
  • Friedman test—a non-parametric test used to determine whether there are statistically significant differences between the methods across all datasets and missing ratios. The Friedman test ranks each method for each configuration and compares the average ranks.
  • Nemenyi post hoc test—if the Friedman test revealed significant differences (significance level α = 0.05 ), we applied Nemenyi post hoc pairwise comparisons to identify which pairs of methods differed significantly.
In addition to statistical significance, we also evaluated the relative effectiveness of each method by counting how often each method outperformed the others in terms of E RMSE and by measuring the average improvement margin. This provides a more interpretable measure of practical advantage beyond raw p-values.
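The ranking stage of this procedure can be sketched with NumPy alone. This is a hedged illustration: the `friedman_ranks` helper is ours, and it assumes no tied errors within a configuration.

```python
import numpy as np

def friedman_ranks(errors):
    """errors: (n_configs, n_methods) matrix of E_RMSE values, one row per
    dataset/missing-ratio configuration. Returns per-method average ranks
    (rank 1 = lowest error), win counts, and the Friedman statistic, which
    under the null hypothesis follows a chi-square distribution with
    n_methods - 1 degrees of freedom."""
    n, k = errors.shape
    ranks = errors.argsort(axis=1).argsort(axis=1) + 1.0   # assumes no ties
    avg_rank = ranks.mean(axis=0)
    wins = (ranks == 1).sum(axis=0)                        # how often each method is best
    chi2 = 12.0 * n / (k * (k + 1)) * ((avg_rank - (k + 1) / 2.0) ** 2).sum()
    return avg_rank, wins, chi2
```

For k = 3 methods, the 5% critical value of the chi-square distribution with 2 degrees of freedom is about 5.99; if the statistic exceeds it, Nemenyi pairwise comparisons of the average ranks can follow (the full post hoc test additionally requires studentized-range quantiles).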

4. Results and Discussion

The experimental results are reported in two sections: direct evaluation of imputation accuracy (Section 4.1) and indirect evaluation assessing regression performance and the interpretability of fuzzy models trained on imputed datasets (Section 4.2).

4.1. Results of the Direct Evaluation

Figure 2a presents two typical plots of the impact of the number of granules G on the imputation error E RMSE . There appears to be an optimal number of granules. In these two cases, it is 10–20 granules. A greater number of granules does not significantly reduce the imputation error but increases the imputation time. This is the elbow-shaped behaviour explained in Section 4.3.
Figure 2b shows the impact of the missing ratio on the quality of imputation. It is in agreement with the very intuitive relationship that the higher the missing ratio, the higher the imputation error.
Table 2, Table 3 and Table 4 present the results of the direct evaluation of imputation for three datasets: ‘beijing’, ‘carbon’, and ‘power’, respectively. The results are shown for different missing ratios for each dataset: 10% for ‘beijing’, 20% for ‘carbon’, and 30% for ‘power’. In all these examples, the GrImp method outperformed the other imputation strategies, achieving the lowest Frobenius norm (F) and Frobenius root mean square error ( F RMSE ) values.
To complement the error-based evaluation, Table 5 reports the wall-clock execution times obtained for all imputers on the largest dataset, ‘beijing’, with a missing ratio of m = 0.20 . A very clear difference in computational cost is observed between the kNN imputers and the remaining methods. The classical kNN variants required approximately 67–68 s to complete the imputation, whereas the proposed granular imputer finished within 0.13–1.05 s, depending on the number of granules. This means that even the largest granular configuration ( G = 30 ) was nearly two orders of magnitude faster than kNN, while small and medium granularities achieved speed-ups exceeding a factor of 500. These results highlight the substantial computational advantage of the granular approach, whose runtime scales much more favourably with dataset size.

4.2. Results of the Indirect Evaluation

The effectiveness of each imputation method was evaluated using the E RMSE metric on artificially degraded datasets, as described in Section 3.2. To better illustrate the comparative performance, the results were visualised using heatmaps that highlight the median E RMSE differences between the imputation methods under various configurations.
Figure 3 presents selected examples of these heatmaps, each corresponding to a specific dataset, imputation scenario, and fuzzy system used in subsequent modelling. These heatmaps summarise the relative ranking of methods based on the median E RMSE , computed across 10 random repetitions for each missing ratio. Positive values indicate that the method on the vertical axis outperformed the method on the horizontal axis.
Across the experiments, the proposed granular imputation method GrImp consistently achieved comparable or superior reconstruction performance compared to standard imputation techniques, particularly under moderate-to-high missing ratios (10–30%). In many cases, GrImp outperformed even the widely used kNN-based approaches. Notably, the optimal number of granules depended on the specific characteristics of the dataset—in particular, its size and the number of attributes—suggesting that model resolution should be adapted to the complexity of the data.
The advantages of the granular approach are particularly evident in Figure 3c, which presents the results for the ‘beijing’ dataset at a missing ratio of 30%. When the number of granules was set to 5 or 10, the GrImp method significantly outperformed both kNN-based imputations (mean and median variants). This result is especially noteworthy given that kNN imputations typically scale poorly with dataset size due to their quadratic time complexity, while GrImp offers a linear-time alternative when granule membership calculations are efficiently implemented.
Moreover, the experimental results indicate that the number of granules (G) significantly affects imputation performance. For several datasets, noticeable changes in E RMSE were observed as G increased, suggesting that both under- and over-granulation can affect reconstruction quality. Using too few granules may lead to an oversimplified representation of the data that fails to capture local patterns and variability—an effect especially visible in high-dimensional datasets such as ‘bias-min’. Conversely, a high number of granules (e.g., G = 25 ) sometimes resulted in increased variability across repetitions, particularly in smaller datasets like ‘box’ and ‘methane’. This may reflect overfitting to noise or local fluctuations in the incomplete samples. However, we did not observe numerical instability in terms of convergence or execution errors; all experiments completed successfully.
Across most datasets and configurations, intermediate values of G (typically 5 or 10) provided a favourable balance between reconstruction accuracy and stability. These settings consistently achieved low E RMSE values while avoiding the performance degradation and increased variance seen at the extremes of granule count.
Furthermore, the optimal value of G was found to be dataset-dependent and was particularly sensitive to the number of attributes (dimensionality) and records (cardinality). In general, larger datasets with more features tended to benefit from a higher number of granules, while smaller datasets achieved better results with fewer granules. This observation provides a practical guideline: the level of granularity should be adapted to the scale and complexity of the dataset to avoid underfitting or overfitting during the imputation process.
Practically, GrImp offers not only competitive reconstruction quality but also a structured and transparent imputation mechanism. Each imputed value is obtained through a weighted combination of locally relevant granules, whose centres and fuzziness parameters are explicitly available. This transparency improves interpretability by revealing which granules contributed to each completion and to what extent.
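The weighted-combination mechanism described above can be sketched in a few lines. The following is a minimal illustration only, not the exact GrImp implementation from the repository: the function and variable names are ours, and FCM-style memberships computed over the observed attributes are assumed.

```python
import numpy as np

def granular_impute(x, centres, m=2.0):
    """Impute missing entries (NaN) of sample x as a membership-weighted
    combination of granule centres. Memberships are FCM-style, computed
    on the observed attributes only; m is the fuzzifier exponent."""
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(x)                       # mask of observed attributes
    # distances to each granule centre over the observed attributes only
    d = np.linalg.norm(centres[:, obs] - x[obs], axis=1)
    d = np.maximum(d, 1e-12)                 # guard against division by zero
    # FCM membership of the incomplete sample in each granule
    u = 1.0 / np.sum((d[:, None] / d[None, :]) ** (2.0 / (m - 1.0)), axis=1)
    filled = x.copy()
    # weighted combination of granule centres for the missing attributes
    filled[~obs] = u @ centres[:, ~obs]
    return filled

# two illustrative granule centres in a 2D attribute space
centres = np.array([[0.0, 0.0],
                    [10.0, 10.0]])
x_hat = granular_impute([5.0, np.nan], centres)
```

In this toy case the incomplete sample lies halfway between the two centres, so both memberships equal 0.5 and the missing attribute is completed as 5.0, the membership-weighted average of the centres. The granule centres and memberships used for each completion remain inspectable, which is the transparency property discussed above.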

Interpretability of Fuzzy Models

The interpretability of fuzzy models trained on imputed datasets is a relevant and important aspect of explainable artificial intelligence (XAI). This section compares the models trained on datasets imputed with different methods.
Figure 4 presents the premise parts of the TSK fuzzy models constructed for the ‘power’ dataset with 5% of missing values. This dataset contains four input attributes and one output attribute. TSK neuro-fuzzy systems have Gaussian fuzzy sets in the premises of the rules. Each subplot in Figure 4 for x 1 , …, x 4 presents the Gaussian fuzzy sets in the input domain for attributes 1 to 4, respectively. The model in Figure 4a was trained on the dataset imputed with the kNN average imputer, while the model in Figure 4b was trained on the dataset imputed with the granular imputer ( G = 2 ). The model trained on the kNN-imputed dataset has poorer interpretability: the premise fuzzy sets are not well separated, and some are excessively wide and thus carry limited semantic information. This behaviour is observed across all four attributes. In contrast, the model trained on the dataset imputed with the granular imputer has well-separated premises; its fuzzy sets are narrower, more informative, and semantically clearer. To quantify interpretability, we use the interpretability index I fspe proposed for fuzzy systems with Gaussian fuzzy sets in the premises [46,56]; higher values of I fspe indicate better interpretability of the fuzzy model. For the model trained on the dataset imputed with the kNN average imputer, the interpretability index is I fspe = 0.849 , while for the model trained on the dataset imputed with the granular imputer, it is I fspe = 0.892 .
The same phenomenon can be observed in Figure 5. The figure presents the Gaussian fuzzy sets in the premises of the ANNBFIS fuzzy models trained on the ‘wankara’ dataset with 1% of missing values. The model in Figure 5a was trained on the dataset imputed with the average imputer, while the model in Figure 5b was trained on the dataset imputed with the granular imputer with G = 10 granules. Again, the model trained on the dataset imputed with the average imputer ( I fspe = 0.723 ) has poorer interpretability than the model trained on the dataset imputed with the granular imputer ( I fspe = 0.836 ).
Figure 6 presents the premises of the ANNBFIS fuzzy models trained on the ‘methane’ dataset with 2% of missing values and imputed with the granular imputer. The model in Figure 6a was trained on the dataset imputed with G = 2 granules, while the model in Figure 6b was trained on the dataset imputed with G = 10 granules. The model trained on the dataset imputed with G = 10 granules has better interpretability ( I fspe = 0.833 ): the premises are better separated, and for each attribute, each pair of fuzzy sets has at most one crossing point. This is one of the interpretability criteria for fuzzy models and makes it easier to assign linguistic labels to fuzzy sets. The premises for the dataset imputed with G = 2 granules are less structured, overlap significantly, and are more difficult to interpret, which makes semantic labelling more tedious ( I fspe = 0.679 ).
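The at-most-one-crossing criterion has a simple analytical form for Gaussian fuzzy sets. Equating two memberships exp(−(x − c)² / (2s²)) and taking logarithms yields a quadratic equation in x whose discriminant works out to s1² s2² (c1 − c2)², which is strictly positive whenever the centres differ; hence two Gaussian sets with distinct widths and distinct centres always cross twice, and the criterion holds only when the widths are (approximately) equal. The following sketch is our own illustration of this observation, not code from the paper.

```python
import math

def gaussian_crossings(c1, s1, c2, s2):
    """Number of isolated points where two Gaussian membership functions
    exp(-(x - c)^2 / (2 s^2)) take equal values (widths s1, s2 > 0).
    Derived from the quadratic s2^2 (x - c1)^2 = s1^2 (x - c2)^2."""
    if math.isclose(s1, s2):
        if math.isclose(c1, c2):
            return 0          # identical functions: no isolated crossing
        return 1              # equal widths: linear equation, one crossing
    if math.isclose(c1, c2):
        return 1              # common centre: tangency at x = c1
    return 2                  # distinct widths and centres: always two crossings
```

A practical reading: a pair of premise fuzzy sets on one attribute satisfies the single-crossing criterion only when their widths match, which is why the well-separated, similarly sized sets in Figure 6b are easier to label linguistically.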
However, it should be clearly stated that it is not always the case that a higher number of granules leads to better interpretability of the fuzzy model. This behaviour depends on the characteristics of the dataset. Providing a universal recommendation for the number of granules for the granular imputer is challenging. This requires further research.

4.3. Number of Granules

While the choice of the number of granules is inherently a complex, data-dependent problem that needs further investigation, our experiments provide some preliminary size-based guidelines. Across all studied cases, the proposed GrImp method achieved its largest gains for missing ratios between 10% and 30%, but it remained competitive outside this range. More importantly, the dominant factor influencing the optimal number of granules was the number of data items rather than the missing ratio itself.
For datasets with approximately 200 ≤ X ≤ 1500 data items, the optimal number of granules was typically found in the range G ∈ [5, 10], which offered the best results among the tested values for the ‘box’, ‘concrete’, and ‘methane’ datasets. For medium-sized datasets with 1500 ≤ X ≤ 3000 items, the best-performing configurations corresponded to G ∈ [10, 20]; the evaluated datasets in this range were ‘CO2’ and ‘wankara’. For the largest evaluated datasets (‘beijing’, ‘bias-min’, ‘carbon’, and ‘power’) with X ≥ 3000 items, effective values were observed in the range G ∈ [15, 25].
Across all regimes, we observed a characteristic elbow point beyond which further increasing the number of granules G did not result in a significant improvement in performance on validation data (Figure 2a). Practitioners are therefore advised to treat these intervals as a starting point and to refine G for their specific datasets.
As a compact rule of thumb, practitioners may use the following empirical approximation for the number of granules G derived on the basis of the observed trends (Figure 7):
G(X) ≈ max{ 5, 8 + 8 log10(X / 200) },
where X denotes the number of data items in a dataset. This formula is not intended as a universal prescription but rather as a coarse, data-driven approximation summarising the behaviour observed across the benchmark datasets. It provides a convenient initial guess for G that can subsequently be refined through validation on a specific dataset.
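The rule of thumb above can be transcribed directly into code; the function name and the rounding to the nearest integer below are our own choices, and the result should be treated only as an initial guess to be refined on a specific dataset.

```python
import math

def suggested_granules(n_items):
    """Empirical starting value for the number of granules G:
    G(X) ~ max{5, 8 + 8 * log10(X / 200)}, rounded to the nearest integer."""
    return max(5, round(8 + 8 * math.log10(n_items / 200)))

# initial guesses for a few dataset sizes in the spirit of Table 1
for n in (290, 1608, 9568, 41757):
    print(n, "->", suggested_granules(n))
```

For example, a dataset of exactly 200 items yields G = 8, and any dataset small enough to push the logarithmic term below 5 is clamped to the minimum of 5 granules.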

4.4. Key Findings

Based on the results obtained in the direct evaluation setting (Section 4.1) and the indirect evaluation via predictive modelling for regression (Section 4.2), several consistent and dataset-independent patterns were observed across all experiments:
  • Dependence on dataset size. The optimal number of granules G is primarily determined by the number of data items in the dataset. Larger datasets consistently benefit from higher granularity, whereas smaller datasets require more conservative settings to avoid fragmentation effects.
  • Overgranulation and the elbow effect. Excessively large values of G lead to decreased stability of the reconstructed values, manifesting as an elbow point beyond which variance increases and reconstruction quality deteriorates.
  • Superior performance of GrImp. Across all tested missing-data ratios, GrImp outperformed classical statistical imputers (mean and median) as well as the kNN-based baselines, demonstrating consistently lower reconstruction error.
  • Largest gains at moderate missing ratios. The most substantial improvements for GrImp over the reference methods were observed for the missing ratio in the range of 10–30%, although the method remained competitive outside this interval.
  • Stability across repetitions. The method exhibited low sensitivity to random initialisation and sampling variation, producing stable results across repeated runs in both direct and indirect evaluation settings.

5. Conclusions

The experiments conducted in this study demonstrate that the proposed granular imputation method (GrImp) provides a competitive and often superior alternative to conventional imputation techniques such as mean, median, and kNN-based methods. In both direct and indirect evaluation settings, GrImp achieves lower or comparable imputation errors ( E RMSE and F RMSE ) across a wide range of missing ratios and datasets. Its advantages are most pronounced for moderate to high levels of incompleteness (10–30%), where traditional statistical imputers tend to lose precision and kNN-based methods suffer from scalability issues.
A key finding is that the number of granules G used in the imputation process plays a crucial role in balancing accuracy, stability, and interpretability. Intermediate values of G (typically between 5 and 10) were found to offer the most favourable trade-off, yielding consistently low imputation errors while maintaining stable reconstruction performance across repetitions. Excessive granularity ( G > 20 ) does not significantly improve results and, in some cases, leads to minor degradation in accuracy, particularly for small datasets. Conversely, too few granules cause oversimplification of data representation, which reduces reconstruction fidelity in high-dimensional settings.
Importantly, the granular approach offers interpretability benefits that go beyond numerical performance. Fuzzy models trained on datasets imputed with GrImp exhibit clearer, better-separated premises and narrower, more semantically meaningful fuzzy sets. This property supports explainability—a desirable feature in modern data-driven systems—by preserving the structural coherence of the reconstructed data and improving the transparency of subsequent fuzzy modelling.
From a methodological standpoint, the experiments confirm that the performance of imputation methods should not be judged solely by reconstruction error. Indirect evaluation through regression tasks reveals that high-quality imputation translates into improved predictive accuracy of fuzzy systems, validating the practical utility of the granular approach in real-world analytical pipelines.
Finally, it must be emphasised that the optimal configuration of the granular imputer is data-dependent. The number of granules should be adapted to the dataset’s size and dimensionality, ensuring that the imputer neither underfits nor overfits the data structure.
Future work is therefore planned to address adaptive strategies for automatic granule-number selection and investigate the interplay between granularity and interpretability in broader classes of neuro-fuzzy models. An interesting issue worth investigating is the granulation algorithm. In our approach, we use the FCM method, but other algorithms may be considered, for instance, an algorithm robust to outliers that may distort the imputation of missing values. One more point of interest is the impact of t-norms used in the GrImp algorithm on the quality of imputation.

Author Contributions

Conceptualisation, K.S. and K.W.; methodology, K.S. and K.W.; software, K.S. and K.W.; validation, K.S. and K.W.; formal analysis, K.S. and K.W.; investigation, K.S. and K.W.; writing—original draft preparation, K.S. and K.W.; writing—review and editing, K.S. and K.W.; visualisation, K.S. and K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Silesian University of Technology, Poland.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this research are publicly available in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php, accessed on 5 November 2025) and the Kaggle platform (https://www.kaggle.com/, accessed on 5 November 2025). The C++ implementation of the proposed granular imputation method GrImp is available in a public GitHub repository www.github.com/ksiminski/neuro-fuzzy-library (accessed on 5 November 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GrImp      Granular imputer
TSK        Takagi–Sugeno–Kang neuro-fuzzy system architecture
ANNBFIS    Artificial Neural Network-Based Fuzzy Inference System

References

  1. Renz, C.; Rajapakse, J.C.; Razvi, K.; Liang, S.K.C. Ovarian cancer classification with missing data. In Proceedings of the 9th International Conference on Neural Information Processing, ICONIP’02, Singapore, 18–22 November 2002; Volume 2, pp. 809–813.
  2. Acuña, E.; Rodriguez, C. The treatment of missing values and its effect in the classifier accuracy. In Classification, Clustering and Data Mining Applications; Banks, D., House, L., McMorris, F.R., Arabie, P., Gaul, W., Eds.; Springer: Berlin/Heidelberg, Germany, 2004; pp. 639–648.
  3. Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R.B. Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17, 520–525.
  4. Cooke, M.; Green, P.; Josifovski, L.; Vizinho, A. Robust automatic speech recognition with missing and unreliable acoustic data. Speech Commun. 2001, 34, 267–285.
  5. Siminski, K. Imputation of missing values by inversion of fuzzy neuro-system. In Man-Machine Interactions 4; Springer International Publishing: Cham, Switzerland, 2016; pp. 573–582.
  6. Grzymała-Busse, J. A rough set approach to data with missing attribute values. In Rough Sets and Knowledge Technology; Lecture Notes in Computer Science; Wang, G., Peters, J., Skowron, A., Yao, Y., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; Volume 4062, pp. 58–67.
  7. Nowicki, R. Rough-neuro-fuzzy system with MICOG defuzzification. In Proceedings of the 2006 IEEE International Conference on Fuzzy Systems, Vancouver, BC, Canada, 16–21 July 2006; pp. 1958–1965.
  8. Siminski, K. Clustering with missing values. Fundam. Informaticae 2013, 123, 331–350.
  9. Siminski, K. Rough subspace neuro-fuzzy system. Fuzzy Sets Syst. 2015, 269, 30–46.
  10. Himmelspach, L.; Conrad, S. Fuzzy clustering of incomplete data based on cluster dispersion. In Computational Intelligence for Knowledge-Based Systems Design; Lecture Notes in Computer Science; Hüllermeier, E., Kruse, R., Hoffmann, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6178, pp. 59–68.
  11. Batista, G.E.; Monard, M.C. A study of k-nearest neighbour as an imputation method. Front. Artif. Intell. Appl. 2002, 87, 251–260.
  12. Gupta, A.; Lam, M.S. Estimating missing values using neural networks. J. Oper. Res. Soc. 1996, 47, 229–238.
  13. Siminski, K. Ridders algorithm in approximate inversion of fuzzy model with parameterized consequences. Expert Syst. Appl. 2016, 51, 276–285.
  14. Yoon, J.; Jordon, J.; van der Schaar, M. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, Stockholm, Sweden, 10–15 July 2018; pp. 1–10.
  15. McCoy, J.T.; Kroon, S.; Auret, L. Variational autoencoders for missing data imputation with application to a simulated milling circuit. In Proceedings of the 5th IFAC Workshop on Mining, Mineral and Metal Processing, Shanghai, China, 23–25 August 2018; pp. 141–146.
  16. Qiu, Y.L.; Zheng, H.; Gevaert, O. Genomic data imputation with variational auto-encoders. GigaScience 2020, 9, 1–12.
  17. Matyja, A.; Siminski, K. Comparison of algorithms for clustering incomplete data. Found. Comput. Decis. Sci. 2014, 39, 107–127.
  18. Zadeh, L.A. Fuzzy sets and information granularity. In Advances in Fuzzy Set Theory and Applications; Gupta, N., Ragade, R., Yager, R., Eds.; North-Holland Publishing Co.: Amsterdam, The Netherlands, 1979; pp. 3–18.
  19. Yao, Y. The art of granular computing. In Rough Sets and Intelligent Systems Paradigms; Kryszkiewicz, M., Peters, J.F., Rybinski, H., Skowron, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 101–112.
  20. Yao, Y. A triarchic theory of granular computing. Granul. Comput. 2016, 1, 145–157.
  21. Pedrycz, W.; Succi, G.; Sillitti, A.; Iljazi, J. Data description: A general framework of information granules. Knowl.-Based Syst. 2015, 80, 98–108.
  22. Piegat, A.; Pluciński, M. Computing with words with the use of inverse RDM models of membership functions. Int. J. Appl. Math. Comput. Sci. 2015, 25, 675–688.
  23. Zadeh, L.A. From computing with numbers to computing with words–From manipulation of measurements to manipulation of perceptions. Int. J. Appl. Math. Comput. Sci. 2002, 12, 307–324.
  24. Ciucci, D. Orthopairs and granular computing. Granul. Comput. 2016, 1, 159–170.
  25. Pedrycz, W. Granular Computing: Analysis and Design of Intelligent Systems; CRC Press: Boca Raton, FL, USA, 2013.
  26. Shifei, D.; Li, X.; Hong, Z.; Liwen, Z. Research and progress of cluster algorithms based on granular computing. Int. J. Digit. Content Technol. Its Appl. 2010, 4, 96–104.
  27. Siminski, K. An outlier-robust neuro-fuzzy system for classification and regression. Int. J. Appl. Math. Comput. Sci. 2021, 31, 303–319.
  28. Siminski, K. Prototype based granular neuro-fuzzy system for regression task. Fuzzy Sets Syst. 2022, 449, 56–78.
  29. Suchy, D.; Siminski, K. GrDBSCAN: A granular density-based clustering algorithm. Int. J. Appl. Math. Comput. Sci. 2023, 33, 297–312.
  30. Yao, Y.; Zhong, N. Granular computing. In Wiley Encyclopedia of Computer Science and Engineering; Wah, B.W., Ed.; Wiley: Hoboken, NJ, USA, 2007.
  31. Yao, Y. Granular computing: Past, present and future. In Proceedings of the 2008 IEEE International Conference on Granular Computing, GrC 2008, Hangzhou, China, 26–28 August 2008; pp. 80–85.
  32. Yao, Y. Three-way decision and granular computing. Int. J. Approx. Reason. 2018, 103, 107–123.
  33. Pieta, P.; Szmuc, T. Applications of rough sets in big data analysis: An overview. Int. J. Appl. Math. Comput. Sci. 2021, 31, 659–683.
  34. Skowron, A.; Jankowski, A.; Dutta, S. Interactive granular computing. Granul. Comput. 2016, 1, 95–113.
  35. Pedrycz, A.; Hirota, K.; Pedrycz, W.; Dong, F. Granular representation and granular computing with fuzzy sets. Fuzzy Sets Syst. 2012, 203, 17–32.
  36. Xia, S.; Chen, L.; Liu, S.; Yang, H. A new method for decision making problems with redundant and incomplete information based on incomplete soft sets: From crisp to fuzzy. Int. J. Appl. Math. Comput. Sci. 2022, 32, 657–669.
  37. Siminski, K. GrNFS: Granular neuro-fuzzy system for regression in large volume data. Int. J. Appl. Math. Comput. Sci. 2021, 31, 445–459.
  38. Siminski, K. 3WDNFS—Three-way decision neuro-fuzzy system for classification. Fuzzy Sets Syst. 2023, 466, 108432.
  39. Yao, Y. Three-way decision: An interpretation of rules in rough set theory. In Rough Sets and Knowledge Technology; Wen, P., Li, Y., Polkowski, L., Yao, Y., Tsumoto, S., Wang, G., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 642–649.
  40. Yao, Y. The superiority of three-way decisions in probabilistic rough set models. Inf. Sci. 2011, 181, 1080–1096.
  41. Cpalka, K.; Lapa, K.; Przybyl, A.; Zalasinski, M. A new method for designing neuro-fuzzy systems for nonlinear modelling with interpretability aspects. Neurocomputing 2014, 135, 203–217.
  42. Pancho, D.P.; Alonso, J.M.; Alcalá-Fdez, J.; Magdalena, L. Interpretability analysis of fuzzy association rules supported by fingrams. In Proceedings of the 8th Conference of the European Society for Fuzzy Logic and Technology (EUSFLAT 2013), Milan, Italy, 11–13 September 2013; pp. 469–474.
  43. Siminski, K.; Wnuk, K. Automatic extraction of linguistic description from fuzzy rule base. arXiv 2024, arXiv:2404.03058.
  44. Yousefi, J. A modified NEFCLASS classifier with enhanced accuracy-interpretability trade-off for datasets with skewed feature values. Fuzzy Sets Syst. 2021, 413, 99–113.
  45. Leski, J.M. Fuzzy (c+p)-means clustering and its application to a fuzzy rule-based classifier: Towards good generalization and good interpretability. IEEE Trans. Fuzzy Syst. 2015, 23, 802–812.
  46. Lapa, K.; Cpalka, K.; Rutkowski, L. New Aspects of Interpretability of Fuzzy Systems for Nonlinear Modeling; Springer International Publishing: Cham, Switzerland, 2018; pp. 225–264.
  47. Magdalena, L. Fuzzy Systems Interpretability: What, Why and How; Springer International Publishing: Cham, Switzerland, 2021; pp. 111–122.
  48. Pan, L.; Gao, C.; Zhou, J.; Chen, G.; Yue, X. Three-way decision-based Takagi-Sugeno-Kang fuzzy classifier for partially labeled data. Appl. Soft Comput. 2024, 164, 112010.
  49. Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
  50. Slowik, A.; Cpalka, K.; Lapa, K. Multipopulation nature-inspired algorithm (MNIA) for the designing of interpretable fuzzy systems. IEEE Trans. Fuzzy Syst. 2020, 28, 1125–1139.
  51. de Oliveira, J.V. Semantic constraints for membership function optimization. IEEE Trans. Syst. Man Cybern.-Part A Syst. Humans 1999, 29, 128–138.
  52. Gacto, M.J.; Alcala, R.; Herrera, F. Interpretability of linguistic fuzzy rule-based systems: An overview of interpretability measures. Inf. Sci. 2011, 181, 4340–4360.
  53. Mencar, C. Interpretability of fuzzy systems. In Fuzzy Logic and Applications; Springer International Publishing: Cham, Switzerland, 2013; pp. 22–35.
  54. Saaty, T.L.; Ozdemir, M.S. Why the magic number seven plus or minus two. Math. Comput. Model. 2003, 38, 233–244.
  55. Zhou, S.-M.; Gan, J.Q. Low-level interpretability and high-level interpretability: A unified view of data-driven interpretable fuzzy system modelling. Fuzzy Sets Syst. 2008, 159, 3091–3131.
  56. Siminski, K. FuBiNFS—Fuzzy biclustering neuro-fuzzy system. Fuzzy Sets Syst. 2022, 438, 84–106.
  57. Chen, M.; Zhang, L.; Lu, W.; Liu, X.; Pedrycz, W. Imputation for incomplete data based on granulated single output neural network group. Expert Syst. Appl. 2026, 297, 129321.
  58. Hu, X.; Pedrycz, W.; Wu, K.; Shen, Y. Information granule-based classifier: A development of granular imputation of missing data. Knowl.-Based Syst. 2021, 214, 106737.
  59. Liang, X.; Zou, T.; Guo, B.; Li, S.; Zhang, H.; Zhang, S.; Huang, H.; Chen, S.X. Assessing Beijing’s PM2.5 pollution: Severity, weather impact, APEC and winter heating. Proc. R. Soc. A 2015, 471, 257–276.
  60. Cho, D.; Yoo, C.; Im, J.; Cha, D.-H. Comparative assessment of various machine learning-based bias correction methods for numerical weather prediction model forecasts of extreme air temperatures in urban areas. Earth Space Sci. 2020, 7, e2019EA000740.
  61. Box, G.E.P.; Jenkins, G. Time Series Analysis, Forecasting and Control; Holden-Day, Incorporated: Oakland, CA, USA, 1970.
  62. Acı, M.; Avcı, M. Artificial neural network approach for atomic coordinate prediction of carbon nanotubes. Appl. Phys. A 2016, 122, 631.
  63. Yeh, I.C. Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 1998, 28, 1797–1808.
  64. Sikora, M.; Krzykawski, D. Application of data exploration methods in analysis of carbon dioxide emission in hard-coal mines dewater pump stations. Mech. Autom. Gor. 2005, 413, 57–67.
  65. Sikora, M.; Sikora, B. Application of machine learning for prediction a methane concentration in a coal-mine. Arch. Min. Sci. 2006, 51, 475–492.
  66. Kaya, H.; Tüfekci, P.; Gürgen, S.F. Local and global learning methods for predicting power of a combined gas and steam turbine. In Proceedings of the International Conference on Emerging Trends in Computer and Electronics Engineering (ICETCEE 2012), Dubai, United Arab Emirates, 24–25 March 2012; pp. 13–18.
  67. Tüfekci, P. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. Int. J. Electr. Power Energy Syst. 2014, 60, 126–140.
  68. Alcalá-Fdez, J.; Fernandez, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L.; Herrera, F. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult.-Valued Log. Soft Comput. 2011, 17, 255–287.
  69. Sugeno, M.; Kang, G.T. Structure identification of fuzzy model. Fuzzy Sets Syst. 1988, 28, 15–33.
  70. Takagi, T.; Sugeno, M. Fuzzy identification of systems and its application to modeling and control. IEEE Trans. Syst. Man Cybern. 1985, 15, 116–132.
  71. Czogala, E.; Leski, J. Fuzzy and Neuro-Fuzzy Intelligent Systems; Series Studies in Fuzziness and Soft Computing; Physica-Verlag, A Springer-Verlag Company: Heidelberg, Germany, 2000.
Figure 1. Example of granular imputation of a missing value with three granules: g 1 , g 2 , and g 3 . The tuple x i = [ x i , ? ] has one missing attribute represented with a question mark. The three granules have centres v 1 = [ x 1 , y 1 ] , v 2 = [ x 2 , y 2 ] , and v 3 = [ x 3 , y 3 ] , respectively. The missing value in x i is imputed three times, each time using the corresponding attribute from one of the granule centres, resulting in three imputed tuples: x ¯ g 1 = [ x i , y 1 ] , x ¯ g 2 = [ x i , y 2 ] , and x ¯ g 3 = [ x i , y 3 ] .
Figure 2. Quality of imputation verified with the direct method. (a) Impact of the number of granules on the quality of imputation. (b) Impact of the missing ratio on the quality of imputation.
Figure 3. Pairwise comparison of imputation methods in terms of median E RMSE differences across different datasets and NFS models. Each heatmap shows the difference in performance between methods; positive values indicate better performance of the row method compared to the column method. This is an indirect comparison based on relative median errors between methods, not absolute E RMSE values.
Figure 4. Gaussian fuzzy sets in the premises of the TSK fuzzy models trained on the ‘power’ dataset with 5% of missing values.
Figure 5. Gaussian fuzzy sets in the premises of the ANNBFIS fuzzy models trained on the ‘wankara’ dataset with 1% of missing values.
Figure 6. Gaussian fuzzy sets in the premises of the ANNBFIS fuzzy models trained on the ‘methane’ dataset with 2% of missing values and imputed with a granular imputer ( G = 2 and G = 10 ).
Figure 7. Experimental approximation of the optimal number of granules G as a function of the number of data items X in the dataset (Equation (4)).
Table 1. Essential information (number of data items and number of attributes) about the datasets used in the experiments.
Dataset Name    Number of Data Items    Number of Attributes
‘beijing’       41,757                  6
‘bias-min’      7590                    25
‘box’           290                     11
‘carbon’        10,720                  6
‘concrete’      1030                    9
‘CO2’           2653                    13
‘methane’       1022                    8
‘power’         9568                    5
‘wankara’       1608                    10
Table 2. Results of the direct evaluation of imputation for the ‘beijing’ dataset with a missing ratio of m = 0.1 . The table presents the best results obtained for the dataset sorted with respect to F and F RMSE .
Method         Parameters    F         F_RMSE
GrImp          G = 25        21.217    0.00008468
GrImp          G = 30        21.230    0.00008473
GrImp          G = 20        21.240    0.00008477
kNN-average    k = 5         23.053    0.00009201
kNN-median     k = 5         24.149    0.00009639
average                      29.680    0.00011846
median                       29.808    0.00011897
Table 3. Results of the direct evaluation of imputation for the ‘carbon’ dataset with a missing ratio of m = 0.2 .
Method         Parameters    F         F_RMSE
GrImp          G = 5         26.687    0.00041491
GrImp          G = 20        26.693    0.00041501
GrImp          G = 10        26.702    0.00041515
GrImp          G = 25        26.710    0.00041528
GrImp          G = 30        26.714    0.00041533
GrImp          G = 2         26.761    0.00041606
kNN-average    k = 5         27.035    0.00042032
average                      30.665    0.00047677
kNN-median     k = 5         30.793    0.00047875
median                       30.829    0.00047931
Table 4. Results of the direct evaluation of imputation for the ‘power’ dataset with a missing ratio of m = 0.3 .
| Method | Parameters | F | F_RMSE |
|---|---|---|---|
| GrImp | G = 30 | 12.870 | 0.00026904 |
| GrImp | G = 25 | 12.906 | 0.00026978 |
| GrImp | G = 20 | 12.948 | 0.00027066 |
| GrImp | G = 10 | 13.133 | 0.00027452 |
| GrImp | G = 5 | 13.544 | 0.00028312 |
| kNN-average | k = 5 | 14.081 | 0.00029434 |
| GrImp | G = 3 | 14.165 | 0.00029610 |
| kNN-median | k = 5 | 14.675 | 0.00030676 |
| GrImp | G = 2 | 15.013 | 0.00031383 |
| average | | 21.071 | 0.00044045 |
| median | | 21.241 | 0.00044400 |
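Direct evaluation compares imputed values against the withheld originals. The sketch below illustrates one common protocol under stated assumptions: cells are hidden completely at random with ratio m, and the score is the RMSE over the hidden cells only. The exact definitions of the F and F_RMSE metrics are given in the paper and are not reproduced here; `direct_eval` is a hypothetical name.

```python
import numpy as np

def direct_eval(X, impute, m=0.3, seed=0):
    """Hide a fraction m of the cells of a complete dataset X, run the
    given imputer, and return the RMSE on the hidden cells."""
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < m          # True marks a hidden cell
    X_miss = np.array(X, dtype=float)
    X_miss[mask] = np.nan
    X_hat = impute(X_miss)
    return float(np.sqrt(np.mean((X_hat[mask] - X[mask]) ** 2)))
```

An oracle imputer that returns the original data scores exactly 0, which makes the harness easy to sanity-check before plugging in a real imputer.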
Table 5. Runtime comparison for the ‘beijing’ dataset with a missing ratio of m = 0.2 . The table reports the wall-clock execution times for all evaluated imputers, including classical baselines, kNN, and granular variants.
| Method | Parameters | Time [s] |
|---|---|---|
| average | | 0.0738 |
| median | | 0.0721 |
| GrImp | G = 2 | 0.1268 |
| GrImp | G = 3 | 0.1558 |
| GrImp | G = 5 | 0.2263 |
| GrImp | G = 10 | 0.3918 |
| GrImp | G = 20 | 0.7575 |
| GrImp | G = 25 | 0.8821 |
| GrImp | G = 30 | 1.0468 |
| kNN-average | k = 5 | 67.6906 |
| kNN-median | k = 5 | 67.4996 |
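Wall-clock figures like those above can be collected with a small harness around `time.perf_counter`. The helper below is a sketch, not part of the paper's code; taking the best of several repeats reduces the influence of transient system load.

```python
import time
import numpy as np

def time_imputer(impute, X_miss, repeats=3):
    """Return the best-of-n wall-clock time [s] for one imputation run.

    Each run gets a fresh copy of the incomplete data so that an
    in-place imputer cannot skew later repeats.
    """
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        impute(X_miss.copy())
        best = min(best, time.perf_counter() - t0)
    return best
```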

Share and Cite

Siminski, K.; Wnuk, K. GrImp: Granular Imputation of Missing Data for Interpretable Fuzzy Models. Axioms 2025, 14, 887. https://doi.org/10.3390/axioms14120887
