On Granular Rough Computing: Handling Missing Values by Means of Homogeneous Granulation †

Abstract: This paper continues our work on a previously developed granulation method, homogeneous granulation. The most important new feature of this method compared to our previous ones is that there is no need to estimate optimal parameters: approximation parameters are selected dynamically, depending on the degree of homogeneity of decision classes. This makes the method fast and simple, which is an undoubted advantage even though it gives a slightly lower level of approximation than our other techniques. In this article, we present its performance in the process of missing values absorption. We test selected strategies on synthetically damaged data from the UCI repository. The added value is an investigation of the specific performance of our new granulation technique in absorbing missing values. The effectiveness of this absorption in the granulation process has been confirmed in our experiments.


Introduction
Granular computing is a paradigm dedicated to computing on objects that are similar to each other with respect to a selected similarity measure. The idea was proposed by Lotfi Zadeh [1,2]. Granulation is part of fuzzy theory by the very definition of a fuzzy set, where inverse values of fuzzy membership functions are the basic forms of granules. Shortly after Lotfi Zadeh proposed the idea of granular computing, granules were introduced in terms of rough set theory by T.Y. Lin, L. Polkowski, and A. Skowron. In rough set theory, granules are defined as classes of indiscernibility relations. Interesting research on more flexible granules was conducted by Grzymala-Busse (granules based on blocks) and H.S. Nguyen (granules based on templates). Granules based on rough inclusions were introduced by Polkowski and Skowron [3], and granules based on tolerance or similarity relations and, more generally, binary relations by T.Y. Lin [4] and Y.Y. Yao [5][6][7]. Granulation in the context of rough mereology was proposed by L. Polkowski and A. Skowron, approximation spaces by A. Skowron and J. Stepaniuk [8,9], and logic for approximate reasoning by L. Polkowski and M. Semeniuk-Polkowska [10] and Qing Liu [11]. Examples of interesting studies from recent years can be found in [12][13][14][15][16][17][18].
This work concerns the use of granular rough computing techniques to absorb missing values [19]. The exact theoretical introduction to the family of approximation methods to which our methods belong can be found in [20][21][22]. To make the algorithmic part of the work understandable, we have included all the relevant details in the following sections.
In the granulation process, strategies A and B treat a star as matching any value, so comparing a star with any other value always evaluates as positive. In variants C and D, stars are treated as stars, so they evaluate as positive only when compared with other stars. Those strategies bring up the following granulation definition: In the following variants, we see the process of granule formation, where we use two basic options when comparing descriptors: * = each value and * = *. Obviously, in the * = each value variant, the granules increase their size, i.e., after approximation they potentially absorb more damaged values.
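The two descriptor-comparison options described above can be illustrated with a minimal Python sketch. The function names, the `star_matches_all` flag, and the list-based object representation are assumptions made for illustration and are not taken from the original implementation.

```python
MISSING = "*"  # marker for an unknown value, as in the text


def descriptors_match(a, b, star_matches_all):
    """Compare two attribute values under one of the two star-handling options."""
    if a == MISSING or b == MISSING:
        if star_matches_all:
            # '* = each value' (strategies A and B): a star matches anything
            return True
        # '* = *' (strategies C and D): a star matches only another star
        return a == MISSING and b == MISSING
    return a == b


def matching_ratio(ob1, ob2, star_matches_all):
    """Fraction of attributes on which two objects are indiscernible."""
    hits = sum(descriptors_match(a, b, star_matches_all) for a, b in zip(ob1, ob2))
    return hits / len(ob1)


ob1 = ["a", "*", "c", "d"]
ob2 = ["a", "b", "*", "x"]
print(matching_ratio(ob1, ob2, star_matches_all=True))   # 0.75
print(matching_ratio(ob1, ob2, star_matches_all=False))  # 0.25
```

The same pair of objects is more similar under * = each value than under * = *, which is why granules tend to grow in the former variant.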
Considering ob_1 as the center of the granule, ob_2 as the compared object of the training system, r as the indiscernibility degree of the descriptors, IND as the indiscernibility relation, and d as the decision attribute, we have considered the following options of internal processes for repairing unknown values. For readability, we have placed the legend of the applied markings in Table 1.
for IND defined as where & means AND, means OR.
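As a rough illustration of how the symbols above (granule center, compared object, radius r, IND, and decision d) fit together, the following Python sketch forms a concept dependent granule. It assumes that an object joins the granule when it shares the decision of the center and the fraction of indiscernible attributes is at least r. The names `ind_size` and `concept_dependent_granule` and the `(attributes, decision)` tuple layout are hypothetical.

```python
MISSING = "*"


def ind_size(ob1, ob2, star_matches_all):
    """Count the attributes on which ob1 and ob2 are indiscernible."""
    count = 0
    for a, b in zip(ob1, ob2):
        if a == MISSING or b == MISSING:
            # '* = each value' matches always; '* = *' only when both are stars
            same = star_matches_all or (a == b)
        else:
            same = a == b
        count += int(same)
    return count


def concept_dependent_granule(center, training, r, star_matches_all):
    """Indices of training objects in the granule around training[center].

    Assumed membership rule: same decision as the center and an
    indiscernibility ratio of at least r.
    """
    attrs1, d1 = training[center]
    n = len(attrs1)
    return [
        i
        for i, (attrs2, d2) in enumerate(training)
        if d1 == d2 and ind_size(attrs1, attrs2, star_matches_all) / n >= r
    ]


training = [
    (["a", "b", "c", "d"], 1),
    (["a", "*", "c", "x"], 1),
    (["a", "b", "*", "*"], 0),
    (["z", "*", "y", "x"], 1),
]
print(concept_dependent_granule(0, training, 0.75, star_matches_all=True))   # [0, 1]
print(concept_dependent_granule(0, training, 0.75, star_matches_all=False))  # [0]
```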

Testing Session
This section describes the experimental setup and then presents the results. The effectiveness was calculated on artificially damaged datasets (10 percent of the data was replaced with stars) chosen from the UCI Repository [30]. We perform the procedure five times and report the mean value over the tests (5 × Cross_V5).
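The synthetic damage step can be sketched as follows. This is only an illustration under stated assumptions: cells are chosen uniformly at random without replacement, only conditional attributes are passed in (the decision attribute is left intact), and the function name `damage` and its parameters are hypothetical.

```python
import random

MISSING = "*"


def damage(rows, fraction=0.10, seed=0):
    """Return a copy of rows with about `fraction` of the cells replaced by stars."""
    rng = random.Random(seed)
    damaged = [list(row) for row in rows]  # copy, so the input stays untouched
    cells = [(i, j) for i, row in enumerate(damaged) for j in range(len(row))]
    k = round(fraction * len(cells))
    for i, j in rng.sample(cells, k):
        damaged[i][j] = MISSING
    return damaged


rows = [[r * 10 + c for c in range(10)] for r in range(10)]
damaged = damage(rows, fraction=0.10, seed=42)
print(sum(v == MISSING for row in damaged for v in row))  # 10
```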

Verification of Results Stability
We have computed an additional parameter, Bias_Acc, to show the bias of accuracy; it is defined in Equation (1).
The classifier used for our experiments is a classical kNN, where the smallest summary distance of the k nearest objects indicates the decision parameter value. The k parameters are estimated with the Cross_V5 method on a sample of data, which resulted in k = 5 for Australian Credit, k = 3 for Pima Indians Diabetes, k = 19 for Heart Disease, k = 3 for Hepatitis, and k = 18 for the German Credit data set. We selected the kNN classifier because our past tests of other granulation variants for absorbing unknown values used the same classifier as the base classifier. Our performance tests of the NB classifier, kNN, SVM, and deep neural networks showed that kNN is fully comparable with the best classifiers in the context of classification based on granular reflections.
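One possible reading of the kNN variant described above is sketched below: for each decision class, the k training objects nearest to the test object are taken, their distances are summed, and the class with the smallest sum is returned. The Hamming distance and all function names are assumptions made for illustration; the source does not specify the metric here.

```python
def hamming(x, y):
    """Number of attributes on which two objects differ (assumed metric)."""
    return sum(a != b for a, b in zip(x, y))


def knn_summary_distance(test_obj, training, k):
    """Return the class whose k nearest objects have the smallest summed distance."""
    by_class = {}
    for attrs, d in training:
        by_class.setdefault(d, []).append(hamming(test_obj, attrs))
    # Sum the k smallest distances in each decision class.
    sums = {d: sum(sorted(dists)[:k]) for d, dists in by_class.items()}
    return min(sums, key=sums.get)


training = [
    (["a", "b"], 0),
    (["a", "c"], 0),
    (["x", "y"], 1),
    (["x", "b"], 1),
]
print(knn_summary_distance(["a", "b"], training, k=2))  # 0
print(knn_summary_distance(["x", "y"], training, k=2))  # 1
```

Note that in this sketch a class with fewer than k objects sums fewer distances, so a practical implementation would need to handle very small classes explicitly.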

Overview of the Testing Results
The results of missing values absorption using concept dependent granulation are shown in Tables 2-6. For homogeneous granulation, please refer to Tables 7-11. As a conclusion of the research presented in [22], we can say that granulation is an effective technique for absorbing some degree of missing values in a dataset. Our observations were confirmed by classification results comparable with those for data without missing values. We need to point out that granulation brings another important benefit: it can significantly (by up to 80 percent) reduce the number of objects used for classification. As shown in [22], this behavior strictly depends on the diversity of the datasets used. Using strategies A and B for lower values of the granulation radius, the approximation is faster because the * = each value variant causes a higher number of objects in the granules. In the case of * = *, stars can increase the diversity of the data and consequently yield a higher number of granules containing fewer objects than in the * = each value case.
Table 2. Missing values handling using conc_dep granulation technique; 5 × Cross_V5; A, B, C, D variants vs. nil case (classification based on original, undamaged training system); Australian Credit; synthetic 10% damage; radius = indiscernibility ratio; Bias_Acc = defined in Equation (1); Gran_Size = the number of training objects after granulation.
Table 3. Missing values handling using conc_dep granulation technique; 5 × Cross_V5; A, B, C, D variants vs. nil case (classification based on original, undamaged training system); Pima Indians Diabetes; synthetic 10% damage; radius = indiscernibility ratio; Bias_Acc = defined in Equation (1); Gran_Size = the number of training objects after granulation.
Comparing these results with homogeneous granulation used as a missing values absorption method gives the following findings. This technique increases the number of granules in the coverings (see Tables 7-11), and the indiscernibility, in the context of decision classes, decreases. This gives a higher probability of finding an object which breaks the homogeneity of the formed granule. Despite the fact that strategies A and B return smaller granules than strategies C and D, the final granular reflection systems are bigger.
For given parameters, our methods work in a stable way, and the results are comparable to the nil case. The biggest advantage of homogeneous granulation is that it requires only a single run, which might be the decisive factor when looking for the most robust method.
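The single-run character of homogeneous granulation can be sketched in Python as follows. The sketch assumes that, for each granule center, the radius is the smallest value k/n (n being the number of attributes) at which every object within that indiscernibility ratio shares the decision of the center. The fallback to a one-object granule and all names are assumptions made for illustration.

```python
MISSING = "*"


def ratio(ob1, ob2, star_matches_all):
    """Fraction of attributes on which two objects are indiscernible."""
    hits = 0
    for a, b in zip(ob1, ob2):
        if a == MISSING or b == MISSING:
            hits += int(star_matches_all or a == b)
        else:
            hits += int(a == b)
    return hits / len(ob1)


def homogeneous_granule(center, training, star_matches_all):
    """Return (radius, member indices) of the homogeneous granule of training[center]."""
    attrs1, d1 = training[center]
    n = len(attrs1)
    ratios = [ratio(attrs1, attrs2, star_matches_all) for attrs2, _ in training]
    for k in range(n + 1):
        r = k / n
        members = [i for i, q in enumerate(ratios) if q >= r]
        if all(training[i][1] == d1 for i in members):
            return r, members  # smallest radius with a homogeneous granule
    # Assumed fallback: an indiscernible object of another class exists.
    return 1.0, [center]


training = [
    (["a", "b", "c", "d"], 1),
    (["a", "b", "c", "x"], 1),
    (["a", "b", "*", "y"], 1),
    (["a", "*", "*", "z"], 0),
]
print(homogeneous_granule(0, training, star_matches_all=True))   # (1.0, [0])
print(homogeneous_granule(0, training, star_matches_all=False))  # (0.5, [0, 1, 2])
```

In this toy system, the * = each value option makes the object from the other decision class look similar to the center, so homogeneity is broken earlier and the granule is smaller than under * = *.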
The results for the strategy of completing unknown values with the most common values [31] can be found in Table 12. As we can see, they are equivalent to the results for radius 1 in our strategies, where there is no approximation of training systems. Additionally, in Table 13, we have included the degrees of homogeneity of the examined systems, i.e., the range of radii that appears during the homogeneous granulation process.
Table 12. Missing values handling using the most common value strategy; 5 × Cross_V5; we consider repair options when * = each value, * = *, and the nil case (classification based on original, undamaged training system); synthetic 10% damage; Bias_Acc = defined in Equation (1).
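For orientation, a plain version of the most common value strategy might look like the sketch below. It assumes that the mode is computed per attribute over all known (non-star) values in the training system and does not attempt to reproduce the two repair options listed in the caption of Table 12. The function name is hypothetical.

```python
from collections import Counter

MISSING = "*"


def fill_most_common(rows):
    """Replace each star with the most common known value of its attribute."""
    n_attrs = len(rows[0])
    fill = []
    for j in range(n_attrs):
        counts = Counter(row[j] for row in rows if row[j] != MISSING)
        # Leave the star in place if the whole column is unknown.
        fill.append(counts.most_common(1)[0][0] if counts else MISSING)
    return [[fill[j] if v == MISSING else v for j, v in enumerate(row)] for row in rows]


rows = [["a", "x"], ["a", "*"], ["*", "y"], ["b", "y"]]
print(fill_most_common(rows))
# [['a', 'x'], ['a', 'y'], ['a', 'y'], ['b', 'y']]
```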

Conclusions
Comparing concept dependent and homogeneous granulation as missing values absorption techniques, we can draw the following conclusions.
The * = each value variant used with concept dependent granulation generates more approximate datasets (diversity reduction), while the * = * case may increase the diversity. The granules are smaller for strategies C and D than for strategies A and B. Granulation of systems containing missing values reduces their size to a much higher degree than the granulation of undamaged datasets.
We can observe specific results when using homogeneous granulation as a missing values absorption technique. When comparing the results to the nil case (granulation of the undamaged dataset), granules in strategies A and B are smaller than those in C and D. This happens because the * = each value case breaks the homogeneity of the decision classes to a higher degree than the * = * case. The approximation level decreases for damaged datasets.
Granulation techniques absorb missing values in an effective way, as confirmed by the classification results of the Cross_V5 model. Most missing values are repaired during the granulation process, no matter which technique is used.
In our further research, we are going to choose the most effective technique among known classifiers for specific types of data. We also plan to implement homogeneous granulation and check its effectiveness in the context of classification based on deep neural networks.
Funding: This work has been fully supported by the grant from the Ministry of Science and Higher Education of the Republic of Poland under the project number 23.610.007-000.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.