On a Hybridization of Deep Learning and Rough Set Based Granular Computing

The set of heuristics constituting the methods of deep learning has proved very efficient in complex problems of artificial intelligence, such as pattern recognition and speech recognition, solving them with better accuracy than previously applied methods. Our aim in this work has been to integrate the concept of the rough set into the repository of tools applied in deep learning, in the form of rough mereological granular computing. In our previous research we have demonstrated the high efficiency of our decision system approximation techniques (creating granular reflections of systems), which, with a large reduction in the size of the training systems, maintained the internal knowledge of the original data. The current research has led us to the question of whether granular reflections of decision systems can be effectively learned by neural networks and whether deep learning is able to extract the knowledge from the approximated decision systems. Our results show that granulated datasets perform well when mined by deep learning tools. We have performed exemplary experiments using data from the UCI repository; the PyTorch and TensorFlow libraries were used for building the neural network and the classification process. It turns out that the deep learning method works effectively on reduced training sets. Approximating decision systems before neural network learning can be an important step towards making learning feasible in reasonable time.


Introduction
This paper is divided into parts dedicated to deep learning, rough sets, and granular computing by means of rough mereology. Deep learning, as a collection of techniques, is rooted in artificial neural networks (ANNs) [1][2][3]. The idea of a neural network is that of an acyclic directed graph whose nodes are computing units (neurons) joined by edges labelled with weights. Nodes with input degree zero are called inputs, while nodes with output degree zero are called outputs. The flow of information is forward: from input nodes to output nodes. Nodes are classified into layers, each layer defined recurrently starting from the input layer. Computation by a neural net begins with the input vector; in the simple case of a sigmoidal perceptron with input x, the output is given as f(x), f being a sigmoidal activation function. The result of the computation is the vector output by the output layer of neurons.
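The forward computation of a single sigmoidal perceptron described above can be sketched as follows (a minimal illustration; the function names are ours, not from any particular library):

```python
import math

def sigmoid(z: float) -> float:
    # Logistic activation f: maps any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_output(x, w, b):
    # Weighted sum of the inputs followed by the sigmoidal activation f.
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return sigmoid(z)

# With zero weights and zero bias the output is f(0) = 0.5.
print(perceptron_output([1.0, 2.0], [0.0, 0.0], 0.0))  # → 0.5
```

A full network composes such units layer by layer, feeding the outputs of one layer as inputs to the next.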
The learning procedure for ANNs is a series of computations on sequences of training vectors x_i which stops when the output vector is sufficiently close to the target vector on each input vector. Theoretically justified by the Perceptron Learning Theorem [4], the method of learning by changing weights via the delta rule proved effective once the backpropagation technique came into usage [5]. Deep learning proceeds further by enhancing the neural net with many filters that allow many local features to be exhibited. Some variants, like LSTM, allow reaching deep back into the memory of the process, which makes such networks especially effective in, for example, speech processing [6]. For a general introduction please consult [7].
Interesting research in the field of granular computation with the use of neural network techniques can be found in the works [8][9][10]. To the best of our knowledge, there is no research similar to ours in this context, so direct comparison is difficult. Moreover, the aim of our work is to check whether the data prepared by our granulation techniques can be learned by deep neural networks; it is not our goal to show that we have the best technique for reducing training systems.
Rough set theory [11] approaches data in set-theoretical terms by assuming that on each collection of vectors representing some objects, a partition is obtained, its classes representing distinct concepts/categories pertaining to those objects. A general concept, i.e., a set of objects in a given collection (the universe) is perceived through categories: some concepts can be expressed by categories in a deterministic way and some may not. The former are exact concepts (modulo the given partition into categories) while the latter are inexact (rough) concepts. Each rough concept can only be expressed in terms of its relation to categories by approximations: the lower approximation of a concept consists of categories (or, exact concepts) contained in the concept whereas the upper approximation consists of categories intersecting the given concept.
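The lower and upper approximations just defined can be computed directly once the partition into categories is given. The following minimal sketch (our own illustration, with hypothetical names) operates on sets of object identifiers:

```python
def approximations(classes, concept):
    # classes: the partition of the universe into indiscernibility classes
    # (categories); concept: the set of objects to be approximated.
    # Lower approximation: union of classes fully contained in the concept.
    # Upper approximation: union of classes intersecting the concept.
    lower, upper = set(), set()
    for c in classes:
        if c <= concept:
            lower |= c
        if c & concept:
            upper |= c
    return lower, upper

classes = [{1, 2}, {3, 4}, {5}]
concept = {1, 2, 3}
lo, up = approximations(classes, concept)
print(sorted(lo))  # → [1, 2]
print(sorted(up))  # → [1, 2, 3, 4]
```

Here the concept {1, 2, 3} is rough: its lower and upper approximations differ, since the category {3, 4} intersects it without being contained in it.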
A means for dealing with data is provided by the notion of an information system (see Pawlak, op. cit.), which is a tuple (U, A, V, f) where U is a universe of objects, A is a set of attributes, V is a set of attribute values, and f is a mapping which assigns to each object x ∈ U and each attribute a ∈ A the value f(x, a) ∈ V. Categories obtained in this case are classes of the indiscernibility relation: IND_B(x, y) holds if and only if a(x) = a(y) for each a ∈ B, where B is a subset of the set A of attributes. A special case of an information system is a decision system with signature (U, A, V, f, d) where d is a new attribute not in A, called the decision. A relation between the relations IND_B and IND_d for some B is called a decision algorithm over B. For algorithmic methods of inducing decision rules please see [12,13]. A far-reaching extension of rough set theory is rough mereology [14]. Rough mereology applies as its primitive notion that of a part to a degree [15]. Parts to a degree are subject to a few basic restrictions which reflect properties of partial containment: each object is a part of itself to the degree of 1, and if an object x is a part to a degree of 1 of an object y, then for each object z, the degree to which z is contained in x is not greater than the degree to which z is contained in y. Rough mereology was in turn applied in a formal definition of granules of knowledge [16,17]. Formally, given a measure m of partial containment (called in Polkowski–Skowron a rough inclusion), a granule g(x, r) of the radius r about an object x is the collection of all objects which are parts of x to degrees of at least r. Consult [15] for a deeper discussion of computing with granules. On the basis of computing with rough mereological granules, an approach to data mining was proposed [16].
This approach consists in transforming a given decision system (data set) (U, A, V, f, d) into a granular decision system (G, A′, V, f′, d′, r) where: G is a set of granules of radius r about objects in U which provides a covering of U; A′ is the set of attributes a′ for a ∈ A, each a′ mapping each granule g into the value set V according to the formula a′(g) = S(a(u) : u ∈ g), where S is a selected strategy, e.g., majority voting with random tie resolution; V is the value set, unchanged; f′(a′, g) = a′(g); and d′ is defined in the same manner as a′. To the granular decision system (G, A′, V, f′, d′, r) any standard algorithm for rule induction can be applied for all plausible values of the radius r. This ends our introduction of the main ingredients of our approach. In the following sections we give details of our approach and present results.
In this work we focus on the effectiveness of deep learning on reduced decision systems; we check the level of internal knowledge maintenance in terms of classification effectiveness.
The rest of the paper has the following content. In Section 2 there is a detailed description of the granulation technique used in the experimental part. In Section 3 we describe the artificial network architecture. In Section 4 we present the experimental part. We conclude our work in Section 5.

Reducing the Size of Decision-Making Systems Based on Their Granular Reflections
As a reference technique, we have chosen one of our best methods for the approximation of decision systems (concept-dependent granulation), which works analogously to the baseline procedure described in this section, except that granule formation takes place separately within decision classes.
Granulation consists in reducing the size of the training decision-making system through the process of creating granular reflections of data.
The definition of the concept-dependent granule formation is in Section 2.2.
Let us move on to the basic technique. Our methods are based on rough inclusions. An introduction to rough inclusions in the framework of rough mereology is available in [16,18]; a detailed, extensive discussion can be found in [15].
In Polkowski's granulation procedure, we can distinguish three basic steps.
1. First step: granulation. We begin by computing granules around each training object using the selected method; in the method used in this article, by surrounding each object of the training system with objects indiscernible from it to the degree determined by the granulation radius.
2. Second step: covering. The training decision system is covered by selected granules. After the calculation of granules in step 1, a group of granules that covers the entire training system with its objects is sought.
3. Third step: building the granular reflection. The granular reflection of the original training decision system is derived from the granules selected in step 2. We form new objects by converting granules using majority voting.
We start with a detailed description of the basic method; see [16].

Standard Granulation
For the sake of simplicity, we use the following definition of a decision system: it is a triple (U, A, d), where U is the universe of objects, A is the set of conditional attributes, and d ∉ A is the decision attribute; the granulation radius r_gran is taken from the set {0, 1/|A|, 2/|A|, ..., 1}. The standard rough inclusion µ, for u, v ∈ U and a selected r_gran, is defined as

µ(v, u, r_gran) if and only if |IND(u, v)| / |A| ≥ r_gran,

where

IND(u, v) = {a ∈ A : a(u) = a(v)}.

For each object u ∈ U and selected r_gran, we compute the standard granule g_{r_gran}(u) as follows:

g_{r_gran}(u) = {v ∈ U : µ(v, u, r_gran)}.

In the next step we use a selected strategy to cover the training decision set U by the computed granules; random choice is the simplest among the most effective strategies studied in [19]. All studied methods are available in [19] (pp. 105-220).
In the last step, the granular reflection of the training decision set is computed with use of the majority voting procedure. Ties are resolved randomly.
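The three steps above can be sketched as follows (a minimal illustration on a toy decision system; the function names and the dictionary-based representation are our own assumptions, not code from the experiments):

```python
import random
from collections import Counter

def granule(i, U, attrs, r):
    # Step 1: standard granule about object i — indices of all objects that
    # agree with it on a fraction >= r of the conditional attributes.
    u = U[i]
    return [j for j, v in enumerate(U)
            if sum(u[a] == v[a] for a in attrs) / len(attrs) >= r]

def random_covering(U, granules):
    # Step 2: choose granules in random order until every object is covered.
    uncovered, cover = set(range(len(U))), []
    for g in random.sample(granules, len(granules)):
        if uncovered & set(g):
            cover.append(g)
            uncovered -= set(g)
    return cover

def granular_reflection(U, cover, attrs):
    # Step 3: majority voting per attribute turns each covering granule into
    # one new object; ties fall to the first-seen value here, rather than a
    # random one as in the paper.
    return [{a: Counter(U[j][a] for j in g).most_common(1)[0][0] for a in attrs}
            for g in cover]

# Toy decision system: 4 objects, conditional attributes a1, a2, decision d.
U = [{"a1": 1, "a2": 0, "d": 0}, {"a1": 1, "a2": 1, "d": 0},
     {"a1": 0, "a2": 1, "d": 1}, {"a1": 0, "a2": 0, "d": 1}]
grans = [granule(i, U, ["a1", "a2"], 0.5) for i in range(len(U))]
reflection = granular_reflection(U, random_covering(U, grans), ["a1", "a2", "d"])
print(len(reflection) <= len(U))  # → True: the reflection never grows
```

The granular reflection replaces the training set: it has at most as many objects as the original system, which is the source of the size reduction reported later.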
The process of granulation can be traced with the help of the triangular part of the granular indiscernibility matrix.

Concept Dependent Granulation
A concept-dependent (cd) granule g^cd_{r_gran}(u) of the radius r_gran about u is defined as follows:

g^cd_{r_gran}(u) = {v ∈ U : µ(v, u, r_gran) and d(v) = d(u)},

i.e., the standard granule restricted to the decision class of u. For the granulation radius r_gran = 0.25, the granular concept-dependent indiscernibility matrix (gcdm) is shown in Table 2.
Hence, the granules in this case can be read from Table 2. Considering the random choice, the covering can be {g^cd_{0.25}(u_6)}. The concept-dependent granular decision system formed from this covering is shown in Table 3.

Table 3. Concept-dependent granular decision system for (U, A, d) and radius 0.25.

Majority voting was applied only to the granule g^cd_{0.25}(u_6).
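The concept-dependent granule can be sketched in the same style as the standard one (our own illustrative code, not from the experiments): the only change is the extra restriction to the decision class of the central object.

```python
def cd_granule(i, U, attrs, dec, r):
    # Concept-dependent granule: the standard granule about U[i] restricted
    # to objects from the same decision class.
    u = U[i]
    return [j for j, v in enumerate(U)
            if v[dec] == u[dec]
            and sum(u[a] == v[a] for a in attrs) / len(attrs) >= r]

U = [{"a1": 1, "a2": 0, "d": 0},   # u0
     {"a1": 1, "a2": 1, "d": 0},   # u1: same class, agrees with u0 on a1
     {"a1": 1, "a2": 0, "d": 1}]   # u2: identical attributes, other class
print(cd_granule(0, U, ["a1", "a2"], "d", 0.5))  # → [0, 1]; u2 is excluded
```

Because each granule stays inside one decision class, majority voting on the decision attribute is trivial, which is why the resulting reflections preserve class structure well.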

Design of the Experimental Part
A general scheme of the experimental part design is shown in Figure 1. The neural network architecture was chosen experimentally. For each tabular dataset used, we ran our experiment with the same network architecture to make the results more comparable. We conducted 10 series of tests for each system tested. Since the data sets used are small in size, we built a network with a simple architecture. In addition, we selected sets that have two decision classes at the output.
In subsequent experiments, the diversity of sets will be increased.

Figure 1. The diagram shows a scheme of our experimental part. The exact design of the neural network is in Figure 3.

The data that is fed into the neural network is normalized after granulation to the range [0, 1].
Our network consists of an input layer, two hidden linear layers, and an output layer. The input layer is the only layer whose size is dynamic and depends on the dataset used. The first hidden layer consists of 30 neurons and the second one of 20 neurons. The output layer consists of only two neurons, as the decision classes are binary in all used datasets.
We used the hyperbolic tangent as the activation function in layers 1 and 2 and the softmax function in layer 3. Our network uses the Adam optimizer with a learning rate of 0.001. Cross entropy was used to calculate the value of the loss function. Each iteration is performed across 500 epochs.
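The described architecture can be sketched in PyTorch roughly as follows (a minimal illustration, not the exact experimental code; the class name GranularNet and the input size of 14 are our assumptions, and since nn.CrossEntropyLoss applies log-softmax internally, the sketch returns raw logits and would apply softmax only at prediction time):

```python
import torch
import torch.nn as nn

class GranularNet(nn.Module):
    # Input size is dynamic (n_features), then 30 and 20 tanh units,
    # and a 2-neuron output for the binary decision classes.
    def __init__(self, n_features: int):
        super().__init__()
        self.hidden1 = nn.Linear(n_features, 30)
        self.hidden2 = nn.Linear(30, 20)
        self.out = nn.Linear(20, 2)

    def forward(self, x):
        x = torch.tanh(self.hidden1(x))
        x = torch.tanh(self.hidden2(x))
        # Raw logits are returned; nn.CrossEntropyLoss applies log-softmax
        # itself, and torch.softmax can be used at prediction time.
        return self.out(x)

model = GranularNet(n_features=14)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(8, 14)          # granulated data normalized to [0, 1]
y = torch.randint(0, 2, (8,))
loss = loss_fn(model(x), y)    # one step of the 500-epoch training loop
loss.backward()
optimizer.step()
```

In the experiments, the training tensor x would hold a granular reflection of the training part of a dataset rather than random values.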

Procedure for Performed Experiments
The general scheme of the tests carried out and the detailed neural network architecture are presented in the figures above.

The Results of Experiments
In this section we show exemplary results for our selected technique, to demonstrate the effectiveness of deep learning in classification based on reduced training data. We report the results of the Monte Carlo Cross Validation method with 10 splits for selected data sets (see Table 4) from the UCI repository [20]. The internal knowledge of the original training decision systems, measured by the ability to classify, seems to be preserved to a sufficient degree (the accuracy of classification is comparable with the nil case, i.e., without reduction). The nil case corresponds to radius 1.
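The Monte Carlo Cross Validation scheme amounts to repeating an independent random train/test split a fixed number of times, as in the following sketch (our own illustration with hypothetical names; the training fraction is an assumption):

```python
import random

def monte_carlo_cv(data, n_splits=10, train_frac=0.6):
    # Monte Carlo cross-validation: repeat an independent random train/test
    # split n_splits times; in our setting, each training part would be
    # granulated before the network is trained on it.
    splits = []
    for _ in range(n_splits):
        shuffled = random.sample(data, len(data))
        cut = int(train_frac * len(data))
        splits.append((shuffled[:cut], shuffled[cut:]))
    return splits

splits = monte_carlo_cv(list(range(100)))
print(len(splits))        # → 10
print(len(splits[0][0]))  # → 60 objects in each training part
```

Unlike k-fold cross-validation, the test parts of different splits may overlap, so the reported accuracy is the mean over the 10 independent splits.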
The results of the experiments showed the usefulness of learning neural networks on granular data.
In addition to the accuracy of classification, we present the percentage of the training size (the size after granulation) for 10 iterations of learning (see Figures 9 and 10). The results are not perfectly evenly matched, nor at the same points on the x-axis, because the size reduction levels of the training systems varied.

Figure 9. Results for 10 learning cycles, using 10 splits, for the Australian Credit data set converted to dummy variables (after conversion it has 35 attributes); the 'percentage of objects' axis shows the percentage size of the granulated data, and the 'Accuracy' axis shows the accuracy of classification.

Figure 10. Dummy variables: mean result for 10 learning cycles, using 10 splits, for the Australian Credit data set; after conversion to dummy variables it has 35 attributes.

When considering the results for the Heart Disease data set (see Figures 5 and 6 and Table 6), for a radius of 0.756, with a reduction in the number of training objects of up to 67 percent, we get an accuracy of 0.756 compared to 0.825 on the original system. For a radius of 0.643, with a reduction of nearly 42 percent, we get an accuracy of 0.781, while for a radius of 0.714, where granulation reduces the number of objects by 16 percent, we get 0.823.
In the case of the results for the Australian Credit system (see Figures 3 and 4 and Table 5), with a radius of 0.6 we get a reduction of about 70 percent and a classification accuracy of 0.813, compared to 0.856 on the original system. In the case of a 0.667 radius, with a reduction of 41 percent, we get an accuracy of 0.84. In the case of radius 0.733, with a reduction in training system size of 14 percent, we get an accuracy of 0.853.
In our next experiment, for the Pima Indians Diabetes system (see the results in Figures 7 and 8 and Table 7), for a radius of 0.333 and a reduction in the number of objects of about 73 percent, we get an accuracy of 0.7 compared to 0.772 on non-granulated data. In the case of a radius of 0.444 and a reduction in the number of objects of about 38 percent, we get an accuracy of 0.756; finally, for radius 0.55 with an 11 percent reduction, we obtained an accuracy of 0.772.
As an additional result, we added the learning effect on the Australian Credit data set after the conversion of its symbolic attributes to dummy variables. From the results of Figures 9 and 10 and Table 8 we see that the classification after the conversion is comparable. As an exemplary result, for radius 0.775, with an 83 percent reduction in training set size, the accuracy is 0.789. In the case of radius 0.885, the accuracy is 0.817 with a 55 percent reduction. For radius 0.85, with a 31 percent reduction, we reached an accuracy of 0.832. For a radius of 0.875, with an 11 percent reduction in training size, the accuracy is 0.839. In the nil case for the dummy variant the accuracy is equal to 0.839.
Despite the fact that the classification results are not the best among the techniques we have previously applied to granular data (among others, the Naive Bayes classifier [19], SVM [21], and rough set based classifiers [19]), we are pleased that neural networks are able to maintain high classification efficiency when working on granular data. We treat these results as a preview of future intensive research on the application of granular computing techniques in the context of learning neural networks.
Tables 5-8 present more information about our experiments. As an explanation, please refer to the list of column labels:
• gran_rad: granulation radius as a percentage value,
• no_of_gran_objects: number of new objects in the tested decision system after the granulation process,
• percentage_of_objects: percentage of objects in the tested decision system compared to the primary decision system size,
• time_to_learn: time needed to complete the learning process using the given data,
• accuracy: classification accuracy for the given neural network.

Conclusions
This paper contains results that show how granular reflections of decision systems can be used in deep learning. For experimental purposes we selected the most effective method among those studied, the concept-dependent variant, and performed learning on selected data from the UCI Repository based on the TensorFlow library. It turned out that the designed neural network works on approximated data in an effective way, when measured by classification accuracy. The patterns contained in the granulated data seem to be preserved in the neural network structures. The experiments in our work have shown that our approximation techniques for tabular decision systems can be an effective pre-processing step before learning with deep neural networks. Reduced data, while retaining internal knowledge, gives the opportunity for faster learning of networks. In future works we are planning to examine a set of neural network architectures to use with our approximation methods. We are also considering the use of granular structures in the convolutional part of the preparation of data for learning by means of neural networks.
Funding: This work has been fully supported by a grant from the Ministry of Science and Higher Education of the Republic of Poland under project number 23.610.007-000.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.