Subgroup Preference Neural Network

Subgroup label ranking, which aims to rank groups of labels using a single ranking model, is a new problem in preference learning. This paper introduces the Subgroup Preference Neural Network (SGPNN), which combines multiple networks that have different activation functions, learning rates, and output layers into one artificial neural network (ANN) to discover the hidden relations between the subgroups' multi-labels. The SGPNN is a feedforward (FF), partially connected network that has a single middle layer and uses a stairstep (SS) multi-valued activation function to enhance the prediction probability and accelerate the ranking convergence. The novel structure of the proposed SGPNN consists of multi-activation function neurons (MAFNs) in the middle layer to rank each subgroup independently. The SGPNN uses gradient ascent to maximize the Spearman ranking correlation between the groups of labels. Each label is represented by an output neuron that has a single SS function. The proposed SGPNN using conjoint datasets outperforms other label ranking methods that use each dataset individually. The proposed SGPNN achieves an average accuracy of 91.4% using the conjoint dataset, compared to supervised clustering, decision tree, multilayer perceptron label ranking and label ranking forests, which achieve average accuracies of 60%, 84.8%, 69.2% and 73%, respectively, using the individual datasets.


Introduction
Preference learning (PL) is an extended paradigm in machine learning that induces predictive ranking models from experimental data [1][2][3]. PL is applied to many different research areas, such as knowledge discovery and recommender systems, for learning rankings [4]. Object, instance, and label ranking are the three main categories of PL. Label ranking (LR) is a challenging problem that has gained importance in information retrieval by search engines [5,6]. Unlike the common problems of regression and classification, label ranking involves predicting the relationship between multiple label orders. Multi-label ranking problems are based on preference relations over a permutation space ω, where each member of a group of k labels has a preference value λ, L = {λ_1, λ_2, ..., λ_k}, and the differences in λ values represent preference relations (≻, ⪰, ∼, ⪯, ≺) [1,7]. However, real-world data can be ambiguous and often lack preference relations between two or more labels, and the missing relations can be mapped to an indifference (∼) or incomparability (⊥) relation [8,9]. These two relations create a partial order on the ω space, where λ_a ⊥ λ_b or λ_a ∼ λ_b. The partial relations are solved in terms of the relation between labels in one ω space in [10,11]. For example, π = (λ_a ≻ λ_b ∼ λ_c ≻ λ_d) is mapped to π = (1, 2, 2, 3), and π = (λ_a ≻ λ_b ≻ λ_c ⊥ λ_d) is mapped to π = (1, 2, 3, 0). However, sometimes the data collected from the likes of recommender systems, elections, and surveys deviate from the population, and in such cases label ranking cannot be predicted using the same learning model. Such a deviation is addressed by extracting patterns to identify the subgroups of data for the interesting targets using subgroup discovery (SD) approaches [12].
Subgroup discovery (SD) is a descriptive induction data mining technique that discovers interesting associations among different variables with respect to a property of interest.
• A restricted label order π = (λ_a ≻ λ_b ≻ λ_c ≻ λ_d) can be represented as π = (1, 2, 3, 4).
• A non-restricted total order π = (λ_a ≻ λ_b ∼ λ_c ≻ λ_d) can be represented as π = (1, 2, 2, 3), where a, b, c and d are the label indexes and λ_a, λ_b, λ_c and λ_d are the ranking values of these labels, respectively.
The pairwise approach was first introduced by Hüllermeier, E. [25] to divide the label ranking problem into several binary classification problems in order to predict the pairs of labels, i.e., λ_i ≻ λ_j or λ_j ≺ λ_i for an input x. Cheng, W. and Hühn, J. proposed the instance-based decision tree to rank the labels based on predictive probability models of a decision tree [26]. Grbovic, M. combined both a decision tree and supervised clustering in two approaches for label ranking by mapping between instances and the label ranking space [27]. The artificial neural network (ANN) for label ranking was first introduced as RankNet by Burges, C. to solve the problem of object ranking for sorting web documents from a search engine [28]. RankNet uses gradient descent and a probabilistic ranking cost function for each object pair. The multilayer perceptron for label ranking (MLP-LR) [29] employs a network architecture with a sigmoid activation function to calculate the error between the actual and expected values of the output labels. However, it uses a local approach that minimizes the individual error per output neuron by subtracting the actual predicted value, and uses the Kendall error as a global approach; a ranking error function had not previously been used in the backpropagation (BP) and learning steps. The ranking methods mentioned above and their variants have some issues that can be broadly categorized into two types:
• The ranking methods are based on probability and classification; thus, they do not learn the preference relation between labels divided into groups.
• The ranking methods learn both unrestricted and restricted ranking labels using the same learning approach.
This paper proposes the SGPNN as a tool to support SD analysis by ranking the discovered subgroups. In addition, the SGPNN converts unrestricted label ranking into groups of restricted labels and learns the groups of labels simultaneously using one model. The SGPNN is built upon the preference neural network (PNN) to rank subgroup label data D ∈ {x_n, (π_n1 ⊥ π_n2 ... ⊥ π_nm)}, where π_n is a group of labels and m is the number of subgroups. The primary motivation of this work is to build a unified predictive ranking model instead of having different models for different label groups.
The label groups are employed in the following scenarios:
1. Real customer data often explicitly rate different categories of products and services as multi-label subgroups, e.g., restaurant rating based on food quality and customer service [30].
3. Multi-label data that have unrestricted preference relations between labels are converted into connected subgroups that have restricted relations. This can be seen in the sushi dataset [33,34], where λ_a ≻ (λ_b, λ_c) is solved by two subgroups. Another example is no-ground-truth data, where one data record has two labels π_x = (λ_a ≻ λ_b) and π_x = (λ_b ≻ λ_a), which are mapped to π_x = (λ_a ≻ λ_b) ⊥ (λ_b ≻ λ_a). The current challenge of the proposed SGPNN is the lack of datasets that represent the labels in a subgroup. Therefore, the datasets are synthesized from real data from single or multiple domains.
To sum up, the key contributions of this paper are:
• Introducing a novel multi-activation function neuron (MAFN), which uses multiple activation functions where each function serves a group of output labels.
• Ranking groups of labels that have incomparable/indifference relations simultaneously.
• Discovering the hidden relation between different datasets by learning them together in one model, a novel approach to build an accumulative learning approach.
• Solving data ambiguity by removing duplicated records that have different labels and marking the class-overlap data with subgroup labels.

The Proposed SGPNN
This section gives an overview of the activation function, error functions, PNN and SGPNN architecture and its functionality.

StairStep (SS) Activation Function
The classical ANN activation functions have a binary output or a range of values based on a threshold. However, these functions do not produce multiple values for different segments of the x-axis. The stairstep (SS) function is introduced to slow the effective learning rate around different rank values on the y-axis to solve the problem of ranking instability. The SS function is designed to be non-linear, monotonic, continuous, and differentiable by using a polynomial of the tanh(x) function. The step width keeps the ranking stable during the forward and backward processes.
Aizenberg, I. [35] proposed a generalized multiple-valued neuron using a convex shape to support complex-number neural networks and multi-valued numbers. In addition, Moraga, C., and Heider, R. [36] introduced a similar function to design networks for realizing any multivalued function; however, the exponential function derivative used by Moraga, C. did not give promising results in the PNN implementation using the ranking objective function in the FF and backpropagation (BP) steps. Each neuron has a multivalued SS activation function used to calculate the ranking between labels, s = n + 1, where s is the number of steps and n is the number of ranked labels. The SS has a fixed sharp stair-like edge to accelerate the convergence rate and provide multivalued output from −∞ to ∞, as shown in Figure 1. In order to be able to rank a large number of labels, the SS function effectively has a dynamic domain (on the x-axis), depending on a parameter b, to achieve an adequate step width on the x-axis. Therefore, the input data are normalized from −b to b. We assume a heuristic rule of boundary value to capture the data range as b = 2n, where b is the geometric x-axis boundary. The SS activation function is given in Equation (1).
where c = 100 is a constant value chosen to create the sharp step edge, n is the number of ranked labels, and SS is located between the geometric boundaries −b and b on the x-axis. Each step represents a preference value on the y-axis from 1 to ∞. The incomparable relation between labels, ⊥, is mapped to 0. As shown in Figure 2, the SS step horizontal segments are not absolutely horizontal lines but slope slightly to slow the changing rate around preference values. SS has been tested against other activation functions, and it shows ranking performance stability for complete labels and for 60% missing labels, as shown in Figure 2a,b, respectively. Figure 3 illustrates the graphical comparison between the Sigmoid and SS functions when ranking the stock dataset by summing the output weights for each neuron of the middle layer. Sigmoid goes from ρ = 0.3579 in 200 epochs to ρ = 0.7876 in 1600 epochs, as shown in Figure 3a,b, for ranking 5 labels. However, the SS function goes from ρ = 0.4975 in 30 epochs to ρ = 0.8147 in 700 epochs, as shown in Figure 3c,d, using the same hyperparameters for ranking 5 labels.
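The exact tanh polynomial of Equation (1) did not survive extraction, so a minimal sketch is given below under the assumption that the SS function is a sum of n evenly spaced tanh steps inside the boundary b = 2n; this matches the surrounding description (sharp edges with c = 100, multivalued output from 0 to n) but is not the paper's exact formula.

```python
import math

def stairstep(x, n, c=100.0):
    # Assumed SS form: n evenly spaced tanh steps inside (-b, b), b = 2n.
    # Each tanh term contributes ~1 once x passes its step center, so the
    # output climbs from 0 (incomparable) up to n (top rank).
    b = 2.0 * n
    centers = [-b + 2.0 * b * (i + 0.5) / n for i in range(n)]
    return sum(0.5 * (1.0 + math.tanh(c * (x - t))) for t in centers)
```

With n = 4 (so b = 8), the sketch yields 0 at x = −8, 2 at x = 0, and 4 at x = 8, with nearly flat plateaus around each preference value, which is the behavior the text attributes to the SS function.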

Error Function
Two main error functions have been used to measure the quality of ranking: Kendall's τ [37] and Spearman's ρ [38]. This paper uses Spearman's ρ to train the PNN because Kendall's τ lacks continuity and differentiability. Spearman's ρ measures the relative ranking correlation between actual and target ranks, which is also more appropriate than the total squared error because a low squared error does not necessarily mean a high ranking correlation between labels. We do not use the absolute difference of the root mean square errors (RMSEs) because gradient descent on it may not decrease the ranking error; e.g., π_1 = (1, 2.1, 2.2) and π_2 = (1, 2.2, 2.1) have a low RMSE of 0.081 but low ranking correlations of ρ = 0.5 and τ = 0.3. We use the BP algorithm to train the PNN, thus maximizing Spearman's ρ in Equation (2), and its derivative is used as the stopping criterion for the learning process.
where y_i, yt_i, i and n represent the rank output value, expected rank value, label index, and number of instances, respectively.
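The numeric example above can be checked directly with the standard (no-ties) forms of the two correlations; these helpers are illustrative, not the paper's code.

```python
import math

def spearman_rho(a, b):
    # Spearman's rho for two rank vectors without ties:
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall_tau(a, b):
    # Kendall's tau: (concordant - discordant) / total number of pairs
    n = len(a)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            conc += s > 0
            disc += s < 0
    return (conc - disc) / (n * (n - 1) / 2)

# the example from the text: low RMS error but low rank correlation
pi1, pi2 = [1.0, 2.1, 2.2], [1.0, 2.2, 2.1]
rms = math.sqrt(sum((x - y) ** 2 for x, y in zip(pi1, pi2)) / len(pi1))
rho = spearman_rho([1, 2, 3], [1, 3, 2])   # rank positions of pi1 vs. pi2
tau = kendall_tau([1, 2, 3], [1, 3, 2])
```

Running this gives rms ≈ 0.08, ρ = 0.5 and τ = 1/3 ≈ 0.33, confirming the point in the text: the squared error is tiny while both rank correlations are low.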

One Middle Layer
The preference neural network (PNN) is a simple fully connected network with a single hidden layer that provides desirable ranking performance due to the SS activation function [39]. We performed experiments on 12 benchmark label ranking datasets [26] which show that increasing the number of hidden layers does not improve performance, but rather has adverse effects. This performance decline is due to the SS function's limited output variation, which reduces the degrees of freedom when solving more complex problems. As mentioned by Lippmann, R., three layers are sufficient to form arbitrarily complex decision regions [40]; however, this is based on conventional activation functions, which have more output variation than the SS function.
The PNN was also experimented with multiple hidden layers using benchmark data from the KEBI repository [26]. The results showed a decreasing ranking correlation with an increasing number of hidden layers, as shown in Figure 4.

Preference Neuron (PN)
A preference neuron (PN) is a neuron that has an SS activation function. A PN in the middle layer connects to only n output neurons (s = n + 1), where s is the number of steps and n is the number of output ranked labels. The middle and output PNs produce a preference value from 0 to ∞, as shown in Figure 5b, where the PN has n = 4. The number of output neurons is equal to the number of stair steps, as illustrated in the network architecture in Figure 5b. Although the neuron has one output value per epoch, Figure 5b shows n outputs connected to n neurons because the SS has n stair-step values, as presented in the network architecture in Figure 5a. The PNN ranks multi-labels by predicting the preference value for each output neuron, mapping the order to a relative ranking around integer values from 1 to ∞, with 0 mapped to the incomparable (⊥) or indifference (∼) relations. Each output neuron represents a label index, as shown in Figure 5; i.e., when L = {λ_a, λ_b, λ_c, λ_d} and π = (d b c a), the output neurons will be π = (4, 2, 3, 1), or approximation values that make ρ ≈ 1, i.e., π = (3.9, 1.8, 3.1, 0.9), due to the SS sharp edges. We use gradient ascent to maximize the Spearman ρ; a comparison with a conventional FF-ANN is shown in Table 1. The architecture simplifies the learning process by eliminating the looping over hidden layers. The FF, BP, and updating of weights (UW) are executed in two steps. Therefore, the batch weight-updating technique does not apply to the PNN architecture, and pattern update is used in one step [41]. The network bias is low due to the limited neuron output variation. The PNN was proposed for one group of label ranking; however, the architecture is not suited to ranking outputs of different lengths. To rank different group sizes, a different SS function per group is required, which is not provided by the PNN.
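The order-to-output encoding described above can be written as a few lines of code; `order_to_ranks` is a hypothetical helper name used only for illustration.

```python
def order_to_ranks(order, labels):
    # Map a preference order (best to worst) to per-label output values:
    # each output neuron holds the rank position of its own label.
    # Hypothetical helper for illustration; not from the paper's code.
    pos = {lab: i + 1 for i, lab in enumerate(order)}
    return [pos[lab] for lab in labels]
```

For the example in the text, `order_to_ranks(["d", "b", "c", "a"], ["a", "b", "c", "d"])` reproduces the target output neuron values (4, 2, 3, 1).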

SGPNN Architecture
This section describes the architecture of SGPNN and its functionality.

Multi Activation Function Neuron (MAFN)
The SGPNN introduces the multi-activation function neuron (MAFN) to address the architectural limitation of the PNN in ranking output layers of different lengths. The MAFN has the same number of inputs as a PN because the MAFNs share the same w_m weights with the input neurons, where w_m is the weight of the middle layer and y_in = ∑ a_i · w_i. A MAFN contains k activation functions ϕ and k learning rates lr, where k is the number of output layers. For example, Figure 6 shows a MAFN which has two ϕ, where each function has a single output; it is graphically represented by multiple #n output links because a PN connects only to n output neurons, where s = n + 1 and s is the number of ϕ steps. As shown in Figure 6, ϕ_1|n=4 and ϕ_2|n=3 of the MAFN are connected to 2 output groups of 4 and 3 neurons, respectively.

In a conventional ANN, the sufficient number of hidden neurons to achieve convergence is determined by the Cao and Mirchandani theorem [42]: in an n-dimensional space, the maximum number of regions M that are linearly separable using h hidden nodes is M(h, n) = ∑_{k=0}^{n} C(h, k). However, the SGPNN has multiple Euclidean n-spaces, one for each output layer. Therefore, m · n < k_mafn, where n is the dimension of the Euclidean space, m is the number of spaces (one per output layer), and k_mafn is the number of MAFNs.
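The Cao and Mirchandani bound referenced above is a standard hyperplane-counting result and can be computed directly:

```python
from math import comb

def max_regions(h, n):
    # Cao-Mirchandani bound: the maximum number of regions into which
    # h hidden-node hyperplanes can partition an n-dimensional space.
    # comb(h, k) is 0 for k > h, so the sum is safe for any h, n.
    return sum(comb(h, k) for k in range(n + 1))
```

For example, three hyperplanes (hidden nodes) in a 2-dimensional space can separate at most 1 + 3 + 3 = 7 regions, and four can separate at most 11.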

SGPNN Functionality
The SGPNN is designed to address the architectural shortcoming of PNNs, which are not extendable to ranking label groups separately. The SGPNN ranks output layers of different sizes while maintaining the single-middle-layer design. It has two types of neurons, PN and MAFN, which are used in the output and middle layers, respectively. The input layer represents one instance of data features. The middle layer has multiple MAFNs that use a separate learning rate and ϕ activation function for each output layer. The SGPNN is geometrically fully connected; however, FF, BP, and UW are functionally separated for each output layer's weights wo, as illustrated in Figure 7. The weights of the MAFN are updated by the summation of all the δ_m errors scaled by the learning rates, ∑_{i=1}^{k} (lr_i · δ_mi). Each output layer is a group of PNs that represent the ranked labels. The SGPNN scales up by increasing the number of MAFNs. Figure 8 illustrates an example of a three-subgroup architecture used for ranking the emotions dataset, where the first, second, and third groups have 3, 1, and 4 labels, respectively, to solve the problem π = (h p q) ⊥ (e) ⊥ (a b c d). The second subgroup has one label, e, that has three ranking values (1, 2, 3), which represent the preference relations (≻, ⊥, ≺) between the two other subgroups. The ranking learning process is executed in three steps: FF, BP, and UW. The learning stops after 20,000 epochs or when Spearman's ρ reaches 1. A video demo that shows the ranking learning process using simple toy data is available at [43].

Data Preparation and Learning Algorithm
This section describes data combination, the ranking unification preprocessing and SGPNN learning steps (FF, BP and UW).

Conjoint Data
The dataset is synthesized by concatenating the features and multiplying the data points for each subgroup, as shown in Equation (4).
where F_i is the number of features of dataset i, ns is the number of datasets, and D_i is the number of data instances of dataset i.
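Equation (4) itself did not survive extraction, so the following is a hedged sketch of one plausible reading of "concatenating the features and multiplying the data points": a cross product of the source datasets, so each combined record has the concatenated features and one label subgroup per source dataset.

```python
from itertools import product

def conjoint(datasets):
    # Hedged sketch of conjoint data synthesis; each dataset is a list of
    # (features, labels) pairs. We concatenate feature vectors across
    # datasets and take the cross product of their instances, so each
    # combined record carries one label subgroup per source dataset.
    combined = []
    for rows in product(*datasets):            # one row per source dataset
        feats = [f for feats_i, _ in rows for f in feats_i]
        labels = [lab for _, lab in rows]
        combined.append((feats, labels))
    return combined
```

Under this reading, the conjoint dataset has ∏ D_i instances, each with ∑ F_i features, matching the symbols defined above.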

Ranking Unification
We introduce a new method for creating label ranking ground truth by converting unrestricted rankings to restricted rankings, unifying the data instances and adding subgroups to the labels. The percentage of unique rankings is measured using Equation (5).
The number of subgroups is determined by the maximum number of repeated records using Equation (6), sg = Max(x_r), where sg is the number of subgroups and x_r is the number of duplicated data records. This paper applies Algorithm 1 to convert the data from non-restricted rankings with no ground truth to unique groups of label rankings by removing duplicated data instances and accumulating the corresponding labels in a subgroup. The algorithm removes the duplication and assigns the corresponding labels as a subgroup of one unique data record. For non-repeated records, the additional subgroups have values of zero.
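Since the listing of Algorithm 1 is not reproduced here, the following is a hedged sketch of the unification step as described: records with identical features are merged into one record whose label rankings become subgroups, with zero-padding for the missing subgroups.

```python
def unify_rankings(records):
    # Hedged sketch of Algorithm 1: merge records with identical features
    # into a single record whose label rankings become subgroups; records
    # with fewer repetitions get zero-valued padding subgroups.
    groups = {}
    for features, ranking in records:
        groups.setdefault(tuple(features), []).append(ranking)
    sg = max(len(r) for r in groups.values())      # Equation (6): sg = Max(x_r)
    unified = []
    for feats, rankings in groups.items():
        padding = [[0] * len(rankings[0]) for _ in range(sg - len(rankings))]
        unified.append((list(feats), rankings + padding))
    return unified, sg
```

For example, two records sharing the features (1, 2) with rankings (1, 2) and (2, 1) become one record with two subgroups, while a unique record gets a second subgroup of zeros.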

SGPNN Learning Steps
This section shows the FF, BP and UW processes in the middle and output layer of the SGPNN.

Middle Layer FF
The output of a single MAFN connected to subgroup j is shown in Equation (7), where g is the number of subgroups, wm_i is the weight of the middle layer of MAFN index i, x is the input value of the MAFN, d is the number of input features, and ϕ_j is the activation function of subgroup j.

Output Layer FF
The output of a single neuron in subgroup j is shown in Equation (8), where m is the number of MAFNs connected to subgroup j and wo_ij is the weight of the output layer of subgroup j and MAFN index i.
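Equations (7) and (8) were lost in extraction, so the forward pass below is a hedged sketch that follows the surrounding description: shared middle-layer weights, a subgroup-specific SS function ϕ_j in each MAFN, and per-subgroup output weights. The `stairstep` form is the same assumed approximation used earlier, repeated here so the sketch is self-contained.

```python
import math

def stairstep(x, n, c=100.0):
    # assumed SS form: n evenly spaced tanh steps inside (-b, b), b = 2n
    b = 2.0 * n
    centers = [-b + 2.0 * b * (i + 0.5) / n for i in range(n)]
    return sum(0.5 * (1.0 + math.tanh(c * (x - t))) for t in centers)

def sgpnn_forward(x, wm, wo, group_sizes):
    # Hedged sketch of the SGPNN forward pass (FF), not the paper's code.
    # wm[i]        : input weights of MAFN i (shared across subgroups)
    # wo[j][k][i]  : weight from MAFN i to output neuron k of subgroup j
    y_in = [sum(a * w for a, w in zip(x, row)) for row in wm]  # weighted sums
    outputs = []
    for j, n in enumerate(group_sizes):
        y_m = [stairstep(y, n) for y in y_in]     # subgroup-specific phi_j
        outputs.append([stairstep(sum(y * wo[j][k][i] for i, y in enumerate(y_m)), n)
                        for k in range(n)])
    return outputs
```

Each subgroup j thus receives its own ranking in the range 0 to n_j from the same shared middle layer, which is the core of the MAFN design.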

Output Layer BP
The output error δ_oj of a single output neuron of subgroup j is given in Equation (9), where Error is the derivative of the Spearman correlation and the activation function.
ϕ_j is the SS function of subgroup j from Equation (1).
where δ_oj is the error of the output neuron and n is the number of labels in subgroup j.
The δ_oj in Equation (11) is obtained by differentiating Equation (10) and substituting the result into Equation (9).

Middle Layer BP
The middle-layer error δ_m is calculated in Equation (13).

Output Layer UW
The process of updating the weights using gradient ascent with the sums of δ_o is shown in Equation (15), where lr_j is the learning rate of subgroup j and y_ij is the input multiplied by wo from middle-layer MAFN index i to subgroup j.
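Since Equation (15) is not reproduced in the text, the update below is a hedged sketch of the described step; the key point it illustrates is that, as gradient *ascent* maximizing Spearman's ρ, the scaled deltas are added to the weights rather than subtracted.

```python
def update_output_weights(wo_j, y_m, delta_o_j, lr_j):
    # Hedged sketch of the output-layer update for subgroup j.
    # wo_j[k][i]  : weight from MAFN i to output neuron k of subgroup j
    # y_m[i]      : output of MAFN i for the current instance
    # delta_o_j[k]: error delta of output neuron k
    for k, delta in enumerate(delta_o_j):
        for i, y in enumerate(y_m):
            wo_j[k][i] += lr_j * delta * y   # '+' because we maximize rho
    return wo_j
```

The middle-layer update of Equation (16) follows the same pattern, with each MAFN accumulating the learning-rate-scaled deltas from all of its subgroups, ∑ (lr_i · δ_mi).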

Middle Layer UW
Updating the weights of the middle layer is shown in Equation (16), where y_i is the input multiplied by wm from input neuron index i.

Dropout Regularization
We apply dropout as a regularization approach with 50% probability to enhance the SGPNN validation performance and reduce over-fitting. The process assigns a random number from −0.9 to 0.9 to each weight and stops using the weights whose random value is less than 0.5 in that iteration, for both wo and wm.

Datasets
The SGPNN is experimented on both real-world and semi-synthesized (s-s)/conjoint datasets. The real data have multi-label subgroups for one set of features, e.g., restaurant-food-services. The s-s data are collected from different domains. Features from the same domain have small variations; e.g., the German elections dataset is an example of a relevant subgroup where features are collected from the same context. We examined the data uncertainty by measuring the percentage of unique multi-label rankings, U_π. Given that d is the amount of data, the description is presented in Table 2. The restaurant-food-services dataset is built from actual food quality and customer service reviews from the recommender systems domain [30] and contains multi-label subgroups. The features of this dataset are customer profiles and geographical location. The two subgroups are food quality and customer service, and each subgroup has 130 multi-labels, representing the number of restaurants. To simplify the calculation, we use parts of the data containing 5, 10, and 20 restaurants for the two groups in three small datasets and select the corresponding feature records of the users' profiles who rated these restaurants.

German Election in 2005 and 2009
The german-2005/9 is an s-s conjoint dataset built from two real datasets based on the German elections in 2005 and 2009 [31,32]. The multi-labels of the two datasets are grouped into two label subgroups. However, the 2009 features are used to rank both the 2005 and 2009 labels because the 2009 features include historical data and user profiles from the 2005 election.

Emotions
The emotions dataset is used for subgroup preference relations (≻, ∼, ≺). The original emotions dataset is used to detect six types of emotions based on listening to different types of music, where each piece of music belongs to one or many emotion types. The original dataset has six classes (amazed/surprised, happy/pleased, relaxing/calm, quiet/still, sad/lonely, angry/fearful). The data are modified by creating two subgroups: music reflects both positive feelings (amazed-surprised, happy-pleased, relaxing-calm, quiet-still) and negative feelings (sad-lonely, angry-fearful) [44]. Table 3 shows the heuristic rules applied for the preference relation between the positive and negative feeling subgroups based on the subgroup labels' ranking; the ranking of the sub-labels takes values from 1 to 3.

Irrelevant Subgroups Data
We create a new hypothetical conjoint dataset from three different domains (biology, chemistry, and trades) for preference mining analysis to study data similarity and measure the SGPNN performance against other ranking approaches. The conjoint data are collected from well-known benchmark multi-label ranking datasets from different domains, specifically iris, wine, and stock [26], to compare the performance of these data as subgroups with previous approaches that experimented with those datasets as single problems.

Label Ranking Benchmark Dataset
The sushi dataset [33,34] is a multi-label dataset that has an unrestricted multi-label ranking, as some identical data features have different multi-label rankings. The unrestricted ranking is converted into a restricted subgroup of multi-labels for each instance of the data by removing the duplicated features and assigning the labels of each repeated instance as a subgroup of a unique feature. Creating unique instances reduces the number of instances from 5000 to 4825. The maximum number of repeated instances is three, which means that the dataset has three subgroups. The instances that do not have a second or third subgroup are padded with zero values.

Results
For the experiments, the datasets are divided randomly in the ratio 80:20, with 80% for training and validation and the remaining 20% for testing. Five-fold cross-validation is further adopted on the 80% used for training and validation to reduce the variance due to creating the data from different sources. We use a sequential search, saving the best hyperparameters after five-fold cross-validation. The hyperparameters are the scale factor from −b to b, where b is the SS boundary value, the learning rate, and the number of iterations (1000 epochs). The validation is reduced to two-fold cross-validation for unrelated data, i.e., wine-iris-stock, to reduce the variance. This configuration is used for evaluating both the PNN and the SGPNN. Table 4 shows the testing results of the models after 5000 epochs. We compare the single-ranking PNN and the SGPNN with other multi-label ranking methods on the iris-wine-stock dataset in terms of Kendall's τ in Table 5. The SGPNN results are the ranking of each dataset as a subgroup with the other two datasets.

Non-Relevant Subgroup Data
The training results for the conjoint iris, wine, and stock data are illustrated in Figure 9b, comparing the SGPNN to ranking them separately using the PNN, in addition to the state-of-the-art methods on the testing data, as shown in Table 5. It is noticed that the SGPNN outperforms the other label ranking methods: supervised clustering [27], the supervised decision tree [26], multilayer perceptron label ranking [29], and label ranking tree forest (LRF) [45] ranking iris, wine, and stock, respectively. Ranking the three datasets together (wine-iris-stock) gives a higher ranking than ranking every two datasets (wine-iris), (iris-stock), or (wine-stock) using the same hyperparameters, as shown in Table 4.

Ranking Enhancement
The results show that learning the labels as subgroups from a relevant domain enhances each group's ranking compared to ranking them separately. This enhancement in ranking is largely due to sharing the network weights between two or more problems. The shared weights accelerate the convergence, similar to reinforcement learning. This paper proposes a novel learning method to rank multi-label subgroups to support SD analysis. This approach is part of the broader sphere of reinforcement learning: learning from multiple data sources to build a conjoint unified learning model. The computation time may increase with the number of subgroups and with higher rank accuracy; however, the SGPNN delivers a unified ranking model with a higher convergence rate and high testing accuracy.

Convergence Fluctuation
The wine-stock and iris-stock datasets take longer to converge due to data separability and complexity; thus, convergence for each group of labels is not linear. This non-linearity creates more fluctuations than ranking a single label group. These fluctuations are not related to the gradient error in ranking but to the averaging of the ranking between two subgroups: as each subgroup tends to increase its ranking, it updates its weights, which affects the shared weights and may reduce the convergence of the second group. The fluctuation is shown in the video of the convergence of two groups using a toy dataset [43]. The convergence fluctuations are not noticed when we use three subgroups together, i.e., the iris-wine-stock dataset, using the same hyperparameters as the two-subgroup SGPNN.

Potential Applications
The SGPNN could be used in many potential applications, e.g., brain-computer interface (BCI) applications, where EEG data may be ambiguous, complicated, and unbalanced. Another medical application is where fused data are collected from different sensors, e.g., the study of human emotion recognition. The SGPNN could also be part of an expert system to build accumulated learning models for judgment, elections, and medical diagnosis from different conjoint historical data.

Conclusions and Future Works
The SGPNN is a new step in preference learning to predict the subgroups from conjoint data by proposing a simple three-layer FF network that has different outputs to build a conjoint model from different groups of data. This paper introduces a simple network with one middle layer and a new activation function to speed up learning to rank using the new Spearman objective function. This paper also introduces the novel MAFN to serve more than one group of labels. In addition, creating conjoint data from multiple datasets reinforces the learning to rank and enhances accuracy. The proposed network with one middle layer simplifies the process of FF, BP and UW into three steps for the middle and output layers compared to a conventional ANN.
Future work on the SGPNN includes coupling it with different SD methodologies to rank the discovered subgroups. The data used in the experiments are relatively small; thus, the SGPNN opens a road to developing a deep learning network based on the MAFN, PNN, Spearman error function, and SS function to accelerate learning and build more complicated conjoint models. The SGPNN can be integrated with SD to study the relations, similarity, and separability of different domains to obtain a shared learning model.

Data Availability Statement:
The data presented in this study are openly available in [43].