Distributed Support Vector Ordinal Regression over Networks

Ordinal regression methods are widely used to predict the ordered labels of data, among which support vector ordinal regression (SVOR) methods are popular because of their good generalization. In many realistic circumstances, data are collected by a distributed network. To protect privacy, or because of practical constraints, the data cannot be transmitted to a center for processing. However, as far as we know, existing SVOR methods are all centralized. In such situations, centralized methods are inapplicable, and distributed methods are more suitable choices. In this paper, we propose a distributed SVOR (dSVOR) algorithm. First, we formulate a constrained optimization problem for SVOR in distributed circumstances. Since this problem is difficult to solve with classical methods, we use a random approximation method and the hinge loss function to transform it into a convex optimization problem with constraints. Then, we propose a subgradient-based algorithm, dSVOR, to solve it. To illustrate its effectiveness, we theoretically analyze the consensus and convergence of the proposed method, and conduct experiments on both synthetic data and a real-world example. The experimental results show that the proposed dSVOR achieves performance close to that of the corresponding centralized method, which needs all the data to be collected together.


Introduction
Many real-world data labels have natural orders that are usually called ordinal labels. For example, fault severity in industrial processes is usually divided into {harmless, slight, medium, severe}. Ordinal regression, which aims at predicting ordinal labels for given patterns, has attracted a great deal of research in many fields, such as disease severity assessment [1], satisfaction evaluation [2], wind-speed prediction [3], age estimation [4], credit-rating prediction [5], and fault severity diagnosis [6]. Although classical classification and regression methods can be applied to the ordinal regression problem [7,8], they require additional prior information about the distances between labels. Otherwise, they often perform unsatisfactorily since they cannot fully use ordering information [9,10].
To tackle the aforementioned problems of classical classification and regression methods, many ordinal regression methods were proposed [10]. Among them, the most popular type of approaches are threshold models, which assume that a continuous latent variable underlies the ordinal response [10]. In threshold models, the order of the labels is represented by a set of ordered thresholds. These ordered thresholds define a series of intervals, and the data label depends on the interval the corresponding latent variable falls into. Among the threshold models, support vector ordinal regression (SVOR) [11,12] is widely used because of good generalization performance. A representative work is the support vector ordinal regression with implicit constraints (SVORIM) proposed in [11,12]. This determines each threshold by taking all the samples into consideration, where the threshold inequality constraints can be satisfied without explicit constraints.
Most of the existing ordinal regression methods have been developed in a centralized framework. However, in practice, data used for ordinal regression may be distributed in a network [13]. Each node of the network collects and stores part of the data, and it is not enough for a single node to train a model with good performance. For instance, in industrial processes, sensors are often used in factories to monitor the operating status of equipment and diagnose fault severity. Due to the rarity of faults, a single sensor can only collect very few data, and the faults encountered by each factory may also be different. To train a proper model, we need to use as many data as possible. However, in some realistic scenarios, it is difficult for data to be transmitted to a central node for various reasons [13]. For example, factories may not want to leak data regarding their equipment in order to protect privacy. Moreover, if the data are collected by image sensors or video sensors, it may be difficult for a single machine to store and process such a large amount of data. In such situations, centralized methods are inapplicable, and distributed methods are more suitable choices.
In this paper, we propose a distributed support vector ordinal regression algorithm based on the SVORIM method to deal with more complex nonlinear problems in distributed ordinal regression. First, we formulate a constrained optimization problem for SVORIM in distributed scenarios. Classical methods usually solve the problem by transforming it into the dual problem. In distributed circumstances where the original data cannot be transmitted to others, it is difficult for classical methods to calculate the kernel function values and optimize the dual variables, because both require data from different nodes. To overcome these difficulties, we adopt a random approximation method and the hinge loss function to transform the optimization problem. Increasing the number of random approximation dimensions can improve the approximation accuracy, but brings redundancy. To find an appropriate number of approximation dimensions, we further add a sparse regularization term on the approximation dimensions to the objective function. Through the above steps, we transform the original problem into a convex optimization problem with consensus constraints. Then, to solve the problem, we propose a subgradient-based algorithm called distributed SVOR (dSVOR), in which each node only uses its own data and the parameter estimates exchanged with its neighbors. To verify the effectiveness of dSVOR, we theoretically analyze its consensus and convergence, and conduct experiments on synthetic data and a real-world example. The experimental results show that the proposed distributed algorithm, despite its additional constraints, achieves performance close to that of the corresponding centralized method, which needs all the data to be collected at a central node.
The main contributions of this paper are summarized as follows.

1. Existing work on distributed ordinal regression [14] uses a linear model; therefore, it cannot deal with linearly inseparable data. We extend the SVOR method to distributed scenarios to solve distributed ordinal regression problems with linearly inseparable data.
2. We develop a decentralized implementation of SVOR and propose the dSVOR algorithm. In the proposed algorithm, the kernel feature map is approximated by random feature maps to avoid transmitting the original data, and sparse regularization is added to avoid excessively high approximation dimensions.
3. The consensus and convergence of the proposed algorithm are theoretically analyzed.

The rest of this paper is organized as follows. In Section 2, we introduce related works. The ordinal regression problem and the SVORIM method are introduced in Section 3 as preliminary knowledge. In Section 4, we formulate the distributed support vector ordinal regression problem, propose the dSVOR algorithm, and analyze it theoretically. Experiments evaluating the effectiveness of the proposed algorithm are presented in Section 5. Lastly, in Section 6, we draw some conclusions.

Related Works
Ordinal Regression Methods. Many ordinal regression methods have been proposed to solve ordinal regression problems. The ordered logit model [15,16] makes assumptions about the distribution of the prediction error of the latent variable, and uses the cumulative distribution function to build the label cumulative probability function. The support vector ordinal regression (SVOR) [11,12] maximizes margins between two adjacent labels. Variants of SVOR with nonparallel hyperplanes were discussed in [17,18]. There are also ordinal regression methods that solve ordinal regression problems by solving a series of binary classification subproblems. In [4,19], extended labels were extracted from the original ordinal labels to learn a binary classifier (such as support vector machine [19] or logistic regression [4]); then, a ranking rule was constructed from the binary classifier to predict ordinal labels. In [20], the authors used the stick-breaking process to construct a series of binary classification subproblems to guarantee that the cumulative probabilities were monotonically decreasing. However, the above ordinal regression methods are all centralized and are infeasible in distributed scenarios.
Distributed methods. Distributed methods were extensively studied in many fields, such as distributed estimation [21,22], distributed optimization [23,24], distributed clustering [25], distributed Kalman filter [26], and distributed anomaly detection [27]. However, as far as we know, there are few works investigating distributed ordinal regression [14]. In [14], the authors proposed a distributed generalized ordered logit model, which is a linear model and therefore cannot handle complex problems.

Ordinal Regression Problem
The classification problem aims at classifying the K-dimensional input vector $x \in \mathcal{X} \subseteq \mathbb{R}^K$ into one of Q discrete categories $y \in \mathcal{Y} = \{C_1, C_2, \ldots, C_Q\}$. The ordinal regression problem is a type of classification problem in which the data labels have a natural order $C_1 \prec C_2 \prec \cdots \prec C_Q$, where $\prec$ is an order relation [10]. The purpose of ordinal regression is to find a mapping function $f: \mathcal{X} \to \mathcal{Y}$ to predict the ordinal labels for new patterns given a training set of N samples $D = \{(x_i, y_i),\ i = 1, \ldots, N\}$.

Support Vector Ordinal Regression with Implicit Constraints
Let $\phi(x)$ denote the feature vector in a high-dimensional reproducing kernel Hilbert space (RKHS) of input vector x. The inner product in the RKHS is defined by the reproducing kernel function:
$$K(x, x') = \phi(x) \cdot \phi(x').$$
Support vector machines construct a discriminant hyperplane in the RKHS by maximizing the distance between support vectors and the discriminant hyperplane. The discriminant hyperplane is defined by an optimal direction w and a single optimal threshold b. It divides the feature space into two regions for two classes.
The support vector ordinal regression constructs Q − 1 parallel discriminant hyperplanes for Q ordinal labels, where these hyperplanes are defined by optimal direction w and Q − 1 thresholds $\{b_q\}_{q=1,\ldots,Q-1}$. The ordinal information in the labels is represented by the order of these thresholds, $b_1 \le b_2 \le \cdots \le b_{Q-1}$, and the vector $b = [b_1, b_2, \ldots, b_{Q-1}]^T$ is used to denote them.
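In [11,12], the SVORIM method determines a threshold $b_q$ by utilizing the samples of all the labels. Let $x_i^p$ denote the i-th sample of $C_p$ and $N^p$ the number of samples of $C_p$. For threshold $b_q$, each sample belonging to $C_p$, $\forall p \le q$, should have a function value less than $b_q - 1$; otherwise, $\xi_{pi}^{q} = w \cdot \phi(x_i^p) - (b_q - 1)$ is the empirical error of $x_i^p$ for $b_q$. Similarly, each sample belonging to $C_p$, $\forall p > q$, should have a function value greater than $b_q + 1$; otherwise, $\xi_{pi}^{*q} = (b_q + 1) - w \cdot \phi(x_i^p)$ is the empirical error of $x_i^p$ for $b_q$. As proved in [11,12], this approach has the property that the threshold inequalities can be automatically satisfied after convergence without explicitly including the corresponding constraints. This method is called support vector ordinal regression with implicit constraints and is formulated as follows:
$$
\begin{aligned}
\min_{w, b, \xi, \xi^*} \quad & \frac{1}{2} w \cdot w + C \sum_{q=1}^{Q-1} \left( \sum_{p=1}^{q} \sum_{i=1}^{N^p} \xi_{pi}^{q} + \sum_{p=q+1}^{Q} \sum_{i=1}^{N^p} \xi_{pi}^{*q} \right) \\
\text{s.t.} \quad & w \cdot \phi(x_i^p) - b_q \le -1 + \xi_{pi}^{q}, \quad \xi_{pi}^{q} \ge 0, \quad p = 1, \ldots, q, \ i = 1, \ldots, N^p, \\
& w \cdot \phi(x_i^p) - b_q \ge 1 - \xi_{pi}^{*q}, \quad \xi_{pi}^{*q} \ge 0, \quad p = q+1, \ldots, Q, \ i = 1, \ldots, N^p, \\
& q = 1, \ldots, Q-1,
\end{aligned}
\tag{1}
$$
where C is a predefined positive constant. The above problem can be solved by solving the dual problem, which can be derived with standard Lagrangian techniques. Let $\beta_{pi}^{q} \ge 0$, $\gamma_{pi}^{q} \ge 0$, $\beta_{pi}^{*q} \ge 0$, and $\gamma_{pi}^{*q} \ge 0$ be the Lagrangian multipliers for the constraints in the above equation. The dual problem is the following maximization problem [11,12]:
$$
\begin{aligned}
\max_{\beta, \beta^*} \quad & \sum_{p,i} \left( \sum_{q=p}^{Q-1} \beta_{pi}^{q} + \sum_{q=1}^{p-1} \beta_{pi}^{*q} \right) - \frac{1}{2} \sum_{p,i} \sum_{p',i'} \left( \sum_{q=1}^{p-1} \beta_{pi}^{*q} - \sum_{q=p}^{Q-1} \beta_{pi}^{q} \right) \left( \sum_{q=1}^{p'-1} \beta_{p'i'}^{*q} - \sum_{q=p'}^{Q-1} \beta_{p'i'}^{q} \right) K(x_i^p, x_{i'}^{p'}) \\
\text{s.t.} \quad & \sum_{p=1}^{q} \sum_{i=1}^{N^p} \beta_{pi}^{q} = \sum_{p=q+1}^{Q} \sum_{i=1}^{N^p} \beta_{pi}^{*q}, \quad q = 1, \ldots, Q-1, \\
& 0 \le \beta_{pi}^{q} \le C, \quad p \le q, \qquad 0 \le \beta_{pi}^{*q} \le C, \quad p > q.
\end{aligned}
\tag{2}
$$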
For a new pattern x, SVORIM calculates the function value w · φ(x) and then decides its category according to the interval the function value falls into, where the intervals are defined by thresholds {b q } q=1,...,Q−1 .
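The prediction rule above can be illustrated with a minimal Python sketch. The function names `decision_value` and `predict_rank` are hypothetical, the feature vector is assumed to be already mapped, and assigning a score that equals a threshold to the upper interval is an assumption about tie handling, not something the text specifies.

```python
from bisect import bisect_right


def decision_value(w, z):
    """Function value w . z for an already-mapped feature vector z."""
    return sum(w_j * z_j for w_j, z_j in zip(w, z))


def predict_rank(score, thresholds):
    """Return the 1-based rank q of the interval that `score` falls into.

    `thresholds` is the ordered list [b_1, ..., b_{Q-1}].
    Assumption: a score exactly equal to b_q goes to the upper interval.
    """
    # bisect_right counts how many thresholds are <= score.
    return bisect_right(thresholds, score) + 1


thresholds = [-1.0, 0.5, 2.0]           # Q = 4 ordinal labels
score = decision_value([1.0, -2.0], [0.4, 0.2])
print(predict_rank(score, thresholds))  # rank of the interval containing score
```

With Q − 1 ordered thresholds, a binary search is enough, so prediction costs O(log Q) once the function value is known.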

Network and Data Model
In this paper, we consider a network consisting of M nodes, represented by a graph $\mathcal{G} = (\mathcal{M}, \mathcal{E})$ with a set of nodes $\mathcal{M} = \{1, 2, \ldots, M\}$ and a set of edges $\mathcal{E}$. Each edge $(m, n) \in \mathcal{E}$ connects a pair of distinct nodes. We use $\mathcal{N}_m = \{n \mid (m, n) \in \mathcal{E}\}$ to represent the set of neighbors of node $m \in \mathcal{M}$.
Data used for ordinal regression are collected and stored in a distributed manner by the M nodes of this network. The i-th sample of node m is represented as $(x_{m,i}, y_{m,i})$, where $x_{m,i} \in \mathcal{X}$ and $y_{m,i} \in \mathcal{Y}$. More specifically, at node m, the total number of samples is $N_m$ (not to be confused with the neighbor set $\mathcal{N}_m$), the number of samples that belong to $C_q$ is $N_m^q$, and the i-th sample of $C_q$ is denoted as $(x_{m,i}^q, y_{m,i}^q)$. Figure 1 shows a schematic of a distributed network. In distributed networks, due to limited storage, computation, and communication resources and the need for privacy protection, node m can only transmit some parameters $\theta_m$, instead of the original data, to its neighbor nodes in $\mathcal{N}_m$, and perform local computation using only its own data $\{(x_{m,i}, y_{m,i})\}_{1 \le i \le N_m}$ and the parameters exchanged with its neighbors. Each node should eventually obtain a model in consensus with those obtained by the other nodes, and the performance of the model should be close to that of the model trained using all the data.

Problem Formulation
In centralized SVOR, the objective is to find an optimal direction w and a vector b.
If the data from all the nodes of the distributed network can be collected together, then parameters θ = {w, b} can be obtained by solving Problem (1).
In distributed situations, data are not allowed to be transmitted to a central node. Each node can only use its own data and some parameters from its neighbors. In this case, each node m has a local estimate θ m of θ. With a connected network, we imposed constraints θ m = θ n , ∀(m, n) ∈ E to ensure the consensus of {θ m } m=1,...,M . Then, the corresponding optimization problem in distributed scenarios can be written as follows: where ξ q m,pi is the empirical error of x p m,i for b m,q when p = 1, . . . , q and ξ * q m,pi is the empirical error of x p m,i for b m,q when p = q + 1, . . . , Q. With the help of the consensus constraints, this problem is equivalent to Problem (1).
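For reference, one way to write a distributed problem that matches the description above is sketched below. The even split of the regularizer across nodes (the factor 1/(2M)) is an assumption made here so that the objective reduces to that of Problem (1) when all local estimates agree; the original formulation may scale the terms differently.

```latex
\begin{aligned}
\min_{\{w_m, b_m, \xi_m, \xi_m^*\}} \quad
  & \sum_{m=1}^{M} \left[ \frac{1}{2M}\, w_m \cdot w_m
    + C \sum_{q=1}^{Q-1} \left( \sum_{p=1}^{q} \sum_{i=1}^{N_m^p} \xi_{m,pi}^{q}
    + \sum_{p=q+1}^{Q} \sum_{i=1}^{N_m^p} \xi_{m,pi}^{*q} \right) \right] \\
\text{s.t.} \quad
  & w_m \cdot \phi(x_{m,i}^p) - b_{m,q} \le -1 + \xi_{m,pi}^{q}, \quad
    \xi_{m,pi}^{q} \ge 0, \quad p = 1, \ldots, q, \\
  & w_m \cdot \phi(x_{m,i}^p) - b_{m,q} \ge 1 - \xi_{m,pi}^{*q}, \quad
    \xi_{m,pi}^{*q} \ge 0, \quad p = q+1, \ldots, Q, \\
  & \theta_m = \theta_n, \quad \forall (m, n) \in \mathcal{E},
\end{aligned}
```

Under the consensus constraints on a connected network, every $w_m$ equals a common w and every $b_m$ a common b, so the sum of the local regularizers is $\frac{1}{2} w \cdot w$ and the sum of the local errors runs over all samples in the network.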

Problem Transformation
In classical solutions, a primal problem is solved by solving the corresponding dual problem. Applying such methods to Distributed Problem (3) is confronted with two major difficulties:

1. For nonlinear kernel functions, the dimension of the RKHS is unknown, and we can only calculate the inner product of $\phi(x_{m,i})$ and $\phi(x_{n,j})$ rather than the feature vectors themselves. Because the data are distributed in various nodes of the network, the kernel function $K(x_{m,i}, x_{n,j})$, which requires data from different nodes, is difficult to calculate without transmitting the original data.
2. The dual variables of samples should satisfy the constraints in (2). In the distributed scenarios, the dual variables of the first constraint in (2) are usually from different nodes. Since each node is only allowed to exchange information with its neighbors, it is difficult to optimize these dual variables.

To overcome the first difficulty, we use a random approximate function [28] $z: \mathbb{R}^K \to \mathbb{R}^D$, where D > K, to map the data to a D-dimensional space instead of the RKHS. In this study, for the Gaussian kernel function
$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right),$$
the random approximate function is
$$z(x) = \sqrt{\frac{2}{D}} \left[\cos(\omega_1^T x + \psi_1), \ldots, \cos(\omega_D^T x + \psi_D)\right]^T,$$
where $\psi_i$ is drawn uniformly from [0, 2π], and $\omega_i$ is drawn from the Fourier transform of the Gaussian kernel function, i.e., $\omega_i \sim \mathcal{N}(0, \sigma^{-2} I_K)$. As proved in [28], if the dimension D is large enough, $z(x)^T z(x')$ can approximate $K(x, x')$ well, and z(x) can approximate $\phi(x)$ well. According to Cover's theorem [29], a complex pattern-classification problem nonlinearly cast in a high-dimensional space is more likely to be linearly separable than it is in a low-dimensional space. Therefore, to ensure good performance, we should set a relatively large D. For other shift-invariant kernels, such as Laplacian and Cauchy, the authors in [28] provided corresponding finite-dimensional random approximate functions. For additive homogeneous kernels, such as Hellinger's, $\chi^2$, intersection, and Jensen-Shannon, the authors in [30] also provided efficient finite-dimensional approximate mapping functions. For a linear kernel function, random approximation is not necessary, so we define
$$z(x) = x.$$
With the random approximation, mapping function $\phi(x)$ in (3) is replaced by z(x). The calculation of z(x) only requires one data point from a single node, instead of a pair of data points from different nodes as the kernel function does, so the first difficulty is solved.
After the random approximation is performed, the data are mapped into a D-dimensional feature space instead of the RKHS with unknown dimension. Thus, we could directly solve the primal problem instead of the dual problem, which automatically tackles the second difficulty.
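The random approximation can be sketched in a few lines of standard-library Python. This is only an illustration of random Fourier features for a Gaussian kernel of the form exp(−‖x − x'‖²/(2σ²)); the names `make_random_feature_map` and `gaussian_kernel` are hypothetical, and the shared `seed` is an assumption: every node would need the same draws of ω and ψ so that the local maps coincide without exchanging data.

```python
import math
import random


def make_random_feature_map(input_dim, num_features, sigma, seed=0):
    """Build z: R^K -> R^D with z(x).z(y) ~ exp(-||x - y||^2 / (2 sigma^2)).

    Assumption: all nodes call this with the same seed, so they share
    the same frequencies and phases.
    """
    rng = random.Random(seed)
    # Frequencies from N(0, sigma^-2 I), phases uniform on [0, 2*pi].
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(input_dim)]
              for _ in range(num_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    scale = math.sqrt(2.0 / num_features)

    def z(x):
        return [scale * math.cos(sum(o * v for o, v in zip(omega, x)) + psi)
                for omega, psi in zip(omegas, phases)]

    return z


def gaussian_kernel(x, y, sigma):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


z = make_random_feature_map(input_dim=2, num_features=2000, sigma=1.0)
x, y = [0.3, -0.2], [0.5, 0.4]
approx = sum(a * b for a, b in zip(z(x), z(y)))
print(approx, gaussian_kernel(x, y, 1.0))  # the two values should be close
```

Each call to `z` touches a single data point, which is the property the text relies on: no pair of samples from different nodes is ever needed.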
With the use of hinge loss function L(x) = max(1 − x, 0) [31], the problem can be rewritten as follows:

Sparse Regularization
In the above steps, a D-dimensional random approximate function z(x) is used to approximate the unknown mapping function φ(x). In general, a large D can lead to small approximation error and good classification performance. However, an overlarge D may cause redundancy, which wastes storage space, and brings high computational complexity and high communication costs. There is a trade-off between the above two aspects, so we added a sparse regularization term. The regularization term pushes some dimensions of w m to 0, which means that these dimensions are redundant and can be discarded. When some dimensions of w m converge to 0, these dimensions do not need to be calculated, stored and transmitted.
The l 0 -norm is typically used to measure sparsity. However, it is nonconvex, and l 0 -norm-based problems are NP-hard. In practice, we can use the l 1 -norm as a convex approximation of the l 0 -norm. Introducing the l 1 -norm into the objective function in (7), we obtain where α ∈ [0, 1] controls the proportion of the l 1 -norm sparsity regularization term in the entire regularization term. A larger α can lead to a sparser solution of w m . Therefore, since we set a relatively large D to ensure good performance, we could set a relatively large α to reduce redundancy. We could view this problem from another perspective. If the last two terms in (8) are regarded to be the objective function, the first two terms combined together can be seen as a similar penalty to the elastic net penalty in [32], where α measures the weight of the l 1 -norm penalty term.
After the above steps, we transformed Problem (3) into a convex optimization problem with consensus constraints (8).
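To make the structure of the transformed problem concrete, the sketch below evaluates a local cost for one node: an elastic-net-like regularizer plus the hinge losses of the node's own samples against every threshold. The function name `local_objective` and the scaling of the two regularization terms ((1 − α)/(2M) for the squared l2-norm and α/M for the l1-norm) are assumptions chosen for illustration; the source equations may use different constants.

```python
def hinge(x):
    """Hinge loss L(x) = max(1 - x, 0)."""
    return max(1.0 - x, 0.0)


def local_objective(w, b, samples, C, alpha, num_nodes):
    """Local cost of one node (assumed scaling, see lead-in).

    w: weight vector over the D random features.
    b: ordered thresholds [b_1, ..., b_{Q-1}].
    samples: list of (z, rank) with z the mapped feature vector and
             rank in 1..Q the ordinal label position.
    """
    l2_sq = sum(v * v for v in w)
    l1 = sum(abs(v) for v in w)
    reg = ((1.0 - alpha) * 0.5 * l2_sq + alpha * l1) / num_nodes

    loss = 0.0
    for z, rank in samples:
        score = sum(w_j * z_j for w_j, z_j in zip(w, z))
        for q, b_q in enumerate(b, start=1):
            if rank <= q:
                loss += hinge(b_q - score)   # want score <= b_q - 1
            else:
                loss += hinge(score - b_q)   # want score >= b_q + 1
    return reg + C * loss


samples = [([2.0, 0.0], 2), ([-0.5, 0.0], 1)]
print(local_objective([1.0, 0.0], [0.0], samples, C=1.0, alpha=0.5, num_nodes=1))
```

Setting `alpha=0` leaves only the squared l2-norm regularizer, and `alpha=1` leaves only the l1-norm regularizer, which mirrors the role the text assigns to α.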

Distributed SVOR Algorithm
In this subsection, we propose the dSVOR algorithm to solve Problem (8). First, we used the following notation for convenience which is a convex function. The calculation of J m (θ m ) does not need the data and estimated parameters from other nodes. Then, Problem (8) can be rewritten as follows: To deal with consensus constraints θ m = θ n , ∀(m, n) ∈ E , we adopted the penalty function method. The penalty function used in this paper is θ m − θ n 2 , and the corresponding positive penalty coefficient is λ mn . Then, the optimization problem becomes The larger the λ mn is, the closer the solutions of Problems (11) and (10) are.
We then applied the subgradient method to optimize Problem (11). For the hinge loss function L(x) = max(1 − x, 0), we adopted the following subgradient: and for the l 1 -norm, we adopted At step k + 1, the iterative equation is where η k is the step size in step k + 1, which is positive. The specific subgradients are In the subgradient method, in order to converge to the optimal solution, step size η k should satisfy [33] +∞ ∑ k=0 η k = +∞, and We can rearrange Iterative Equation (14) as follows.
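The subgradient computation described above can be sketched as follows. The choices made at the kinks (0 for the hinge loss at x = 1 and 0 for the l1-norm at 0) are common conventions assumed here, and the regularizer scaling ((1 − α)/(2M) and α/M) is the same illustrative assumption as in a local cost of the form regularizer + C × hinge losses; the function names are hypothetical.

```python
def hinge_subgrad(x):
    """A subgradient of L(x) = max(1 - x, 0); the value at x == 1 is a choice."""
    return -1.0 if x < 1.0 else 0.0


def sign(v):
    """A subgradient of |v|; 0 is chosen at v == 0."""
    return (v > 0) - (v < 0)


def local_subgradient(w, b, samples, C, alpha, num_nodes):
    """Subgradient of an assumed local cost with respect to (w, b).

    samples: list of (z, rank), z the mapped features, rank in 1..Q.
    """
    g_w = [((1.0 - alpha) * w_j + alpha * sign(w_j)) / num_nodes for w_j in w]
    g_b = [0.0] * len(b)
    for z, rank in samples:
        score = sum(w_j * z_j for w_j, z_j in zip(w, z))
        for q, b_q in enumerate(b, start=1):
            if rank <= q:
                d = hinge_subgrad(b_q - score)   # loss L(b_q - w.z)
                dir_w, dir_b = -1.0, 1.0
            else:
                d = hinge_subgrad(score - b_q)   # loss L(w.z - b_q)
                dir_w, dir_b = 1.0, -1.0
            for j, z_j in enumerate(z):
                g_w[j] += C * d * dir_w * z_j
            g_b[q - 1] += C * d * dir_b
    return g_w, g_b


g_w, g_b = local_subgradient([1.0, 0.0], [0.0], [([-0.5, 0.0], 1)],
                             C=1.0, alpha=0.0, num_nodes=1)
print(g_w, g_b)
```

Only the node's own samples appear in the loop, so this quantity can be computed without any communication.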
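If we use the following notations for convenience
$$c_{mn} = 2\eta_k \lambda_{mn}, \ n \in \mathcal{N}_m, \qquad c_{mm} = 1 - \sum_{n \in \mathcal{N}_m} c_{mn}, \tag{19}$$
the iterative equation can be rewritten as
$$\theta_m^{k+1} = \sum_{n \in \mathcal{N}_m \cup \{m\}} c_{mn} \theta_n^k - \eta_k\, \partial J_m(\theta_m^k). \tag{20}$$
It can be divided into two steps, i.e., a combination step and an adaption step:
$$\varphi_m^k = \sum_{n \in \mathcal{N}_m \cup \{m\}} c_{mn} \theta_n^k, \tag{21}$$
$$\theta_m^{k+1} = \varphi_m^k - \eta_k\, \partial J_m(\theta_m^k). \tag{22}$$
In Combination Step (21), node m combines the parameters estimated by its neighbors and itself to obtain an intermediate estimate $\varphi_m^k$, where the combination coefficient of node m and its neighbor n is denoted as $c_{mn}$. In Adaption Step (22), node m uses the subgradient calculated by using only its own data to update $\theta_m$.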
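The combination and adaption steps can be sketched for a whole network in one synchronous round. This is a minimal illustration with hypothetical names (`dsvor_round`, `thetas`, `weights`, `subgrads`); it assumes the local subgradients have already been evaluated at the current estimates and that `weights[m]` holds the combination coefficients of node m for itself and its neighbors.

```python
def dsvor_round(thetas, weights, subgrads, eta):
    """One synchronous round: combine neighbor estimates, then adapt.

    thetas:   dict node -> current parameter vector (list of floats).
    weights:  dict node -> dict {neighbor or self: combination coefficient}.
    subgrads: dict node -> local subgradient evaluated at thetas[node].
    eta:      positive step size for this round.
    """
    updated = {}
    for m, theta_m in thetas.items():
        # Combination step: weighted average over the node and its neighbors.
        combined = [0.0] * len(theta_m)
        for n, c_mn in weights[m].items():
            for j, value in enumerate(thetas[n]):
                combined[j] += c_mn * value
        # Adaption step: move against the node's own local subgradient.
        updated[m] = [c - eta * g for c, g in zip(combined, subgrads[m])]
    return updated


thetas = {0: [0.0], 1: [2.0]}
weights = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
subgrads = {0: [1.0], 1: [-1.0]}
print(dsvor_round(thetas, weights, subgrads, eta=0.1))
```

Each node only reads the estimates of the nodes listed in its own row of `weights`, which corresponds to exchanging parameters with neighbors rather than raw data.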
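Combination coefficients $\{c_{mn}\}_{\forall (m,n) \in \mathcal{E}}$ represent a cooperation rule among nodes. Equation (19) is not used to define $\{c_{mn}\}$ because $\lambda_{mn}$ is not defined in advance. In distributed algorithms, combination coefficients are generally determined by a certain cooperative protocol. In this study, we use the Metropolis rule [34]:
$$
c_{mn} =
\begin{cases}
\dfrac{1}{\max(|\mathcal{N}_m|, |\mathcal{N}_n|)}, & n \in \mathcal{N}_m, \\
1 - \sum_{l \in \mathcal{N}_m} c_{ml}, & n = m, \\
0, & \text{otherwise},
\end{cases}
\tag{23}
$$
where $|\mathcal{N}_m|$ denotes the degree of node m, and C denotes the M × M matrix whose entries are defined by (23). Equation (19) shows that $\lambda_{mn} = c_{mn} / (2\eta_k)$.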
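A small sketch of how such combination coefficients could be computed from the neighbor sets is given below. It assumes the 1/max(degree) form of the Metropolis rule, in which a node's degree is the number of its neighbors; the function name `metropolis_weights` and the example graph are hypothetical.

```python
def metropolis_weights(neighbors):
    """Combination coefficients from neighbor sets (assumed 1/max-degree form).

    neighbors: dict node -> set of neighbor nodes (undirected, no self-loops).
    Returns dict node -> dict {neighbor or self: coefficient}.
    """
    weights = {}
    for m, nbrs in neighbors.items():
        row = {n: 1.0 / max(len(nbrs), len(neighbors[n])) for n in nbrs}
        row[m] = 1.0 - sum(row.values())   # self-weight makes the row sum to 1
        weights[m] = row
    return weights


# A triangle 0-1-2 with an extra node 3 attached to node 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
for node, row in metropolis_weights(graph).items():
    print(node, row)
```

Each coefficient depends only on the degrees of the two endpoints, so a node can compute its own row after learning how many neighbors each of its neighbors has.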
The whole process of dSVOR is summarized in Algorithm 1.

Algorithm 1 dSVOR
for k = 0, 1, 2, . . . do
    for each node m ∈ M do
        Combination step: combine the estimates of node m and its neighbors via (21).
        Adaption step: update θ m with the local subgradient via (22).
    end for
end for

Remark 1. In the above problems, φ(·) is a nonlinear mapping function that maps input x into an RKHS for classification, and input x is the original data or extracted features. In general, function φ(·) can also be regarded as a generalized feature mapping function that extracts features of x and maps x into a feature space for classification. Thus, an artificial neural network with learnable parameters can also be used as φ(·). However, that may destroy the convexity of the problem, so that the algorithm is no longer guaranteed to converge to the global optimum.

Theoretical Analysis
In this subsection, we theoretically analyze the consensus and convergence of dSVOR. We first introduce a reasonable assumption that is needed in analysis. According to [34], when the graph is not bipartite, this assumption can be guaranteed.
where C is the combination coefficient matrix set as in Equation (23).
Then, we give two theorems, one on consensus and one on convergence.
Theorem 1 (Consensus). If Assumption 1 holds, and step size η k satisfies Condition (17), then Theorem 2 (Convergence). If Assumption 1 holds, and step size η k satisfies Condition (17), then For the proof, see Appendices A and B for details.

Experiments
In this section, we carry out experiments on synthetic data and a real-world example to demonstrate the performance of the proposed dSVOR algorithm.
We implemented the following algorithms for comparison:

1. centralized SVOR (cSVOR), which relies on all the data available in a central node;
2. the proposed distributed SVOR (dSVOR);
3. distributed SVOR with a noncooperative strategy (ncSVOR). In ncSVOR, each node uses only its own data to train a model without any information exchanged with other nodes.

All the algorithms were implemented using the PyTorch framework [35]. There are three points to emphasize:

1. The centralized method needs data in a central node. For comparison, we artificially collected all the data distributed in the nodes of the network together to render it applicable, which is impractical in reality.
2. In cSVOR [11,12], problems were solved by the SMO algorithm instead of subgradient-based algorithms, so we only display its final results.
3. The distributed algorithms were subject to additional constraints, so a distributed algorithm is generally satisfactory if it can achieve comparable performance to the corresponding centralized algorithm.

In this study, we used the prediction accuracy (ACC) and mean absolute error (MAE) on the testing set as the performance evaluation metrics. ACC is a commonly used metric in classification problems, but it does not consider the ordering of the labels. MAE is the mean absolute deviation of the predicted rank from the true one, and is commonly used in ordinal regression. Using a function O(·) to denote the position of a certain label in the ordinal scale, i.e., $O(C_q) = q$, q = 1, . . . , Q, we have
$$\mathrm{MAE} = \frac{1}{N_t} \sum_{i=1}^{N_t} \left| O(\hat{y}_i) - O(y_i) \right|,$$
where $N_t$ is the number of testing samples, and $\hat{y}_i$ and $y_i$ are the predicted and true labels of the i-th testing sample.

The performance of the distributed algorithms (dSVOR and ncSVOR) is defined as the mean performance of the models obtained by each node. The distributed algorithms ran on a randomly generated connected network that consisted of 20 nodes. For fair comparison, on a given dataset, all implemented algorithms used the same parameters. All the results were obtained by averaging the results of 10 independent experiments.
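The two metrics can be computed directly from label positions. The sketch below assumes labels are already encoded by their positions 1, ..., Q; the function names are hypothetical.

```python
def accuracy(true_ranks, pred_ranks):
    """Fraction of samples whose predicted rank equals the true rank."""
    hits = sum(t == p for t, p in zip(true_ranks, pred_ranks))
    return hits / len(true_ranks)


def mean_absolute_error(true_ranks, pred_ranks):
    """Mean absolute deviation between predicted and true label positions."""
    total = sum(abs(t - p) for t, p in zip(true_ranks, pred_ranks))
    return total / len(true_ranks)


true_ranks = [1, 2, 3, 4]
pred_ranks = [1, 3, 3, 1]
print(accuracy(true_ranks, pred_ranks))             # 0.5
print(mean_absolute_error(true_ranks, pred_ranks))  # 1.0
```

In this example both wrong predictions count equally toward ACC, while MAE penalizes the prediction that is three positions off more heavily than the one that is one position off.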

Synthetic Data
In this subsection, we evaluate the performance of all algorithms on two synthetic datasets. On the first dataset, samples could be separated by a set of parallel straight lines if noise is ignored, and samples of the second dataset could be separated by a set of concentric circles. Figure 2a,b show some samples of these two datasets from one of the 10 independent experiments. Both datasets had 1200 samples: 1000 were used as the training set, and the remaining 200 as the testing set. The training samples were randomly assigned to 20 nodes to simulate the situation where the data were collected and stored by these nodes in a distributed manner.

These two synthetic datasets were generated with the following methods. For the first dataset, we generated 1200 samples with uniform distribution from a rectangular area, which was divided by three parallel lines into 4 parts for 4 classes. The data labels were determined by their locations. Then, Gaussian noise with 0 mean and $\sigma_1$ standard deviation was added to each dimension of input vector $x = [x_1\ x_2]^T$. After that, these samples were rotated around the origin by an angle β. The parameters were fixed in the experiments without loss of generality.

For the second dataset, we generated 1200 samples with uniform distribution from a disk $x_1^2 + x_2^2 < R^2$, which could be divided into four parts by three concentric circles $x_1^2 + x_2^2 = R_q^2$, q = 1, 2, 3. The data labels were determined by their locations. Then, Gaussian noise with 0 mean and $\sigma_2$ standard deviation was added to each dimension of input vector $x = [x_1\ x_2]^T$. Without loss of generality, the parameters included R = 4 and $R_1 = 1$.

On the first dataset, we used a linear kernel function. In all methods, positive constant C was set to be 1000/N, where N is the number of samples of all nodes. Because the feature space was only 2-dimensional, the sparse regularization term in our method was not necessary. Thus, we set the coefficient of the sparse regularization term to α = 0.
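A dataset of the second kind, and its random assignment to nodes, could be generated along the following lines. Only R = 4 and R1 = 1 come from the text; the other two radii (2 and 3), the noise level 0.1, the seeds, the equal-size round-robin split, and the function names are all hypothetical placeholders.

```python
import math
import random


def make_ring_dataset(n, radii=(1.0, 2.0, 3.0), outer=4.0, noise_std=0.1, seed=0):
    """Samples uniform on a disk of radius `outer`, labeled by concentric rings.

    Hypothetical parameters: radii[1], radii[2], noise_std and seed.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        r = outer * math.sqrt(rng.random())      # sqrt gives uniform area density
        angle = rng.uniform(0.0, 2.0 * math.pi)
        rank = 1 + sum(r >= bound for bound in radii)   # label from clean location
        x = (r * math.cos(angle) + rng.gauss(0.0, noise_std),
             r * math.sin(angle) + rng.gauss(0.0, noise_std))
        data.append((x, rank))
    return data


def split_across_nodes(samples, num_nodes=20, seed=0):
    """Shuffle and deal samples to nodes (equal shares are an assumption)."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[m::num_nodes] for m in range(num_nodes)]


data = make_ring_dataset(1200)
train, test = data[:1000], data[1000:]
node_data = split_across_nodes(train)
print(len(node_data), len(node_data[0]), len(test))
```

Labels are assigned before the noise is added, following the order of operations in the description, so samples near a ring boundary can end up on the wrong side of it.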
In the distributed algorithm, we used the following diminishing step size: which satisfied Condition (17). In (26), parameter η 0 determines the initial step size, and τ determines the decreasing rate of the diminishing step size. We empirically set η 0 = 0.1 and τ = 0.01 in the following experiments. Figure 3a,b show the ACC and MAE curves of different algorithms on the first synthetic dataset. As time increased, the MAE of our dSVOR algorithm decreased, and the ACC increased significantly. After about 500 iterations, the dSVOR algorithm converged to a value that was almost the same as that of cSVOR, while the result of ncSVOR was still some distance away from them. This means that it was not enough for a single node to train a model with good performance using its own data. The proposed dSVOR algorithm, which uses the local data of each node and the parameter estimates from neighbor nodes, could achieve a similar performance to that of the corresponding centralized method.   Figure 4 gives the parameters of each node estimated by different algorithms. In the ncSVOR algorithm, the estimated parameters obtained by different nodes were quite different. Thus, the model obtained by each node with its own data was quite different from the model trained using all the data. In contrast, the estimated parameters of different nodes in dSVOR were almost the same as the parameters in cSVOR. This illustrates the consensus of the proposed dSVOR algorithm. Because we used a linear kernel function here, optimal direction w in the centralized method had an explicit expression that allowed for us to compare it with the estimates of the distributed algorithms. In the following experiments using nonlinear kernel functions, we do not give the results about consensus. On the second dataset, we used a Gaussian kernel function. The kernel size was set to be σ = 1 K after Z-score normalization, where K is the dimension of input space. 
In all methods, positive constant C was set to be 1000/N. As analyzed before, in our method, we set a relatively large D and a relatively large α: D = 200, α = 0.9. We did not set α to 1 because we wanted to use the strong convexity of the l 2 -norm regularization term to increase the convexity of the objective function, which is theoretically beneficial to the optimization of the problem. The learning rate parameters were still set to be η 0 = 0.1 and τ = 0.01. Figure 5a,b show the ACC and MAE curves of different algorithms on the second synthetic dataset. The proposed dSVOR algorithm was able to obtain almost the same result as that of the centralized method, while ncSVOR could not. We also conducted experiments under different hyperparameters D and α to show the parameter sensitivity of dSVOR. Figure 6 gives the MAEs of dSVOR for different D when α was fixed at 0.9. As D increased, the performance of dSVOR gradually improved and was eventually almost the same as that of the centralized method. With a relatively large approximation dimension D ≥ 100, dSVOR could always obtain a similar MAE to that of cSVOR. However, as mentioned before, an overlarge D may cause redundancy. So, when using a large D to ensure good performance, it is better to use the sparse regularization term to reduce the redundancy. Figure 7a,b give the MAEs of dSVOR and the proportions of dimensions of w m that were equal to 0 for different α when D was fixed at 200. The MAE was stable under different α, but the sparsity of w m was greatly affected by α. A small α led to a dense w m , which caused a lot of redundancy. A large α led to a sparse w m , whose dimensions that converged to 0 no longer needed to be stored, calculated, or transmitted, thus saving storage, computation, and communication resources.

A Real-World Example
We now take the distributed fault severity diagnosis of rolling element bearings as a real-world example to illustrate the effectiveness of dSVOR.
Rolling element bearings are widely used in factory equipment. The fault severity diagnosis of bearings is a crucial task to ensure reliability in industrial processes. In recent years, data-driven methods have been widely used to identify faults and their severity [36]. To achieve good performance, these data-driven methods usually require a lot of data. However, due to the rarity of faults, a single sensor can only collect very few fault data, and the faults encountered by each factory may also be different. Thus, data from many sensors in many factories are needed to train a proper model. Sometimes, factories may not want to leak the data about their equipment, so the data cannot be transmitted to others. Centralized methods, which need all the data to be available in a central node, then become inapplicable, and distributed methods become a better choice. Because fault severity carries ordinal information, the proposed dSVOR algorithm is a suitable choice.
In this study, we used the rolling element bearings data provided by the Case Western Reserve University (CWRU) [37] for experiments. CWRU data were the vibration signals of drive end and fan end bearings collected by sensors at 12,000 and 48,000 samples/s under four different loads of 0-3 hp. There are three types of faults: outer race (OR), inner race (IR), and ball (B) faults, and each type has at most four severity levels (fault width: 0.18, 0.36, 0.53, 0.71 mm). In the experiments, we used drive end bearing data collected at 12,000 samples/s, and performed 4-level fault severity diagnosis in a total of 12 situations (3 different fault types and 4 different loads).
We adopted the feature based on permutation entropy (PE) proposed in [38] as the input x. For each sample, we extracted a sequence of length 2400 from the vibration signal data. This sequence was decomposed into a series of intrinsic mode functions (IMFs) by ensemble empirical mode decomposition (EEMD) with 100 ensembles and 0.2 noise amplitude to capture information on multiple time scales. Then, the PE values of the first 5 IMFs were calculated as the input feature of this sample.
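The permutation entropy step of this feature pipeline can be sketched as below; the EEMD decomposition is not shown. The embedding order 3, the delay 1, the normalization by log(order!), and the function name are assumptions for illustration, since the text does not state the settings used in [38].

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Permutation entropy of a 1-D sequence (natural logarithm).

    `order` and `delay` are hypothetical defaults. Ties inside a window
    are broken by position, because Python's sort is stable.
    """
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series is too short for this order and delay")
    counts = Counter()
    for start in range(n_windows):
        window = [series[start + j * delay] for j in range(order)]
        # Ordinal pattern: indices of the window sorted by value.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] += 1
    entropy = -sum((c / n_windows) * math.log(c / n_windows)
                   for c in counts.values())
    if normalize:
        entropy /= math.log(math.factorial(order))
    return entropy


signal = [4, 7, 9, 10, 6, 11, 3]
print(permutation_entropy(signal, order=3, delay=1))
```

In the setting described above, one such value would be computed for each of the first 5 IMFs, giving a 5-dimensional input vector per sample.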
For each fault severity level, we randomly took 300 training samples and 200 testing samples, and the samples in the testing set were different from those in the training set. For 4-level fault level diagnosis, there were a total of 1200 training samples and 800 testing samples. These training samples were randomly assigned to 20 nodes to simulate the situation where the data were collected and stored by these nodes in a distributed manner.
In the experiments, we used a Gaussian kernel function with kernel size σ = 1 K after Z-score normalization. In all methods, positive constant C was set to be 10,000/N. In our method, we still set a relatively large hyperparameter D and α, D = 200, α = 0.9. The other parameters used the same settings as before, i.e., η 0 = 0.1 and τ = 0.01. Table 1 shows the experimental results where the value was the mean ± standard deviation of 10 independent experiments. The performance of ncSVOR was worse than that of cSVOR because each node only had part of the training samples that were not enough to represent the entire training set to train a proper model. Compared to ncSVOR, the proposed dSVOR algorithm could achieve similar results to those of cSVOR. In dSVOR, each node can only use the data of its own and exchange some estimated parameters with neighbor nodes. It was satisfactory to be able to achieve performance close to that of the centralized method that uses all the data from all nodes. Taking the dataset of the IR fault type and 0 hp load as examples, we also show the results of dSVOR under different hyperparameters D and α in Figures 8 and 9. Figure 8 shows that, with a relatively large random approximation dimension D ≥ 100, dSVOR could obtain a similar MAE to that of cSVOR, which illustrates the effectiveness of the random approximation. Figure 9 shows that a relatively large α can lead to a sparse w m without affecting the MAE performance, thus effectively reducing redundancy.

Conclusions
When data are collected and stored in a distributed manner by multiple nodes, and are difficult to transmit to a central node, existing centralized ordinal regression methods become inapplicable. To handle the ordinal regression problem in such distributed scenarios, we extended the SVORIM to a distributed version, and derived a distributed SVOR (dSVOR) algorithm. In dSVOR, each node combines the parameters estimated by its neighbors and performs local calculations using only its own data. After convergence, each node can obtain a model whose performance is close to that obtained by the centralized method relying on all the data available in a central node. Theoretically, we analyzed the consensus and the convergence of dSVOR. Practically, we carried out experiments on synthetic data and a real-world example to illustrate its effectiveness.
In our future work, we intend to consider how to automatically determine the proper parameters in dSVOR, e.g., by introducing multi-kernel learning to automatically find suitable parameters of the random approximation. We also aim to design adaptive strategies for adjusting the combination coefficients.