1. Introduction
Clustering is an unsupervised learning technique in which a set of unlabeled data points is partitioned into clusters to reveal intrinsic structures and patterns within the data [1]. It is widely applied in fields such as image segmentation [2], speech recognition [3], and text mining [4], and can be categorized into various types, including hierarchical, partition-based, density-based, model-based, grid-based, graph-based, and fuzzy clustering.
However, the results of a single clustering method may vary due to differences in algorithms and initialization strategies, or due to the risk of falling into local optima. To address this, ensemble clustering was proposed [5,6]. The core idea of ensemble clustering is to combine multiple clustering results, exploiting their diversity and complementarity to produce more stable, accurate, and robust final outcomes. Early studies employed simple voting methods to aggregate cluster labels for each data point [7,8]. The effectiveness of these approaches was limited by their inadequate consideration of base clustering fusion strategies, including the lack of evaluation and weighting of base clustering quality. Furthermore, these methods failed to capture the intrinsic local structures of the data.
A popular recent approach is the co-association matrix method, which constructs a symmetric co-association matrix from the frequencies with which data points appear in the same cluster across base clustering results. A clustering algorithm is then applied to the co-association matrix to generate the final partitioning [9,10,11,12,13,14]. The symmetry of this matrix arises naturally from the reciprocal nature of co-occurrence: if sample i is clustered with sample j, then j is also clustered with i. Since the co-association matrix reflects pairwise similarities between samples, it can serve as a similarity or adjacency matrix.
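The construction described above can be sketched in a few lines of NumPy; the function name and the toy partitions below are illustrative, not part of any cited method:

```python
import numpy as np

def co_association(base_labels):
    """Build the co-association matrix from a list of base clusterings.

    base_labels: list of 1-D integer arrays, each assigning every sample
    to a cluster in one base clustering.
    Returns an n x n symmetric matrix whose (i, j) entry is the fraction
    of base clusterings that place samples i and j in the same cluster.
    """
    n = len(base_labels[0])
    ca = np.zeros((n, n))
    for labels in base_labels:
        # Outer comparison marks every co-clustered pair in this partition.
        ca += (labels[:, None] == labels[None, :]).astype(float)
    return ca / len(base_labels)

# Two base clusterings of four samples.
partitions = [np.array([0, 0, 1, 1]), np.array([0, 1, 1, 1])]
ca = co_association(partitions)
assert np.allclose(ca, ca.T)   # co-occurrence is symmetric
assert ca[0, 1] == 0.5         # samples 0 and 1 agree in 1 of 2 runs
```

Because the resulting matrix is symmetric and non-negative, it can be handed directly to any similarity-based clustering routine, as the text notes.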
Clustering methods based on similarity matrices can then be applied to the co-association matrix, yielding ensemble clustering results through graph partitioning [15]; such graph-partitioning-based clustering methods have been extensively studied [16,17,18,19]. To further enhance the expressive power of the co-association (CA) matrix and improve the stability of clustering results, Bian et al. proposed a fuzzy-neighbor-based CA matrix learning method called EC–CA–FN [20]. This method simultaneously considers intra-cluster and inter-cluster relationships among samples and introduces a fuzzy index to control the degree of fuzziness, thereby significantly improving the robustness and stability of clustering performance. In addition, Bai et al. addressed the limitation of traditional k-means in identifying nonlinearly separable structures by proposing a multiple-k-means ensemble algorithm [21]. By extracting locally credible labels and constructing a cluster relationship graph, their approach enables linear clusterers to simulate nonlinear partitioning, expanding the applicability of ensemble clustering to more complex data structures. For large-scale data, Zhang et al. introduced a fast spectral ensemble clustering algorithm called FSEC, which is based on anchor graphs [22]. By performing spectral embedding only once and leveraging anchor-based graph construction along with singular value decomposition (SVD) for dimensionality reduction, the algorithm significantly reduces computational cost while maintaining high clustering quality on large datasets. Deep ensemble clustering methods have also gained popularity. For example, Hao et al. [18] proposed ECAR, an ensemble clustering method that incorporates attention mechanisms and joint optimization strategies, demonstrating the potential of combining deep learning and ensemble clustering. In multi-view data contexts, Huang et al. [23] introduced FastMICE, which focuses on scalability and efficiency for large-scale data.
Although ensemble clustering has shown promise in improving clustering quality, its results can exhibit implicit unfairness, a problem that has been largely overlooked. In recent years, machine learning technologies have played an important role in data-driven decision-making. However, traditional algorithms may introduce biases when handling data involving sensitive attributes (e.g., gender, race, or age), leading to systemic neglect or unfair treatment of certain groups. This undermines the trustworthiness of algorithms in real-world applications and risks exacerbating societal inequalities [24,25]. Designing fair clustering algorithms has thus become an important research area.
The practical urgency of fair clustering is evident in applications such as resume screening or credit scoring, where algorithmic grouping may reinforce societal biases if sensitive attributes are ignored. Ensuring fairness in such unsupervised learning stages is therefore critical for developing trustworthy and equitable machine learning systems. There are two main concepts of fairness in clustering: group fairness and individual fairness. In this study, we focus on group fairness, in which the goal is to ensure proportional representation of sensitive groups while maintaining clustering quality, thereby preventing systemic neglect of any group [26]. For instance, Chierichetti et al. proposed the fairlet decomposition method, which partitions data into small, fair “fairlets” for clustering to meet group fairness constraints; however, this method has high computational complexity. Building on this, Backurs et al. [27] developed a near-linear-time fair clustering algorithm that significantly improves efficiency for large-scale datasets. Bera et al. [28] extended k-means clustering by incorporating upper- and lower-bound constraints into the optimization objective, enabling flexible control over group proportions. In spectral clustering, Kleindessner et al. [29] enforced group fairness by adding linear constraints to a Laplacian matrix optimization objective, ensuring that the group proportions in clustering results closely match their overall distribution while maintaining low RatioCut or NCut values, which indicate high clustering quality. Ghadiri et al. [30] proposed a cost-balance-based fairness notion that minimizes clustering cost disparities across groups.
Individual fairness, in contrast, emphasizes that similar individuals should be treated similarly in clustering results, for example by imposing constraints on pairwise distances between samples [31]. Although some fairness studies have focused on single clustering algorithms, their methods are often specific to a given algorithm and difficult to generalize. Ensemble clustering offers a more flexible framework by synthesizing results from distinct base clustering models without accessing raw data features. However, fairness in ensemble clustering has been underexplored.
Unlike prior fair clustering methods that enforce fairness within a single clustering model, our approach embeds fairness directly into an ensemble consensus process, where multiple base clusterings are aggregated under fairness constraints. The only existing study on fair ensemble clustering, by Zhou et al. [32], introduced fairness regularization into the consensus objective to balance fairness with clustering quality, assuming equal capacity for each cluster.
To address scenarios where cluster sizes may naturally vary, we build upon the parameter-free ensemble framework of Nie et al. [6] and incorporate the proportional fairness notion of Chierichetti et al. [26]. Our method ensures that the distribution of sensitive attributes within each cluster aligns with the global distribution, without imposing equal cluster capacity. A key feature of our formulation is the consistent use of a symmetric similarity matrix, which directly encodes pairwise sample relationships. This symmetric representation better preserves the relational structure within the data and promotes stable, efficient convergence during the consensus optimization. The key innovation lies in the joint optimization framework that simultaneously learns this fair similarity matrix A and a label matrix Y, embedding fairness directly into the consensus process and avoiding post-processing. This enables dynamic, fairness-aware clustering that adapts to both data structure and group proportions.
The main contributions of this study are as follows:
We address fairness in ensemble clustering, making it applicable to diverse clustering scenarios and improving the robustness and accuracy of clustering results.
We propose a fast iterative optimization framework based on coordinate descent to solve the optimization problem.
We present experimental results demonstrating that our method effectively balances clustering quality and fairness, validating its effectiveness.
2. Related Work
2.1. Notation
Here, we define some notation used throughout this paper. The transpose of a matrix X is denoted by X^T, and A ∈ R^{n×n} represents the weighted adjacency (similarity) matrix. The Laplacian matrix is given by L = D − A, where D is the degree matrix, and Y ∈ {0,1}^{n×c} is the clustering indicator matrix with elements y_ij, where y_ij = 1 if the i-th sample belongs to the j-th cluster and y_ij = 0 otherwise. The binary indicator matrices of the base clusterings (BCs) are denoted by Y^(l), with the number of partitions in the l-th BC represented by c_l; the dimension of Y^(l) is therefore n × c_l. The true number of clusters in the dataset X is c. The balance matrix of each partition is a diagonal matrix whose diagonal elements weight the clusters of that partition.
2.2. Ensemble Clustering
Ensemble clustering, also known as consensus clustering, aims to enhance clustering robustness by integrating multiple BCs. Traditional clustering methods, such as k-means and spectral clustering, often suffer from sensitivity to initialization and parameter settings. By aggregating the outputs of diverse BCs, ensemble clustering mitigates the biases of individual models and improves overall performance.
A common approach is to construct a co-association matrix that records how often data points are clustered together across different BCs. However, this method assumes equal contributions from all BCs, ignoring differences in their quality and reliability. To address this limitation, weighted ensemble clustering frameworks assign varying importance to BCs and clusters. However, many existing methods use fixed weights determined by predefined metrics, limiting their adaptability.
A more advanced approach is to introduce a parameter-free dynamic weighting mechanism that learns a structured similarity matrix A directly from multiple BCs [6]. Unlike traditional methods that rely on a co-association matrix, this framework dynamically adjusts the weights of BCs and clusters to reflect their quality. The weight of a given BC is computed on the basis of its consistency with the learned matrix A, while the cluster-level weights address cluster imbalances within each BC. From Theorem 1 in [6], to ensure that A has exactly c connected components, thereby allowing the samples to be partitioned into c classes, the rank of the Laplacian matrix of A should be equal to n − c. Therefore, a rank constraint is imposed on the learned similarity matrix A.
The objective function is formulated as:

Here, Y^(l) represents the label matrix of the l-th BC, and the weight term is defined as:
To optimize the objective, we perform iterative updates. The rank constraint on the Laplacian matrix is non-convex and difficult to enforce directly, so we transform it during the optimization process. Since the Laplacian is a positive semi-definite matrix, its eigenvalues are non-negative, and the rank constraint can be converted into the condition that its c smallest eigenvalues are zero. When optimizing A with the other variables fixed, the problem can be written as:
Here, the regularization parameter should be sufficiently large. According to Ky Fan’s theorem [33], we have the following equality:

where F is an orthogonal n × c matrix. This transformation allows us to replace the non-convex rank constraint with a trace minimization over an orthogonal matrix, which can be optimized efficiently. Therefore, the problem can be rewritten as:
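Ky Fan’s theorem can be illustrated numerically: the minimum of the trace over orthonormal matrices F is attained by the eigenvectors of the c smallest eigenvalues, and equals their sum. The random similarity matrix below is purely for demonstration:

```python
import numpy as np

# Hypothetical small similarity matrix; L = D - W is its Laplacian.
rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2          # symmetrize
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

c = 2
eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order

# Ky Fan: min over orthonormal F in R^{n x c} of tr(F^T L F) equals the
# sum of the c smallest eigenvalues, attained at their eigenvectors.
F = eigvecs[:, :c]
assert np.isclose(np.trace(F.T @ L @ F), eigvals[:c].sum())
```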
The alternating optimization for updating A and F is carried out next. By iteratively updating A, F, and the weights, the framework adapts to the diversity and quality of the BCs. This parameter-free ensemble clustering approach dynamically balances the contributions of BCs and clusters, achieving superior performance across diverse datasets and establishing itself as a robust and efficient clustering solution. A more detailed exposition of this algorithm and optimization process has been presented by Nie et al. [6].
2.3. Fairness
In this work, we focus exclusively on group fairness, which ensures that each cluster’s sensitive group distribution aligns with the global distribution. Individual fairness, which would require similar treatment for similar individuals, is not enforced here, as our objective is to prevent systemic bias against protected groups at the cluster level. This aligns with real-world applications such as demographic-balanced resource allocation.
Group fairness refers to the principle that groups with different sensitive attribute values should receive similar treatment in model decisions. Following the fairness definition proposed by Chierichetti et al. [26], we formalize fairness as follows. Assume a dataset X is divided into w protected groups, where each group corresponds to a specific sensitive attribute value, and the data are clustered into c partitions. The proportion of a given group in the overall dataset is given by:

The proportion of that group within the v-th cluster is:

The fairness of the v-th cluster can then be expressed as:
Here, the tolerance parameter represents the allowable deviation, which defines the acceptable range of fairness. To achieve fairness, the distribution of each protected group within every cluster is required to align closely with its global distribution.
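The proportions above are straightforward to compute; the helper below is an illustrative sketch (the function name and toy data are our own, not from the cited definition):

```python
import numpy as np

def group_proportions(groups, clusters, c):
    """Global and per-cluster proportions of each sensitive group.

    groups:   length-n array of group ids in {0, ..., w-1}
    clusters: length-n array of cluster ids in {0, ..., c-1}
    """
    w = groups.max() + 1
    n = len(groups)
    global_prop = np.bincount(groups, minlength=w) / n
    cluster_prop = np.zeros((c, w))
    for v in range(c):
        members = groups[clusters == v]
        cluster_prop[v] = np.bincount(members, minlength=w) / len(members)
    return global_prop, cluster_prop

# Six points, two sensitive groups, two clusters.
groups = np.array([0, 0, 1, 1, 0, 1])
clusters = np.array([0, 1, 0, 1, 0, 1])
g, p = group_proportions(groups, clusters, c=2)

# A clustering is fair when every per-cluster proportion stays within
# the allowable deviation of the global proportion.
delta = 0.25
fair = np.all(np.abs(p - g) <= delta)
```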
Unlike single-model fair clustering methods, our approach enforces fairness at the ensemble level. Moreover, instead of imposing equal cluster sizes, we require each cluster’s sensitive attribute distribution to match the global proportion. This proportional fairness is integrated directly into the joint optimization of the similarity matrix and cluster labels, ensuring fairness is intrinsic to the consensus process.
3. Proposed Algorithm
In Section 2.2, the ensemble clustering algorithm of Nie et al. [6] was discussed, with Equation (1) describing the learning of a similarity matrix A from multiple BCs. In our proposed algorithm, a fairness objective function is added so that both the similarity matrix and the clustering result are fair.
3.1. Fair Ensemble Clustering
We assume that the dataset V consists of w groups. Following the fairness concept introduced in Section 2.3, we ensure that the distribution of the different groups within each cluster matches the global distribution. Consider the group matrix, a binary matrix defined as:
We then construct the following formulation to enforce fairness:

where Y represents the label matrix of the final clustering result. This fairness term penalizes deviations from the global group distribution in each cluster, acting as a soft proportional constraint.
Let B denote the group-by-cluster count matrix: its element in row i and column j represents the number of points with the i-th sensitive attribute in the j-th cluster. Multiplying B by the inverse of the diagonal matrix of cluster sizes yields the proportional distribution of groups within each cluster. The matrix U is constructed to reflect the global distribution of the different groups: each column of U is the same vector u, whose i-th entry represents the proportion of the i-th group in the entire dataset. By adjusting Y, we ensure that the distribution of the different groups in each cluster matches the global distribution represented by U, thereby achieving fairness in the clustering results.
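Assuming one-hot group and label matrices G and Y (our notation for this sketch; the paper’s symbols may differ), the quantities described above can be computed as:

```python
import numpy as np

# Hypothetical binary group matrix G (n x w) and label matrix Y (n x c);
# both are one-hot per row, matching the definitions in the text.
G = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [0, 1]])
Y = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

# B[i, j] counts points of group i assigned to cluster j.
B = G.T @ Y
cluster_sizes = Y.sum(axis=0)

# Column-normalizing B gives each cluster's group proportions.
P = B / cluster_sizes

# U repeats the global group-proportion vector u in every column.
u = G.mean(axis=0)
U = np.tile(u[:, None], (1, Y.shape[1]))

# A small gap between P and U indicates a fair clustering.
fairness_gap = np.linalg.norm(P - U)
```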
The proposed fairness constraint is formulated using the label matrix Y to ensure that the proportions of the different sensitive attributes in each cluster are accurately represented, thereby making the difference from the global distribution matrix more reflective of fairness. Additionally, using the label matrix Y, we ensure the data are divided into c clusters without needing to further constrain the rank of the Laplacian matrix in Equation (1).
In Equation (1), only the similarity matrix A is learned; the label matrix is derived from A in a post-processing step. In our setting, however, we need to learn the 0–1 label matrix Y directly. To achieve this, inspired by NCUT [34], we add a normalized-cut-style term to learn the label matrix Y from the similarity matrix A.
Based on this idea and the existing ensemble clustering framework introduced in Section 2.2, we propose the following objective function:

Here, the regularization parameters act as Lagrange multipliers that control the trade-off between clustering quality and fairness. A larger fairness weight places greater emphasis on satisfying the proportional fairness constraint, which corresponds to enforcing a smaller allowable deviation in Equation (8).
3.2. Optimization
We sequentially update A, Y, and the remaining variables, keeping the others fixed while updating each one. The auxiliary quantities are initialized first, and Y and the remaining variables are randomly initialized.
3.2.1. Updating A
The subproblem can be formulated as:
Referring to the algorithm for updating A described in Section 2.2, the above problem can be rewritten as:
where the auxiliary matrices are defined from the fixed variables, and g_i denotes the i-th column of G. It can be observed that the subproblems for different values of i are independent. Thus, for a specific i, the problem can be written as:
where the vectors involved are the i-th columns of A and of the corresponding fixed matrix, respectively. After a suitable change of variables, the above problem can be transformed into:
To solve this problem, we use the Lagrange multiplier method proposed by Nie et al. [6], and the final solution is:

where the solution is expressed in terms of a Lagrange multiplier, which is obtained as the root of the following equation using Newton’s method:
3.2.2. Updating Y
The subproblem for updating Y can be formulated as follows:
Here, Y represents the clustering label matrix, where each data point belongs to exactly one cluster. Consequently, each row of Y has only a single element equal to 1, with all other elements 0. Using this property, we employ the coordinate descent method to update Y, treating each row as an independent variable. When updating the i-th row, all other rows remain unchanged. The update involves iterating over all possible cluster assignments, setting one column to 1 while others are set to 0, and selecting the assignment that minimizes the objective function.
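A minimal sketch of this row-wise coordinate descent is shown below. The objective is passed in as a generic callable, since the actual subproblem objective is stated in the text, and the toy data are purely illustrative:

```python
import numpy as np

def update_Y_coordinate_descent(Y, objective):
    """One sweep of row-wise coordinate descent on the one-hot matrix Y.

    objective: callable evaluating the (fixed-A) subproblem at Y; a
    stand-in for the actual objective, which we do not restate here.
    """
    n, c = Y.shape
    for i in range(n):
        best_k, best_val = 0, np.inf
        for k in range(c):
            Y[i] = 0
            Y[i, k] = 1                 # try assigning row i to cluster k
            val = objective(Y)
            if val < best_val:
                best_k, best_val = k, val
        Y[i] = 0
        Y[i, best_k] = 1                # keep the best assignment
    return Y

# Toy check: pull each point toward the nearest of two fixed centres.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
centres = np.array([[0.0], [5.0]])
obj = lambda Y: ((X - Y @ centres) ** 2).sum()
Y = np.eye(2)[[0, 1, 0, 1]]             # deliberately bad initial labels
Y = update_Y_coordinate_descent(Y, obj)
```

Because each row is a one-hot vector, the inner loop simply enumerates the c possible assignments and keeps the one with the smallest objective value, exactly as described above.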
3.2.3. Updating
The augmented Lagrangian multiplier (ALM) method is used for this update, as detailed by Nie et al. [6]. The updated value is the solution of the following augmented Lagrangian function:

where the last parameter is the penalty coefficient. The corresponding update step is given by:

The optimization problem for the remaining variable is:

The solution process follows a similar approach to the previous problems, as detailed in Algorithm 1.
| Algorithm 1 Augmented Lagrangian Multiplier (ALM) Method |
Input: Base clusterings and the co-association matrix A; initialize the auxiliary variables and the penalty coefficient. Output: The optimal solution of the subproblem.
1: while not converged do
2:   Fix the other variables and update the first block according to Equation (20);
3:   Fix the other variables and update the second block by solving problem (21);
4:   Update the Lagrange multiplier;
5:   Update the penalty coefficient;
6: end while
3.2.4. Updating
The update step follows the formula in Equation (2). The detailed algorithmic process is summarized in Algorithm 2.
| Algorithm 2 The Algorithm for Solving Problem (12) |
Input: Base clusterings. Output: The structured graph A and the consensus clustering results.
1: Initialize A, Y, and the auxiliary variables;
2: while not converged do
3:   Update A by solving Equation (13);
4:   Update Y by solving Equation (19);
5:   Update the auxiliary variable by Algorithm 1;
6:   Update the remaining variables according to Equation (2);
7: end while
3.3. Refining the Update of Y
Using coordinate descent to update Y directly results in a high computational complexity. To address this inefficiency, Nie et al. [35] proposed a faster coordinate descent approach for solving the NCUT problem. We augment their method with an improved update strategy for Y to reduce its computational complexity. The terms in Equation (1) can be rewritten as follows:
Thus, the subproblem for updating Y can be reformulated as:

The update of the r-th row can be represented as:

When updating the r-th row while keeping the other rows fixed, there are c possible update strategies, each corresponding to a one-hot vector with the k-th element set to 1. The r-th row is updated to the chosen candidate while the other rows remain fixed. Computations related to the other rows then become redundant, so the expression can be simplified to:

where the omitted term is constant.
To further optimize computational efficiency, we precompute and store four auxiliary terms. Assuming the original label of the r-th point is p, we now consider the following two scenarios:
- (1) When the candidate label equals the original label p, the cluster assignments are unchanged. The objective function then becomes:

Since the terms on the right-hand side of Equation (28) have already been precomputed, the time complexity of evaluating this expression is O(1).
- (2) When the candidate label differs from p, the point moves from its original cluster to the candidate cluster, and the affected quantities are adjusted accordingly. The objective function can then be written as:

Similarly, the time complexity of computing Equation (31) is O(1).
After updating the r-th row, it is necessary to update the stored auxiliary values. If the label of the r-th point remains unchanged, the stored values are not modified. However, if the label changes, let q denote the new label, with q ≠ p. In this case, the corresponding values are updated as follows:
These two terms have already been computed when evaluating the objective function and do not introduce additional computational complexity. The next stored term is updated as:

This computation also has a low, constant cost. Similarly, for the term involving P:
This operation has a similarly low complexity, so the total computational cost of these four update operations remains small.
For the expressions involving the remaining stored quantities, the computational complexity follows the same pattern:

These updates maintain the same computational complexity, ensuring efficient evaluation of the required terms. The refined method is summarized in Algorithm 3.
| Algorithm 3 The Algorithm for Quickly Solving Problem (12) |
Input: Base clusterings. Output: The structured graph A and the consensus clustering results.
1: Initialize A, Y, and the auxiliary variables;
2: while not converged do
3:   Update A by Equation (13);
4:   Calculate the stored auxiliary terms and keep them in memory;
5:   repeat
6:     for r = 1 to n do
7:       for k = 1 to c do
8:         Calculate the candidate objective values via Equations (28) and (31) and store them;
9:       end for
10:      Update the label of point r by choosing the index of the minimum stored value;
11:      Update the stored auxiliary terms;
12:    end for
13:  until convergence
14:  Update the auxiliary variable by Algorithm 1;
15:  Update the remaining variables according to Equation (2);
16: end while
3.4. Time Complexity Analysis
The time complexity of the proposed method can be determined as follows. The computation required for initialization is negligible. When updating A, the cost is dominated by constructing the auxiliary matrices that define the per-column subproblems. To update Y, the fast coordinate descent method is used: the initialization involves precomputing the stored terms, and evaluating the objective for each candidate assignment of a point then requires only a constant number of operations. When a better solution is found, the corresponding label update refreshes the stored terms at the same low cost. The complexity of updating Y is therefore governed by the number of outer iterations and the number of improving moves. The updates of the remaining variables are comparatively inexpensive. The overall time complexity of our method is thus the product of the per-iteration cost of these steps and t, the number of iterations.
3.5. Convergence Analysis
The proposed method employs an alternating optimization approach in which, at each step, a subproblem in the current variable is minimized while the others are held fixed. Let the objective value after the t-th update be denoted accordingly. It follows that

which implies that the objective function does not increase during any optimization step; thus, its value is monotonically non-increasing.
The objective function can be divided into three terms. The first and third terms are squared Frobenius norms and hence non-negative. The second term is also non-negative, as the following proof shows.
Proof. Let L denote the Laplacian matrix, defined as L = D − A, where D is the degree matrix. The Laplacian matrix is positive semi-definite, meaning all its eigenvalues are non-negative. Therefore, for any non-zero vector x, we have

Since Y is the clustering label matrix, for any vector z, we observe that
Thus, Y^T L Y is also a positive semi-definite matrix. Additionally, Y^T D Y is a diagonal matrix whose diagonal elements correspond to the sum of degrees within each cluster. Since D is a positive definite diagonal matrix, Y^T D Y is also positive definite; therefore, its inverse square root is positive definite as well.
Given these properties, the matrix Z obtained by symmetrically normalizing Y^T L Y with this inverse square root is positive semi-definite, so its eigenvalues are non-negative. Consequently, the trace of Z is also non-negative:

where the summands are the eigenvalues of Z.
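The argument above can be checked numerically; the random similarity matrix and labels below are illustrative only:

```python
import numpy as np

# For any Laplacian L = D - W and one-hot label matrix Y, the matrix
# Z = Y^T L Y is positive semi-definite, so tr(Z) >= 0.
rng = np.random.default_rng(1)
W = rng.random((8, 8))
W = (W + W.T) / 2          # symmetric similarity matrix
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W

Y = np.eye(3)[rng.integers(0, 3, size=8)]  # random one-hot labels
Z = Y.T @ L @ Y
eigvals = np.linalg.eigvalsh(Z)
assert eigvals.min() >= -1e-10   # all eigenvalues non-negative
assert np.trace(Z) >= 0          # hence the trace is non-negative
```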
Thus, the objective function has a lower bound:
As a result, the objective function is monotonically non-increasing and bounded below, so the sequence of objective values converges. Therefore, we can conclude that the algorithm converges. □