Article

Federated Multi-Stage Attention Neural Network for Multi-Label Electricity Scene Classification

1 Hainan Power Grid Co., Ltd., China Southern Power Grid, Haikou 570100, China
2 Digital Grid Research Institute, China Southern Power Grid, Guangzhou 510663, China
* Author to whom correspondence should be addressed.
J. Low Power Electron. Appl. 2025, 15(3), 46; https://doi.org/10.3390/jlpea15030046
Submission received: 21 April 2025 / Revised: 11 June 2025 / Accepted: 19 June 2025 / Published: 5 August 2025

Abstract

Privacy-sensitive electricity scene classification requires robust models under data localization constraints, making federated learning (FL) a suitable framework. Existing FL frameworks face two critical challenges in multi-label electricity scene classification: (1) label correlations and their strengths significantly impact classification performance, and (2) electricity scene data and labels show distributional inconsistencies across regions. Current FL frameworks lack explicit modeling of label correlation strengths, and locally trained regional models naturally capture these regional differences, leading to divergent model parameters across regions. In this scenario, the server’s standard single-stage aggregation often over-averages the global model’s parameters, reducing its discriminative ability. To address these issues, we propose FMMAN, a federated multi-stage attention neural network for multi-label electricity scene classification. The main contributions of FMMAN lie in label correlation learning and stepwise model aggregation. It splits the client–server interaction into multiple stages: (1) clients train models locally to encode features and label correlation strengths after receiving the server’s initial model; (2) the server clusters these locally trained models into K groups, so that models within a group have more consistent parameters, and generates K prototype models via intra-group aggregation to reduce over-averaging; the K models are then distributed back to the clients; (3) clients refine their models using the K prototypes with contrastive group-specific consistency regularization to further mitigate over-averaging and send the refined models back to the server; (4) finally, the server aggregates the models into a global model. Experiments on multi-label benchmarks verify that FMMAN outperforms baseline methods.

1. Introduction

In smart grid systems, accurate multi-label electricity scene classification is crucial for ensuring operational safety. However, electricity scene data, such as substation images, often contain sensitive information, raising privacy concerns. For instance, scene images of damaged dials may reveal meter readings associated with specific regions, and images of workers might disclose local geographic features or key infrastructure details. Thus, protecting the privacy of these electricity scenes is critical. Federated learning (FL) offers a promising solution, as it enables collaborative model training across distributed data sources without compromising data privacy, ensuring secure data management while maintaining recognition accuracy.
While FL preserves privacy, its performance in multi-label classification is hindered by client-level data heterogeneity and divergent label correlations across clients [1,2]. This challenge intensifies in multi-label electricity scene classification, where regional infrastructure variations (e.g., component-level scenes like blurred dials versus safety-violation scenes like no work clothes and no safety helmet) create distinct feature distributions and label semantic correlations across clients. Explicitly modeling label correlations is critical [3,4] for multi-label classification; for instance, meter dial blurring often co-occurs with mechanical damage, and safety violations like no safety helmet strongly correlate with no work clothes but weakly with smoking. Moreover, capturing varying correlation strengths can enhance multi-label classification performance [5]. Although existing FL frameworks learn client-level data distributions [6,7] and some leverage label correlations, they do not explicitly distinguish varying correlation strengths. Treating strong and weak associations (e.g., no safety helmet–no work clothes vs. no safety helmet–smoking) as equally important undermines model performance, as the critical dependencies are diluted. Moreover, as mentioned, electricity scene data and labels show distributional inconsistencies across regions. These regional inconsistencies in scene characteristics and label associations naturally lead local models to learn and adapt to these distinct features, resulting in substantial variations in model parameters across regions. When conventional single-stage global aggregation is applied on the server, it attempts to directly aggregate these diverse parameters from local models, forcing an over-averaging of the discriminative parameters that are crucial for preserving region-specific features. As a result, the over-averaged global model loses its ability to effectively capture fine-grained, localized features, ultimately hindering its capacity to generalize in multi-label electricity scene classification tasks.
To address these challenges, we propose a federated multi-stage attention neural network for multi-label electricity scene classification (FMMAN). FMMAN explicitly learns the features of label correlation strengths on the clients and addresses over-averaging in federated multi-label electricity scene classification by decoupling the conventional single-stage global aggregation into a multi-stage aggregation. Specifically, during initialization, the server replicates a single randomly initialized global model into K copies and distributes them to the clients, where each client selects one as its local model. A complete cycle of the multi-stage aggregation then proceeds through four phases: (1) Clients dynamically construct label correlation strength graphs based on the encoding dependencies of labels and apply contrastive group-specific consistency regularization against the received K models to capture both local correlation patterns and global consistency. After training, clients send their local models to the server. (2) On the server side, the N received client models are clustered into K groups based on significant parameters (models in the same group have similar parameters), and the models within each group are aggregated to produce K intermediate global models that capture distinct correlation and data patterns; this similarity-based grouping alleviates the over-averaging caused by directly aggregating all N client models. These K models are then redistributed to the N clients for further refinement (at this point, the server holds K global models). (3) Clients refine their local models using the updated K global models following the methods in phase (1), mitigating over-averaging by balancing local correlation patterns and global consistency. (4) After iterative refinement, the server aggregates all client models into a unified global model using a consistency-aware mechanism to suppress over-averaging. This cycle repeats, with the server redistributing K copies of the refined global model in subsequent rounds, maintaining adaptability and coherence across clients. Experiments on multi-label benchmarks demonstrate that FMMAN outperforms commonly used multi-label FL baselines for multi-label electricity scene image classification. The main contributions of this paper are as follows:
(1)
Explicit modeling of label correlation strengths. The paper proposes a novel approach that explicitly captures varying label correlation strengths in federated multi-label classification, which is crucial for multi-label electricity scene classification. Unlike existing frameworks, the proposed method differentiates between strong and weak correlations (e.g., “no safety helmet–no work clothes” vs. “no safety helmet–smoking”) by constructing masked label correlation strength graphs, enhancing model performance by preserving critical dependencies that may otherwise be diluted.
(2)
Multi-stage aggregation to mitigate over-averaging. The paper proposes a multi-stage aggregation process in FMMAN to address the over-averaging of conventional single-stage global aggregation. By clustering client models based on parameter similarities and refining them through multiple stages, the model preserves region-specific features and adapts to local variations in data and label distributions, thereby mitigating over-averaging.
The remainder of this paper is organized as follows: Section 2 reviews related work, including learning methods for FL and multi-label learning. Section 3 introduces the proposed FMMAN model. Section 4 details the experimental setup and validates the effectiveness of FMMAN through experiments. Finally, Section 5 presents the conclusions.

2. Related Works

Federated learning algorithms, such as the widely used FedAvg [1], typically follow a standard procedure involving local training on client devices, uploading the trained client models to a central server, aggregating these models, and broadcasting the aggregated global model back to the clients. However, when data heterogeneity exists, such as differences in client data distributions, aligning local models with the global model becomes significantly more challenging. Data heterogeneity in federated learning is generally addressed by focusing on two main issues: label distribution skew and domain shift.
Federated learning algorithms can be categorized into two primary technical approaches based on their optimization focus: local training optimization algorithms and aggregation strategy optimization algorithms [8,9,10,11,12,13]. Local training optimizers aim to improve model performance by enhancing the client-side training process [9]. Gradient correction algorithms introduce control variables to estimate and adjust update discrepancies, reducing client drift and improving convergence. Model-contrastive algorithms (e.g., MOON) employ model similarity constraints to prevent local overfitting and feature space bias. Loss function-enhanced algorithms add regularization terms to local objectives, limiting parameter deviation from the global model.
In contrast, aggregation strategy optimizers focus on refining server-side parameter fusion [14]. Weighted aggregation algorithms normalize client updates or dynamically adjust contribution weights to mitigate data volume disparities. Personalized aggregation algorithms preserve client-specific parameters, such as batch normalization layers, to enable local adaptation. Asynchronous aggregation algorithms (e.g., FedAsync) allow non-synchronous updates, reducing communication overhead and enhancing scalability. Both approaches exhibit distinct strengths and limitations, making the choice of algorithm dependent on factors such as data heterogeneity, system resources, and security requirements.
In the context of scene recognition tasks, related research includes the ViT-Tiny [15] model for federated learning, which utilizes a transformer variant with 5.7 million parameters and a final feature representation of dimension 192. Additionally, XueSGCN [16] applies scene-based graph convolutional networks to federated multi-label classification.
Label distribution skew occurs when variations in the local datasets across clients lead to model biases towards the majority class in each client’s dataset. This problem is particularly noticeable in cases where clients have imbalanced data. For instance, in medical image classification across hospitals, one hospital may have data on rare diseases, while others focus on more common conditions. This imbalance can cause significant differences in label distributions across clients, sometimes resulting in situations where the label spaces do not overlap at all. This misalignment can severely degrade the performance of the federated model. To address label distribution skew, several federated learning approaches aim to better align local client biases with the global model. For example, FedHybrid [8] introduces a proximal term that limits gradient updates, helping to improve convergence in the presence of label distribution skew. Similarly, SCAFFOLD [9] uses a control variable for each client based on gradient difference measurements, which helps correct local biases. More recently, FedDC [10] has adopted the Expectation–Maximization (EM) algorithm to track and address discrepancies in local models by learning an auxiliary local bias variable, further aligning the local models with the global one. Additionally, techniques for multi-label learning, such as multi-label learning methods incorporating label-specific features [4,11], wrapped multi-label learning [12], and multi-dimensional multi-label classification [13], have shown promise in addressing both label distribution skew and domain shift in federated learning environments.
Despite the aforementioned related studies, current federated learning (FL) frameworks encounter two key challenges in multi-label electricity scene classification: (1) the correlations between labels and their varying strengths have a significant impact on classification performance, and (2) there are distributional inconsistencies in electricity scene data and labels across different regions. Existing FL frameworks do not explicitly model the strengths of label correlations, and regionally trained models naturally capture these differences, causing variations in model parameters across regions. In this context, the standard single-stage aggregation used by the server often leads to over-averaging of the global model’s parameters, which diminishes its ability to discriminate effectively. To overcome these challenges, we introduce FMMAN, a federated multi-stage attention neural network designed specifically for multi-label electricity scene classification.

3. Proposed Model

3.1. Federated Learning Problem Description

Federated learning is a decentralized approach to machine learning in which multiple clients collaborate on training a shared model without sharing their raw data. In this study, N clients participate in each communication round, each with its own private dataset, denoted as D = {D1, D2, …, DN}. Each client’s task is a multi-label classification problem, where a data point x is labeled as y = [y1, y2, …, yC], with yi = 1 indicating the presence of the i-th class and yi = 0 indicating its absence. In this setting, local clients are responsible for predicting the presence or absence of each class within the data. The overarching goal of the federated learning process is to generate a globally aggregated model that effectively handles multi-label classification:
$$W = \underset{W}{\arg\min}\ \sum_{i=1}^{N} \frac{M_i}{M} L_i(W), \qquad (1)$$
where Li is the loss of the i-th local model, Mi is the number of training samples on client i, and M = ΣiMi is the total number of samples. In multi-label federated learning, both clients and the server operate within the same label space, which consists of a total of C categories. However, the distribution of these labels can vary significantly across different clients.
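As a minimal sketch of the weighted aggregation objective in Formula (1), assuming client models are exchanged as PyTorch state dicts (the function name and interface are illustrative, not the released implementation):

```python
import torch

def weighted_average(client_states, client_sizes):
    """Sample-size-weighted aggregation in the spirit of Formula (1):
    each client i contributes with weight M_i / M, where M_i is its local
    sample count and M the total across clients."""
    total = float(sum(client_sizes))
    aggregated = {}
    for name in client_states[0]:
        aggregated[name] = sum(
            (size / total) * state[name].float()
            for state, size in zip(client_states, client_sizes)
        )
    return aggregated
```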

3.2. The Details of FMMAN for Learning Label Correlation Strengths on the Clients

In this paper, we propose FMMAN, a federated multi-stage attention neural network designed for multi-label electricity scene classification. The key contributions of FMMAN are in label correlation learning and the progressive model aggregation process. The framework divides the client–server interaction into multiple stages: (1) Clients initially train models locally, encoding features and label correlation strengths after receiving the server’s base model. (2) The server clusters the locally trained models into K groups to ensure greater parameter consistency within each group, then generates K prototype models through intra-group aggregation, minimizing the risk of over-averaging. These K models are then sent back to the clients. (3) Clients further refine their models by applying contrastive group-specific consistency regularization with the K prototypes, further reducing over-averaging, and return the updated models to the server. (4) Finally, the server aggregates the refined models into a global model.
Each client has a local model, and it receives K global models during the interaction with the server. The model structure of FMMAN is shown in Figure 1:
After receiving the K models from the server, each client dynamically constructs label correlation strength graphs based on the encoding dependencies of labels and applies contrastive group-specific consistency regularization against the received K models to capture both local correlation patterns and global consistency. On client devices, FMMAN maintains three components: a state embedding module; a label embedding module, designed to learn the strengths of label correlations; and a contrastive group-specific consistency regularization, designed to alleviate over-averaging based on the learned label correlation strengths.

3.2.1. The State Embedding Module

Following FedLGT [7], the data features are extracted using ResNet [17], while the labels are represented through a dual embedding system comprising label embeddings (L) and state embeddings (S). The state embeddings S = {s1, s2, …, sC} (with each sc ∈ Rd) serve as tokens that encode the presence or absence of labels, distinguishing between three possible states: unknown, positive, and negative, represented by encoded token values of −1, 1, and 0, respectively. Only the unknown state contributes to the model’s loss during training, emphasizing its role in guiding the learning process. In the state embedding module, FMMAN utilizes an instance-level state masking mechanism to strengthen the global models’ capacity to learn label correlations tailored to the client’s specific data. This mechanism retains the classes that the global model fails to recognize correctly, along with their semantically related category labels, while masking other irrelevant label information to create a mask vector.
Specifically, during the t-th training round, for a given data sample, each client utilizes the K global models and its local model to generate K + 1 prediction vectors {Pk = {pk1, pk2, …, pkC}, k = 1, …, K + 1}. If the prediction probability pkc for the c-th class is uncertain, the labels correlated with the c-th class, {pi, …, pj}, are also considered learnable correlation labels, and their corresponding state embeddings sc are modified to reflect an unknown state in an instance-level state masking vector; otherwise, the state embedding remains unchanged. (During the initialization of the local model, the components of this state vector are randomly initialized to either 0 or 1; in iterative training, FMMAN generates K + 1 instance-level state masking vectors {Sk, k = 1, …, K + 1} for the received K global models and the local model.) This adjustment ensures that the model focuses on uncertain predictions, enhancing its ability to learn the label correlations within them. Based on these predictions, the K + 1 state embeddings are calibrated to produce {Sk′ = {sk1′, sk2′, …, skC′}, k = 1, …, K + 1}, where each skc′ is defined as
$$s_{kc}' = \begin{cases} -1, & \tau - \varepsilon < p_{kc} < \tau + \varepsilon \\ s_{kc}, & \text{otherwise}, \end{cases} \qquad (2)$$
where τ denotes the threshold (we use 0.5, as in most multi-label works) and ε denotes the uncertainty margin.
This vector is then applied to mask the label embedding features so as to learn label correlations. In the label embedding module, each label is mapped into a masked embedding vector using the masked state vectors Sk′, from which a label correlation strength graph is constructed to learn the strengths of label correlations.
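For illustration, a minimal sketch of the instance-level state masking of Formula (2) follows; the function name is hypothetical, and the additional masking of labels correlated with an uncertain class is noted in a comment but omitted for brevity.

```python
import torch

TAU, EPS = 0.5, 0.015  # threshold tau and uncertainty margin epsilon used in the paper

def calibrate_state_tokens(pred_probs, state_tokens):
    """Instance-level state masking (Formula (2)): predictions falling inside
    (tau - eps, tau + eps) are uncertain, so the corresponding state tokens
    are reset to the 'unknown' value -1; all others are left unchanged.
    In FMMAN, labels correlated with an uncertain class would also be set
    to unknown, which this sketch omits.
    pred_probs: (C,) predicted probabilities; state_tokens: (C,) tokens."""
    uncertain = (pred_probs > TAU - EPS) & (pred_probs < TAU + EPS)
    return torch.where(uncertain, torch.full_like(state_tokens, -1.0), state_tokens)
```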

3.2.2. The Label Embedding Module for Learning the Strength of Label Correlations

The labels are represented through an embedding system of label embeddings (L). Specifically, the label embeddings are defined as L = {l1, l2, …, lC}, where each lc ∈ Rd corresponds to the c-th class label, with d denoting the embedding dimension. Our framework operates through two key phases: first, we convert label information into textual representations to generate label-specific embedding features; subsequently, we construct a masked label embedding graph by integrating these text-derived features with the state embeddings. To achieve this, we employ the CLIP vision-language model [18] as our foundational architecture, specifically utilizing its frozen text encoder to produce stable, pre-aligned label embeddings. All client models use the same label embeddings L. The K + 1 state embeddings are then combined with these label embeddings to form the masked label embeddings
$$\tilde{L}_k = L * S_k', \quad \text{in which} \quad \tilde{l}_{kc} = l_c * s_{kc}', \quad k = 1, \ldots, K+1, \qquad (3)$$
where * denotes the componentwise product. Thus, the masked label embeddings can be formulated as $\tilde{L}_k = \{\tilde{l}_{k1}, \ldots, \tilde{l}_{kC}\}$, k = 1, …, K + 1. To model these label correlations, FMMAN constructs a masked label correlation graph for each embedding (each client keeps K + 1 models and therefore generates K + 1 masked label correlation graphs) to model the strength of label correlations; each edge corresponds to the similarity between its associated nodes, calculated as
$$edge_{i,j}^{k} = \mathrm{Cosine}\left(\tilde{l}_{ki}, \tilde{l}_{kj}\right), \qquad (4)$$
where Cosine(·) is the cosine similarity between embeddings. For the correlation between embeddings i and j in the k-th model, FMMAN uses cosine similarity, where $\tilde{l}_{ki}$ is the constructed masked label embedding feature. These edges, after being regularized, are used to construct the label correlation adjacency matrix A, whose entries represent the regularized similarity values between nodes.
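A minimal sketch of the masked label embedding graph of Formulas (3) and (4), assuming PyTorch tensors; since the paper does not specify the edge regularization, the row-softmax used here is our assumption.

```python
import torch
import torch.nn.functional as F

def build_masked_label_graph(label_emb, state_emb):
    """Masked label embeddings (Formula (3)) and correlation edges (Formula (4)).
    label_emb: (C, d) frozen CLIP text embeddings; state_emb: (C, d) calibrated
    state embeddings. The componentwise product masks each label embedding,
    and pairwise cosine similarity gives the edge weights."""
    masked = label_emb * state_emb                 # l~_kc = l_c * s'_kc
    normed = F.normalize(masked, dim=1)
    edges = normed @ normed.t()                    # Cosine(l~_ki, l~_kj)
    adjacency = F.softmax(edges, dim=1)            # regularized edges (assumed row-softmax)
    return masked, adjacency
```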
To capture label correlation strength in the masked label embedding graph, we develop a graph attention autoencoder to extract the correlation features. Specifically, FMMAN uses the masked label embedding graph as input for the graph attention autoencoder.
① During the encoding process, FMMAN employs a Cosine attention mechanism to extract label correlation features, represented as parameterized Gaussian distributions with diagonal covariance matrices. This encoding process contains three key components: a similarity function, attention coefficients, and the creation of attention features. The similarity function evaluates the connections between the features derived from masked label embeddings. To address the varying influence of different similarities across inputs, the FMMAN framework employs a Cosine similarity function, which effectively measures the relationships between these features. This function measures the similarity between two input vectors as Formula (5)
$$Similarity_{ij}^{k} = \mathrm{SLP}\left(\mathrm{Cosine}\left(A^{k} W^{k} \tilde{l}_{ki},\ A^{k} W^{k} \tilde{l}_{kj}\right)\right), \qquad (5)$$
where Ak is the adjacency matrix of the k-th masked label embedding graph, Wk is a learnable weight matrix, and SLP(·) is a single-layer neural network. The cosine similarity function measures the angle between vectors. Based on this function, the attention coefficients are defined as Formula (6)
$$\alpha_{ij}^{k} = \mathrm{softmax}\left(Similarity_{ij}^{k}\right) = \frac{\exp\left(Similarity_{ij}^{k}\right)}{\sum_{j} \exp\left(Similarity_{ij}^{k}\right)}, \qquad (6)$$
where $\alpha_{ij}^{k}$ are the attention coefficients. Thus, the attention feature for the current embedding i can be expressed as Formula (7)
$$feature_{i}^{k} = \mathcal{N}\left(\sum_{j} \alpha_{ij}^{k} V^{k} \tilde{l}_{kj},\ \sum_{j} \alpha_{ij}^{k} U^{k} \tilde{l}_{kj}\right) = \sum_{j} \alpha_{ij}^{k} V^{k} \tilde{l}_{kj} + \delta \cdot \sum_{j} \alpha_{ij}^{k} U^{k} \tilde{l}_{kj}, \quad \delta \sim \mathcal{N}(0, 1), \qquad (7)$$
where Vk and Uk are learnable weights. Using Formulas (5)–(7), the Gaussian graph attention autoencoder encodes the label correlation as embedding features.
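To make the encoding of Formulas (5)–(7) concrete, here is a sketch of one cosine-attention layer with Gaussian reparameterization; the layer shapes and the use of nn.Linear for the projections Wk, Vk, Uk and the single-layer scorer SLP(·) are our assumptions about unspecified details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianGraphAttention(nn.Module):
    """One cosine-attention encoding layer in the spirit of Formulas (5)-(7):
    similarity scores come from the cosine similarity of projected,
    adjacency-mixed node features, passed through a single-layer network;
    the output is a Gaussian sample obtained via reparameterization."""
    def __init__(self, d):
        super().__init__()
        self.W = nn.Linear(d, d, bias=False)  # projection W^k inside the similarity
        self.V = nn.Linear(d, d, bias=False)  # mean head V^k
        self.U = nn.Linear(d, d, bias=False)  # scale head U^k
        self.slp = nn.Linear(1, 1)            # single-layer network SLP(.)

    def forward(self, x, adj):
        # x: (C, d) masked label embeddings; adj: (C, C) correlation adjacency A^k
        h = self.W(adj @ x)                                   # A^k W^k l~_k
        cos = F.normalize(h, dim=1) @ F.normalize(h, dim=1).t()
        scores = self.slp(cos.unsqueeze(-1)).squeeze(-1)      # Formula (5)
        alpha = F.softmax(scores, dim=1)                      # Formula (6)
        mean = alpha @ self.V(x)                              # Gaussian mean
        scale = alpha @ self.U(x)                             # Gaussian scale
        return mean + torch.randn_like(scale) * scale         # Formula (7), delta ~ N(0, 1)
```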
② The decoding process of the Gaussian graph attention autoencoder is carried out through two standard attention layers. For training, the Gaussian graph attention autoencoder incorporates a reconstruction loss based on the masked label embedding graph. Its loss function is expressed as Formula (8)
$$Loss_{ae} = \frac{1}{N} \sum_{k=1}^{K+1} \sum_{n} \left\| X_n - Recon_{n}^{k} \right\|^{2} + \sum_{i,j} \left[ \mathrm{cosine}\left(feature_{i}^{k}, feature_{j}^{k}\right) - A_{i,j}^{k} \right]^{2}, \qquad (8)$$
where Xn is the input of the current client and Reconnk is the reconstructed data from the Gaussian graph attention autoencoder of the k-th model. The second term in Formula (8) is a feature consistency term that aligns pairwise feature similarities with the adjacency structure.
Based on this Gaussian graph attention autoencoder, the features extracted by the encoder are converted into vectors F, which serve as input to the transformer using Formula (4). These features are integrated into the backbone model by concatenating the scene features Fscene with the encoded label correlation embeddings, forming a comprehensive input that is subsequently fed into the backbone model. Thus, one has
$$Output_{n}^{k} = w^{k}\left(\mathrm{concat}\left(F_{scene,n}, F_{n}^{k}\right)\right), \qquad (9)$$
where Fscene,n denotes the scene features, Fnk the encoded label correlation strength embedding features of the k-th model, and Outputnk the predicted logits (wk denotes the multi-label classification network). The classification loss is expressed using the cross-entropy loss:
$$Loss_{cly} = \sum_{k=1}^{K+1} \sum_{n} \mathrm{CrossEntropy}\left(Output_{n}^{k}, Label_{n}\right). \qquad (10)$$
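A sketch of the concatenation and classification step of Formulas (9) and (10); the linear classifier and the binary form of the multi-label cross-entropy are assumptions about details the paper leaves unspecified.

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Sketch of Formulas (9)-(10): scene features F_scene are concatenated
    with the encoded label-correlation features F and mapped to C logits."""
    def __init__(self, d_scene, d_corr, num_classes):
        super().__init__()
        self.classifier = nn.Linear(d_scene + d_corr, num_classes)  # w^k in Formula (9)
        self.criterion = nn.BCEWithLogitsLoss()  # multi-label form of the cross-entropy

    def forward(self, f_scene, f_corr, labels=None):
        logits = self.classifier(torch.cat([f_scene, f_corr], dim=-1))
        if labels is None:
            return logits
        return logits, self.criterion(logits, labels.float())
```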

3.2.3. Contrastive Group-Specific Consistency Regularization for Alleviating Over-Averaging

To ensure the consistency of model features on the client side and to alleviate the model parameter inconsistency caused by data heterogeneity, we introduce contrastive group-specific consistency regularization into the loss function. This mitigates parameter inconsistency and helps reduce model over-averaging on the client side. To achieve this, FMMAN designs a contrastive group-specific consistency regularization over the client’s local model and the K global models it receives:
$$Loss_{contrast} = \sum_{i=1}^{K} \left[ -\log \frac{\exp\left(\mathrm{Cosine}\left(z_{local}, z_i\right)\right)}{\sum_{j=1}^{K} \exp\left(\mathrm{Cosine}\left(z_{local}, z_j\right)\right)} + \mu \left\| \theta_{local} - \theta_i \right\| \right]. \qquad (11)$$
Here, z represents the features of a model; in this paper, we use the scene features Fscene embedded with label correlations. zlocal refers to the features of the client’s local model, and zi to the features of the i-th global model. All K received global models are used in this group-specific consistency regularization. The last term in Formula (11) is the consistency regularization of the model parameters: θlocal denotes the parameters of the local model, and θi the parameters of the i-th global model. In the initial stage, θi is set to θlocal.
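For concreteness, a minimal sketch of Formula (11); representing model parameters as flattened vectors and the exact feature shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_consistency_loss(z_local, z_globals, theta_local, theta_globals, mu=0.01):
    """Contrastive group-specific consistency regularization (Formula (11)).
    z_local: (d,) feature of the local model; z_globals: list of K features
    from the received global models; theta_*: flattened parameter vectors."""
    sims = torch.stack([F.cosine_similarity(z_local, z_g, dim=0) for z_g in z_globals])
    log_probs = torch.log_softmax(sims, dim=0)     # contrastive term over the K prototypes
    loss = -log_probs.sum()
    for theta_g in theta_globals:                  # parameter-consistency term
        loss = loss + mu * torch.norm(theta_local - theta_g)
    return loss
```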
Therefore, the total loss of the model in clients can be denoted as
$$Loss = Loss_{cly} + Loss_{ae} + \mu\, Loss_{contrast}. \qquad (12)$$

3.3. Multi-Stage Cluster-Driven Global Aggregation with Consistency-Aware Mechanism on Server

One primary objective of the federated learning process is to generate a globally aggregated model that effectively performs multi-label classification. However, since different clients have distinct label correlations, the models trained on these clients may prioritize these correlations differently. As a result, mismatches in label correlations across clients can lead to inaccurate multi-label predictions, negatively affecting the overall performance of the federated learning system.
To resolve this, FMMAN explicitly learns the features of label correlation strengths on the clients and addresses over-averaging in federated multi-label electricity scene classification by decoupling the conventional single-stage global aggregation into a multi-stage aggregation. Specifically, in traditional federated learning (FL), the client–server interaction occurs in a single step: the server distributes the global model to the clients, each client trains the global model on its local data to generate a local model and sends it back to the server, and the server aggregates these models into an updated global model, completing one iteration. In FMMAN, however, to enable the global model to learn inconsistent label correlation features, the interaction is decoupled into a multi-stage process, shown in Figure 2. Specifically, it consists of the following steps: (1) Clients build label correlation graphs and use contrastive regularization with the K models to capture local and global patterns, then send their local models to the server. (2) The server clusters the N client models into K groups, aggregates the models within each group to form K global models, and redistributes them for further refinement. (3) Clients refine their models using the updated global models, balancing local and global consistency to prevent over-averaging. (4) After iterative refinement, the server aggregates the client models into a unified global model, suppressing over-averaging, and redistributes it for subsequent rounds (steps ⑤ and ⑥ in Figure 2).
(1)
After initialization, clients dynamically construct label correlation strength graphs and apply contrastive group-specific consistency regularization based on the received K models to capture both local correlation patterns and global consistency, following the methods in Section 3.2. Then, all N clients send their local models to the server.
(2)
On the server side, the N received client models are clustered into K clusters (groups) based on significant parameters:
$$N_k = \underset{Para_k}{\arg\min}\ Similarity\left(Para_k\right). \qquad (13)$$
It should be emphasized that, given the large number of model parameters, instead of using all parameters from every local model for clustering, we select a subset of representative parameters. The representative vectors for clustering are the parameters of the first layer of the transformer module in each local model, which directly handle the embedded label correlation features.
The models within each cluster are then aggregated to generate K distinct global models:
$$W_k = \underset{W_k}{\arg\min}\ \sum_{i \in N_k} \frac{M_i}{M} L_i\left(W_k\right). \qquad (14)$$
The k in Formulas (13) and (14) indexes the clusters, k = 1, …, K; a minimal sketch of this server-side grouping step is given below.
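Under the assumption that client models are exchanged as PyTorch state dicts, the following sketch illustrates the grouping step of Formulas (13) and (14): k-means (as specified in Section 4.1) clusters the models by a representative parameter block, and a sample-size-weighted average within each group yields the K intermediate global models. The layer_key selector and function name are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_aggregate(client_states, client_sizes, layer_key, k=3):
    """Cluster the N client models by a representative parameter block
    (the first transformer layer, addressed by `layer_key`), then
    sample-size-weight average within each cluster to obtain K
    intermediate global models."""
    reps = np.stack([s[layer_key].cpu().numpy().ravel() for s in client_states])
    groups = KMeans(n_clusters=k, n_init=10).fit_predict(reps)
    prototypes = []
    for g in range(k):
        members = [i for i, grp in enumerate(groups) if grp == g]
        total = float(sum(client_sizes[i] for i in members))
        prototypes.append({
            name: sum((client_sizes[i] / total) * client_states[i][name].float()
                      for i in members)
            for name in client_states[0]
        })
    return prototypes, groups
```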
(3)
These K global models are redistributed to all N clients, where each client trains them locally (using the same client-side learning method) based on Formula (12). This training produces a refined local model that learns different local correlations and data features and is subsequently returned to the server.
(4)
Finally, the server aggregates all N models into a unified global model, completing one full interaction cycle. Specifically, the server first computes the loss values of the client models transmitted to it. Then, for these N models, each model’s contribution to the aggregation is weighted by the proportion of its loss in the total loss of the N models. This aggregation process is expressed as follows:
$$W = \underset{W}{\arg\min}\ \sum_{i=1}^{N} \frac{loss_i}{\sum_{j} loss_j} \cdot \frac{M_i}{M} L_i(W). \qquad (15)$$
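The final consistency-aware aggregation of Formula (15) could then be sketched as follows; combining the loss share with the data share Mi/M and renormalizing the weights reflects our reading of the formula rather than a released implementation.

```python
def consistency_aware_aggregate(client_states, client_sizes, client_losses):
    """Final server aggregation (Formula (15)): each client's contribution is
    weighted by its loss share loss_i / sum_j loss_j together with its data
    share M_i / M, then the weights are renormalized to sum to one."""
    total_size = float(sum(client_sizes))
    total_loss = float(sum(client_losses))
    weights = [(l / total_loss) * (m / total_size)
               for l, m in zip(client_losses, client_sizes)]
    norm = sum(weights)
    weights = [w / norm for w in weights]
    return {
        name: sum(w * state[name].float()
                  for w, state in zip(weights, client_states))
        for name in client_states[0]
    }
```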
Next, we analyze the effectiveness of the proposed FMMAN based on experiments.

4. Experiments

To verify the effectiveness of our proposed learning framework, we evaluate it on four datasets: two public benchmark datasets (FLAIR [19] and FL-PASCAL VOC [20]) for comparative analysis and generalizability validation, and two domain-specific electricity scene image classification datasets, the Substation Defect Detection Dataset and the Substation Control Cabinet Condition Monitoring Dataset. To ensure compatibility with FL workflows and privacy preservation, the two electricity scene image datasets are adapted for federated learning (FL) by partitioning them into decentralized, client-specific subsets through artificial non-IID splits, replicating real-world data distribution scenarios. For a fair comparison, we adopt the same FL dataset partitioning method as used in Reference [7].
The experiments are organized as follows: First, we provide an overview of the datasets. Next, we conduct comparative experiments to validate the efficacy of the proposed FMMAN framework. Finally, ablation studies are performed to analyze the contributions of the label correlation embedding graphs and the adaptive aggregation strategy to the global model’s performance.

4.1. Datasets and Parameters Setting

(1) FLAIR is a large-scale multi-label FL dataset containing a wide variety of photos collected from real users on Flickr. FLAIR provides real-user data partitions, with each input image at 256 pixels × 256 pixels. Thus, FLAIR naturally captures various non-IID characteristics, including quantity skew (i.e., users have different numbers of samples), label distribution skew, and domain shift, leading to a more challenging scenario for FL. FLAIR is defined using a two-level hierarchy: one task is coarse-grained with 17 categories, and the other is fine-grained with 1628 categories. In this paper, we only perform classification on the coarse-grained labels. This dataset is available on GitHub (apple/ml-flair: a large labeled image dataset for benchmarking in federated learning) (accessed on 20 April 2022).
(2) Substation Defect Detection Dataset contains 8307 images capturing substation equipment under diverse real-world conditions, annotated with 17 defect labels. These include component-level failures (e.g., blurred dial, cracked insulator, damaged cover plate), safety violations (absence of safety helmet/workwear, smoking), and system-level anomalies (abnormal system operating status, switchgear equipment anomalies). This dataset and its details are available on https://pan.baidu.com/s/1qCIGlCi54AwY0b_qX9sG2A?pwd=cuth (accessed on 3 April 2025).
(3) Substation Control Cabinet Condition Monitoring Dataset comprises 3000 annotated images spanning 16 distinct categories, with labels including green-green-off, green-green-red, platen-off, platen-on, platen-on-half, red, red-green, red-red, red-red-off, switch-center, switch-center-half, switch-left, switch-right, transformer, transformer-on, and transformer-on-half, designed to capture multi-state operational conditions (e.g., switch positions, transformer statuses, and indicator light combinations) for predictive maintenance and anomaly detection in power substation control systems. This dataset and its details are available on https://pan.baidu.com/s/1vKIp4EwjTsD4vL7c2DVbYg?pwd=gmup (accessed on 3 April 2025).
(4) FL-PASCAL VOC [20] is a widely used benchmark in computer vision, comprising 9963 natural scene images with variable resolutions, typically around 300 pixels × 500 pixels. It covers 20 object categories (e.g., humans, animals, vehicles) and contains approximately 24,640 annotated object instances. Each image includes XML-formatted bounding box annotations and category labels, with a subset also providing pixel-level segmentation masks. To facilitate comparison, the federated settings of this dataset are aligned with Reference [7]. This dataset and its details are available on http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ (accessed on 26 March 2007).
For evaluation, we use macro- and micro-averaged precision (AP), precision, recall, and F1 scores. These metrics offer a thorough evaluation of the model’s performance, capturing both individual class performance and overall effectiveness in multi-label classification tasks. For all the datasets mentioned, we utilize ResNet-18 [17] as the backbone network for vision-related tasks. To create universal label embeddings, we generate prompt label text, which is fed into the text encoder in CLIP [18]. Following the setup in Reference [7], we set the threshold τ to 0.5 and the uncertainty margin ε to 0.015 for the CA-MLE mechanism. In each round of local training, we run 10 epochs of optimization using the Adam optimizer, with a learning rate of 5 × 10−4 and a batch size of 16; the parameter μ in the group-specific consistency regularization is 0.01. In the federated learning (FL) setup, the number of communication rounds (T) is fixed at 50 per clustering operation on the server. The fraction of active clients in each round is set to ensure that 20 clients participate, maintaining a data distribution representative of the whole population. The server starts with three clusters (K = 3) during the first clustering, and clustering is performed using the k-means algorithm. All experiments are conducted using PyTorch 2.0 and trained on a single NVIDIA RTX 4090 GPU.

4.2. Comparison of Experiments

To verify the effectiveness of the proposed FMMAN, we compare it with several widely adopted models in the field of federated learning, including both convolutional deep learning models and transformer-based architectures, which represent two key approaches in modern deep learning. First, we compare it with the AvgFL algorithm introduced in [1], which applies an averaging strategy for federated learning: local models are trained on each client using its local data, and the central server aggregates the updated models by iteratively averaging their parameters. While straightforward, this method can struggle with data heterogeneity. Next, we evaluate the MOON framework, proposed in [21], which addresses data heterogeneity by adding a model-contrastive term to each client’s local objective, aligning local representations more closely with the global model and potentially enhancing convergence in federated learning settings. Additionally, we compare it with FedLGT, a state-of-the-art (SOTA) multi-label classification model in federated learning. FedLGT is specifically designed for multi-label classification tasks in federated environments, and analyzing its architecture and techniques provides valuable insights into its performance.
Model details are listed as follows:
  • AvgFL-MLP-Mixer: A FL model using MLP-Mixer in clients for training. The MLP-Mixer [22] is introduced as an alternative computer vision model to CNNs and Vision Transformers (ViT) to reduce the computational time.
  • AvgFL-ConvMixer: An AvgFL model using ConvMixer in clients for training. The ConvMixer architecture builds upon the MLP-Mixer by including channel and token mixing mechanisms to process channel and spatial features.
  • AvgFL-PoolFormer [2]: An AvgFL model using PoolFormer in clients for training. The PoolFormer architecture replaces the attention-based token mixer module by the simple average pooling operation as a token mixer.
  • AvgFL-ResNet-50: An AvgFL model using ResNet-50 in clients for training.
  • FL-C_Tran [5]: An FL model using C_Tran in clients for training.
  • FedCor [23]: Correlation-Based Active Client Selection Strategy for Heterogeneous Federated Learning.
  • MOON-ConvMixer: A MOON FL model using ConvMixer in clients for training.
  • MOON-PoolFormer [2]: A MOON FL model using PoolFormer in clients for training.
  • MOON-ResNet-50: A MOON FL model using ResNet-50 in clients for training.
  • FedLGT [7]: FedLGT serves as a customized model update technique while exploiting the label correlations at each client.
  • FedHybrid [8]: FedHybrid introduces a proximal term that limits gradient updates, helping to improve convergence in the presence of label distribution skew.
  • ViT-Tiny [15]: A federated learning model using the ViT-Tiny transformer variant with 5.7 million parameters; the final feature representation has a dimension of 192.
  • XueSGCN [16]: Scene-based Graph Convolutional Networks for Federated Multi-Label Classification.

4.2.1. Comparison of Experiments on FLAIR Dataset and FL-PASCAL VOC

First, we test the model’s performance on the FLAIR dataset and FL-PASCAL VOC. The results are shown in Table 1 and Table 2.
From Table 1, FMMAN demonstrates its effectiveness on the FLAIR dataset through federated learning, outperforming the other algorithms across most metrics. With the highest APs, FMMAN exhibits the best classification performance. Furthermore, FMMAN achieves the highest F1 scores, showcasing a balance between precision and recall. Compared to other algorithms such as the AvgFL and MOON series, as well as FedCor and XueSGCN, FMMAN consistently delivers better results, highlighting its effectiveness in handling data heterogeneity and label correlations.
From Table 2, FMMAN demonstrates exceptional performance in the FL-PASCAL VOC dataset through federated learning, outperforming other algorithms across all key metrics. This makes FMMAN the most effective model for federated learning on the FL-PASCAL VOC dataset, offering significant advantages in AP and F1 scores.
Next, we present the test results of the proposed FMMAN alongside the compared models, focusing on precision and recall. The test results for the various models across the datasets are illustrated in Figure 3.
As shown in Figure 3, the proposed FMMAN demonstrates exceptional precision and recall results on the FLAIR dataset through federated learning, consistently outperforming the other algorithms. Furthermore, FMMAN surpasses even the state-of-the-art FedLGT and FedHybrid algorithms.
The above experiments indicate that FMMAN is generic in handling data heterogeneity and label correlations for multi-label image classification under the FL framework.

4.2.2. Comparison of Experiments on Substation Defect Detection Dataset

In the next experiment, we verify the effectiveness of the proposed FMMAN on the Substation Defect Detection dataset. The results are shown in Table 3.
FMMAN shows exceptional performance on the Substation Defect Detection dataset through federated learning, outperforming the other algorithms across most key metrics. With the highest F1 scores, FMMAN exhibits better classification results than models such as the AvgFL and MOON series, FL-C_Tran, and FedLGT. However, we also observe that the FMMAN model shows significant differences in performance between the Macro and Micro metrics. This is due to the class imbalance in the Substation Defect Detection dataset, as well as the imbalance in label correlations. As a result, the model performs poorly on certain categories in Macro metrics such as the F1 score, which leads to a decline in the Macro metrics. The F1 score of each class is shown in Figure 4.
Next, we present the test results for the proposed FMMAN, comparing its performance with other models, with a particular focus on precision and recall. The test precision and recall values for the various models on the Substation Defect Detection dataset are shown in Figure 5.
As illustrated in Figure 5, the proposed FMMAN algorithm achieves higher precision and recall values on the Substation Defect Detection dataset in a federated learning setting, outperforming the other algorithms. In addition, we can see that FMMAN outperforms FedLGT, and FedLGT performs significantly better than the other models. This indicates that incorporating label correlation strength information into the model can effectively improve the performance of federated learning models on the Substation Defect Detection dataset.

4.2.3. Comparison of Experiments on Substation Control Cabinet Condition Monitoring Dataset

In the next experiment, we verify the effectiveness of the proposed FMMAN on the Substation Control Cabinet Condition Monitoring dataset. The results are shown in Table 4.
As shown in Table 4, the proposed FMMAN achieves the best testing metrics on the Substation Control Cabinet Condition Monitoring dataset, which indicates that the label correlation learning module and the grouping clustering method designed in this paper are effective. It is worth noting that the Macro and Micro metrics still show a significant gap in the Substation Control Cabinet Condition Monitoring dataset.
Next, we present the test results for the proposed FMMAN, comparing its performance with other models, with a particular focus on precision and recall. The test precision and recall values for the various models on the Control Cabinet Condition Monitoring dataset are shown in Figure 6.
As illustrated in Figure 6, the proposed FMMAN algorithm achieves the best precision and recall results on the Control Cabinet Condition Monitoring dataset.

4.3. Discussion

The experimental results presented in this study robustly demonstrate the effectiveness of the proposed FMMAN model for multi-label image classification in federated learning scenarios. Across a diverse range of datasets, FMMAN consistently outperforms the existing state-of-the-art models. This includes both publicly available benchmarks like FLAIR and FL-PASCAL VOC, as well as domain-specific electricity scene image classification datasets such as Substation Defect Detection and Substation Control Cabinet Condition Monitoring.
On the FLAIR dataset, which is characterized by complex data heterogeneity and label correlations, FMMAN achieves the highest macro-AP and macro-F1 scores. These metrics highlight FMMAN’s capability to effectively handle the non-IID characteristics present in real-world federated learning settings, including quantity skew, label distribution skew, and domain shift. The superior performance indicates that FMMAN is well-suited for addressing the challenges posed by heterogeneous data distributions.
Similarly, on the FL-PASCAL VOC dataset, FMMAN surpasses other models across all key metrics, including macro-AP, macro-F1, micro-AP, and micro-F1. This demonstrates FMMAN’s generalizability and robustness, indicating that its performance advantages are not limited to a specific dataset or domain but extend to a wide range of multi-label image classification tasks in federated learning.
In the domain-specific electricity scene image classification datasets, FMMAN’s effectiveness is further validated. On the Substation Defect Detection dataset, despite the challenges posed by class imbalance and label correlation imbalance, FMMAN still achieves the highest micro-AP and micro-F1 scores. This showcases FMMAN’s ability to learn from imbalanced data distributions, which is crucial for real-world applications where data is often unevenly distributed across classes.
On the Substation Control Cabinet Condition Monitoring dataset, FMMAN also achieves the best results across all metrics, underscoring the effectiveness of its label correlation learning module and grouping clustering method. These components enable FMMAN to better capture and leverage label correlations, leading to improved performance in multi-label classification tasks.

4.4. Ablation Experiments

In this section, we perform ablation studies on the Substation Defect Detection dataset to highlight the contributions of the label correlation embedding graphs and the aggregation strategy for the FMMAN. The results are shown in Table 5 and Figure 7.
In Table 5, FMMAN_without_graph is an FMMAN variant without graph structure learning, and FMMAN_without_mask is an FMMAN variant that does not use the state mask vector for training on the clients. In the ablation studies, FMMAN_without_mask achieves the worst results, which means that the masking process is important for learning the label correlations. FedLGT-MS is FedLGT using multi-step aggregation, and FMMAN_without_contrastive is an FMMAN variant that does not use the contrastive group-specific consistency regularization. Moreover, as Table 5 shows, FMMAN_without_graph performs better than FMMAN_without_mask and FedLGT, but FMMAN performs better than FMMAN_without_graph, which means that the graph embedding is effective for improving the classification results. Next, we show the results for precision and recall values.
As Table 5 and Figure 7 show, FMMAN achieves the best test results, which verifies the effectiveness of the designed label correlation embedding graphs. Next, we show the effectiveness of the aggregation strategy for FMMAN. The results are shown in Table 6 and Figure 8.
In Table 6, FMMAN_without is an FMMAN variant without the k-means aggregation strategy. In the ablation studies, FMMAN performs better than FMMAN_without, which means that the aggregation strategy is effective. Next, we show the results for precision and recall values.
As Table 6 and Figure 8 show, FMMAN achieves the best test results, which verifies the effectiveness of the aggregation strategy for FMMAN.

5. Conclusions

In this paper, we propose FMMAN, a federated multi-stage attention neural network for multi-label electricity scene classification in smart grid systems. By decoupling conventional single-stage global aggregation into a hierarchical, label correlation-embedded process, FMMAN addresses the critical challenges of parameter over-averaging and inconsistent label correlation modeling in federated multi-label scenarios. The framework introduces three key innovations: (1) client-specific label correlation strength graphs constructed via masked label embeddings, which explicitly quantify semantic correlations conditioned on local data distributions; (2) a server-side clustering mechanism that groups client models by their correlation patterns and aggregates them into intermediate global models, preserving regionally discriminative features; (3) dynamic cross-cluster knowledge fusion with group-specific consistency regularization, enabling adaptive refinement of underrepresented correlation regimes while mitigating over-averaging. Experiments demonstrate that FMMAN outperforms existing federated multi-label baselines, achieving superior generalization across heterogeneous electricity scene classification tasks. This work provides a solution for maintaining label correlations in federated smart grid monitoring systems, where preserving region-specific label dependencies is crucial for operational safety. Future extensions will explore temporal correlation dynamics in evolving power infrastructure scenarios.
Although the algorithm proposed in this paper outperforms previous algorithms in terms of classification capability, it also introduces new challenges. To mitigate the over-averaging problem, we modified the original single-step aggregation process into a multi-step aggregation process and introduced additional models (K = 3 in this paper) during communication between the server and clients. While these adjustments improve scene recognition capability, they also increase the communication burden of federated learning. When we proposed the FMMAN algorithm, we primarily focused on the context of power scene recognition, where the emphasis is on the model’s classification ability to ensure that abnormal scenes can be detected and identified. The communication burden was not the primary focus, as we had already set up a cloud-edge system platform with high bandwidth. In the future, we will explore new methods to address the communication burden, allowing us to expand our algorithm to regular classification tasks rather than only anomaly detection tasks.

Author Contributions

Conceptualization, L.Z. and K.Z.; methodology, K.Z.; software, L.Z.; validation, X.J., K.Z., C.M. and J.X.; formal analysis, M.W.; investigation, L.G. and D.Z.; resources, L.Z.; data curation, M.W.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z.; visualization, D.Z. and Y.A.; supervision, L.Z.; project administration, X.J.; funding acquisition, K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Project of China Southern Power Grid Co., Ltd. (035900KK52222003).

Data Availability Statement

The data supporting this study’s findings are available from the corresponding author, Kaihong Zheng, upon reasonable request. The URLs of the datasets are also given in this paper.

Conflicts of Interest

Authors L.Z., X.J., J.X., M.W., L.G., C.M. and D.Z. were employed by Hainan Power Grid Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  2. Büyüktaş, B.; Weitzel, K.; Völkers, S.; Zailskas, F.; Demir, B. Transformer-based Federated Learning for Multi-Label Remote Sensing Image Classification. arXiv 2024, arXiv:2405.15405. [Google Scholar]
  3. Zhang, M.L.; Zhou, Z.H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 2013, 26, 1819–1837. [Google Scholar] [CrossRef]
  4. Zhang, J.; Wei, T.; Zhang, M.L. Label-specific time-frequency energy-based neural network for instrument recognition. IEEE Trans. Cybern. 2024, 54, 7080–7093. [Google Scholar] [CrossRef] [PubMed]
  5. Hang, J.Y.; Zhang, M.L. Collaborative learning of label semantics and deep label-specific features for multi-label classification. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9860–9871. [Google Scholar] [CrossRef] [PubMed]
  6. Lanchantin, J.; Wang, T.; Ordonez, V.; Qi, Y. General multi-label image classification with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16478–16488. [Google Scholar]
  7. Liu, I.J.; Lin, C.S.; Yang, F.E.; Wang, Y.C.F. Language-Guided Transformer for Federated Multi-Label Classification. AAAI Conf. Artif. Intell. 2024, 38, 13882–13890. [Google Scholar] [CrossRef]
  8. Niu, X.; Wei, E. FedHybrid: A hybrid federated optimization method for heterogeneous clients. IEEE Trans. Signal Process. 2023, 71, 150–163. [Google Scholar] [CrossRef]
  9. Huang, X.; Li, P.; Li, X. Stochastic controlled averaging for federated learning with communication compression. arXiv 2024, arXiv:2308.08165. [Google Scholar]
  10. Gao, L.; Fu, H.; Li, L.; Chen, Y.; Xu, M.; Xu, C.Z. Feddc: Federated learning with non-iid data via local drift decoupling and correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10112–10121. [Google Scholar]
  11. Li, J.; Zhang, C.; Zhou, J.T.; Fu, H.; Xia, S.; Hu, Q. Deep-LIFT: Deep label-specific feature learning for image annotation. IEEE Trans. Cybern. 2021, 52, 7732–7741. [Google Scholar] [CrossRef] [PubMed]
  12. Yu, Z.B.; Zhang, M.L. Multi-label classification with label-specific feature generation: A wrapped approach. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 5199–5210. [Google Scholar] [CrossRef] [PubMed]
  13. Jia, B.B.; Zhang, M.L. Multi-dimensional multi-label classification: Towards encompassing heterogeneous label spaces and multi-label annotations. Pattern Recognit. 2023, 138, 109357. [Google Scholar] [CrossRef]
  14. Guan, H.; Yap, P.T.; Bozoki, A.; Liu, M. Federated learning for medical image analysis: A survey. Pattern Recognit. 2024, 151, 110424. [Google Scholar] [CrossRef] [PubMed]
  15. Ben Youssef, B.; Alhmidi, L.; Bazi, Y.; Zuair, M. Federated Learning Approach for Remote Sensing Scene Classification. Remote Sens. 2024, 16, 2194. [Google Scholar] [CrossRef]
  16. Xue, S.; Luo, W.; Luo, Y.; Yin, Z.; Gu, J. Scene-based Graph Convolutional Networks for Federated Multi-Label Classification. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–6. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  18. Radford, A. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021. [Google Scholar]
  19. Song, C.; Granqvist, F.; Talwar, K. FLAIR: Federated Learning Annotated Image Repository. arXiv 2022, arXiv:2207.08869. [Google Scholar]
  20. Hoiem, D.; Divvala, S.K.; Hays, J.H. Pascal VOC 2008 challenge. World Lit. Today 2009, 24, 1–4. [Google Scholar]
  21. Li, Q.; He, B.; Song, D. Model-contrastive federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10713–10722. [Google Scholar]
  22. Tolstikhin, I.O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. Mlp-mixer: An all-mlp architecture for vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  23. Tang, M.; Ning, X.; Wang, Y.; Sun, J.; Wang, Y.; Li, H.; Chen, Y. FedCor: Correlation-based active client selection strategy for heterogeneous federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10102–10111. [Google Scholar]
Figure 1. The structure of the proposed FMMAN on local clients (our main contributions are indicated in blue). The upper part shows the collaborative training between the local model and the K global models, and the lower part provides the details of the model. Building upon the architecture of FedLGT [7], FMMAN employs a backbone and a transformer to extract discriminative electricity scene features.
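To make the client-side refinement in Figure 1 concrete, the snippet below gives a minimal sketch of a contrastive group-specific consistency regularizer, assuming it contrasts the flattened parameters of the local model against the K prototype models; the paper's exact formulation may instead operate on features or predictions, and the function names and temperature hyperparameter are illustrative, not taken from the paper.

```python
# A minimal sketch (not the authors' implementation) of contrastive
# group-specific consistency regularization: the local model is pulled toward
# the prototype of its own group and pushed away from the other K-1 prototypes.
import torch
import torch.nn.functional as F

def flatten_params(model: torch.nn.Module) -> torch.Tensor:
    """Concatenate all trainable parameters of a model into one vector."""
    return torch.cat([p.reshape(-1) for p in model.parameters()])

def contrastive_consistency_loss(local_vec: torch.Tensor,
                                 prototype_vecs: torch.Tensor,
                                 own_group: int,
                                 tau: float = 0.5) -> torch.Tensor:
    """InfoNCE-style loss with the own-group prototype as the positive.

    local_vec:      flattened local-model parameters, shape (d,)
    prototype_vecs: flattened parameters of the K prototypes, shape (K, d)
    own_group:      index of the group the client was assigned to
    tau:            temperature (an assumed hyperparameter)
    """
    sims = F.cosine_similarity(local_vec.unsqueeze(0), prototype_vecs, dim=1) / tau
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([own_group]))
```

During refinement, such a term would be added to the usual multi-label binary cross-entropy loss with a weighting coefficient.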
Figure 2. The interaction between clients and the server.
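The four-stage interaction of Figure 2 can also be summarized in pseudocode form. The sketch below follows the stages described in the Abstract under our own assumptions: `local_train`, `refine_with_prototypes`, the use of k-means for grouping, and the unweighted averaging are illustrative choices rather than details confirmed by the paper.

```python
# A minimal sketch of one communication round of the multi-stage interaction:
# local training, clustering into K groups with intra-group aggregation,
# prototype-guided refinement, and final global aggregation.
import numpy as np
from sklearn.cluster import KMeans

def fedavg(weight_list):
    """Unweighted parameter averaging over a list of flattened weight vectors."""
    return np.mean(np.stack(weight_list), axis=0)

def multi_stage_round(clients, global_weights, k=3):
    # Stage 1: each client trains locally, starting from the current global model.
    local = [c.local_train(global_weights) for c in clients]

    # Stage 2: the server clusters the locally trained models into K groups and
    # forms one prototype per group by intra-group averaging, so that only
    # models with consistent parameters are averaged together.
    groups = KMeans(n_clusters=k, n_init=10).fit_predict(np.stack(local))
    prototypes = [fedavg([w for w, g in zip(local, groups) if g == j])
                  for j in range(k)]

    # Stage 3: clients refine their models against the K prototypes (e.g., with
    # the contrastive group-specific consistency regularization sketched earlier).
    refined = [c.refine_with_prototypes(prototypes, own_group=g)
               for c, g in zip(clients, groups)]

    # Stage 4: the server aggregates the refined models into the global model.
    return fedavg(refined)
```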
Figure 3. The performance of FMMAN compared with the other models in terms of precision and recall on the FLAIR dataset.
Figure 4. The F1 scores on the Substation Defect Detection dataset.
Figure 5. The performance of FMMAN compared with the other models in terms of precision and recall on the Substation Defect Detection dataset.
Figure 6. The performance of FMMAN compared with the other models in terms of precision and recall on the Control Cabinet Condition Monitoring dataset.
Figure 7. The performance of FMMAN in the ablation studies of the designed label correlation embedding graphs.
Figure 8. The performance of FMMAN in the ablation studies of the aggregation strategy.
Table 1. Results on FLAIR dataset.

Method              Macro-AP  Macro-F1  Micro-AP  Micro-F1
AvgFL-MLP-Mixer     40.91%    36.09%    77.33%    67.44%
AvgFL-ConvMixer     41.46%    37.31%    78.71%    67.98%
AvgFL-PoolFormer    42.62%    38.22%    79.16%    68.63%
AvgFL-ResNet-50     40.13%    33.51%    77.12%    66.09%
MOON-ConvMixer      41.91%    38.71%    79.20%    69.15%
MOON-PoolFormer     42.90%    40.03%    79.59%    69.08%
MOON-ResNet-50      43.32%    36.53%    77.34%    67.45%
FL-C_Tran           56.00%    43.02%    88.15%    76.71%
FedLGT              60.60%    54.94%    88.72%    86.23%
FedHybrid           61.03%    54.93%    88.43%    86.17%
FedCor              61.33%    54.07%    87.62%    86.10%
ViT-Tiny            60.72%    53.87%    87.67%    86.10%
XueSGCN             61.13%    54.98%    87.99%    86.22%
FMMAN               62.15%    55.58%    88.94%    86.71%
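As a reading aid for the four metrics reported in Tables 1–6, the snippet below shows one standard way to compute them with scikit-learn on toy data; the 0.5 decision threshold used to binarize scores for the F1 metrics is our assumption, not a detail stated in the paper.

```python
# Computing Macro/Micro average precision (AP) and F1 for multi-label outputs.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

y_true = np.array([[1, 0, 1], [0, 1, 1]])                 # ground-truth labels
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.6]])    # per-label confidences
y_pred = (y_score >= 0.5).astype(int)                     # assumed 0.5 threshold

print(f"Macro-AP {average_precision_score(y_true, y_score, average='macro'):.2%}")
print(f"Micro-AP {average_precision_score(y_true, y_score, average='micro'):.2%}")
print(f"Macro-F1 {f1_score(y_true, y_pred, average='macro'):.2%}")
print(f"Micro-F1 {f1_score(y_true, y_pred, average='micro'):.2%}")
```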
Table 2. Results on PASCAL VOC dataset.

Method              Macro-AP  Macro-F1  Micro-AP  Micro-F1
AvgFL-ConvMixer     86.93%    80.19%    91.41%    84.21%
AvgFL-PoolFormer    86.94%    81.34%    91.62%    85.24%
AvgFL-ResNet-50     87.77%    82.53%    91.74%    85.26%
MOON-ConvMixer      86.77%    81.03%    91.43%    84.33%
MOON-PoolFormer     87.85%    81.28%    91.71%    86.10%
MOON-ResNet-50      88.39%    81.32%    91.45%    84.51%
FL-C_Tran           89.01%    84.13%    91.72%    87.46%
FedLGT              89.72%    84.25%    91.79%    88.43%
FedCor              89.73%    84.87%    93.02%    87.77%
ViT-Tiny            90.02%    85.37%    93.13%    88.18%
XueSGCN             89.11%    85.02%    92.79%    88.25%
FMMAN               91.66%    86.95%    94.37%    89.70%
Table 3. Results of FMMAN compared with the commonly used models on the Substation Defect Detection dataset.

Method              Macro-AP  Macro-F1  Micro-AP  Micro-F1
AvgFL-MLP-Mixer     50.02%    43.01%    70.90%    67.60%
AvgFL-ConvMixer     50.05%    43.18%    71.01%    69.41%
AvgFL-PoolFormer    50.33%    44.02%    71.50%    69.16%
AvgFL-ResNet-50     49.19%    42.26%    70.08%    68.69%
MOON-ConvMixer      51.19%    44.12%    72.94%    71.66%
MOON-PoolFormer     50.27%    44.21%    72.32%    71.36%
MOON-ResNet-50      49.28%    43.07%    70.22%    70.19%
FL-C_Tran           58.47%    50.02%    73.23%    72.47%
FedLGT              65.23%    60.98%    79.72%    77.92%
FedHybrid           65.44%    61.08%    79.44%    78.02%
FedCor              65.23%    61.56%    79.58%    77.78%
ViT-Tiny            66.04%    61.72%    79.23%    77.34%
XueSGCN             66.13%    61.88%    79.67%    77.79%
FMMAN               67.14%    62.33%    80.26%    78.76%
Table 4. Results on Substation Control Cabinet Condition Monitoring dataset.

Method              Macro-AP  Macro-F1  Micro-AP  Micro-F1
AvgFL-MLP-Mixer     62.76%    59.04%    70.17%    69.93%
AvgFL-ConvMixer     63.34%    60.23%    70.39%    71.32%
AvgFL-PoolFormer    63.88%    60.43%    71.72%    71.28%
AvgFL-ResNet-50     63.27%    60.09%    70.93%    70.77%
MOON-ConvMixer      64.06%    60.34%    72.66%    71.31%
MOON-PoolFormer     63.25%    60.37%    71.74%    71.42%
MOON-ResNet-50      63.73%    60.21%    70.78%    70.29%
FL-C_Tran           66.97%    61.09%    73.99%    72.67%
FedLGT              67.14%    63.36%    74.37%    72.71%
FedCor              67.39%    63.36%    74.78%    72.70%
ViT-Tiny            67.44%    63.72%    74.28%    73.06%
XueSGCN             67.68%    63.77%    74.94%    72.98%
FMMAN               68.33%    64.37%    75.81%    73.73%
Table 5. Performance of FMMAN in the ablation experiment for label correlation embedding.

Method                      Macro-AP  Macro-F1  Micro-AP  Micro-F1
FMMAN_without_graph         66.87%    61.19%    79.93%    78.41%
FMMAN_without_mask          65.41%    60.66%    79.72%    78.40%
FedLGT                      65.23%    60.98%    79.72%    77.92%
FedLGT-MS                   65.73%    61.25%    79.83%    78.32%
FMMAN_without_contrastive   67.04%    61.97%    79.76%    78.53%
FMMAN                       67.14%    62.33%    80.26%    78.76%
Table 6. Performance of FMMAN in the ablation experiment for aggregation strategy.

Method          Macro-AP  Macro-F1  Micro-AP  Micro-F1
FedLGT          65.23%    60.98%    79.72%    77.92%
FMMAN_without   66.55%    61.26%    79.77%    70.08%
FMMAN           67.14%    62.33%    80.26%    78.76%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
