Electronics
  • Article
  • Open Access

24 September 2025

Dynamic Visual Privacy Governance Using Graph Convolutional Networks and Federated Reinforcement Learning

1 Department of Engineering Science and Ocean Engineering, National Taiwan University, Taipei 10617, Taiwan
2 Department of Information Management, National Taiwan University of Science and Technology, Taipei 10617, Taiwan
* Authors to whom correspondence should be addressed.
This article belongs to the Special Issue Security and Privacy for Modern Wireless Communication Systems, 3rd Edition

Abstract

The proliferation of image sharing on social media poses significant privacy risks. Although some previous works have proposed methods to detect privacy attributes in shared images, they suffer from the following shortcomings: (1) reliance on legacy architectures, (2) failure to model label correlations (i.e., semantic dependencies and co-occurrence patterns among privacy attributes), and (3) adoption of static, one-size-fits-all user preference models. To address these, we propose a comprehensive framework for visual privacy protection. First, we establish a new state-of-the-art (SOTA) architecture using modern vision backbones. Second, we introduce Graph Convolutional Networks (GCN) as a classifier head to model label correlations. Third, to replace static user models, we design a dynamic personalization module that uses Federated Learning (FL) for privacy preservation and Reinforcement Learning (RL) to continuously adapt to individual user preferences. Experiments on the VISPR dataset demonstrate that our approach outperforms the previous work by a substantial margin of 6 percentage points in mAP (52.88% vs. 46.88%) and improves the Overall F1-score by 10% (0.770 vs. 0.700). This provides more meaningful and personalized privacy recommendations, setting a new standard for user-centric privacy protection systems.

1. Introduction

In the contemporary digital landscape, visual content has become the lingua franca of social interaction, with billions of images shared daily across online platforms. This torrent of user-generated content, however, creates a significant attack surface for privacy breaches, where seemingly innocuous photographs can inadvertently expose privacy attributes ranging from personal documents to geolocation data. This phenomenon, often termed the privacy paradox, underscores a critical gap between users’ desire for social engagement and their ability to manage the associated privacy risks effectively [,].
To address this, the field of automated visual privacy assessment has gained traction, typically formulating the problem as an image recognition task. Seminal studies, such as that by Jiang et al. [], have laid important groundwork by employing deep learning models, like ResNet (Residual Network) [] combined with a custom attention mechanism called PSSA (Privacy-Specific Spatial Attention), to detect privacy attributes. However, we argue that such approaches are constrained by outdated assumptions and suffer from three limitations:
  • Legacy architectures: Previous approaches are built only upon legacy architectures, which lack the powerful feature representation capabilities of modern state-of-the-art (SOTA) vision backbones like Vision Transformer (ViT) [].
  • Failure to model label correlations: Previous approaches fail to model the semantic dependencies and co-occurrence patterns that exist among privacy attributes (e.g., the “passport” attribute is highly correlated with “face”).
  • Static user preference models: Their user preference models are typically static and collective, derived from one-off user studies, which cannot adapt to the dynamic and deeply personal nature of individual privacy boundaries.
This paper posits that a truly effective visual privacy system must evolve from a static prediction tool into a dynamic, personalized, and explainable governance partner. We propose a holistic framework that systematically addresses the aforementioned limitations. Our main contributions are threefold:
  • To overcome legacy architectures, we establish a new SOTA baseline by employing a modern ConvNeXt [] backbone and advocate for a more balanced evaluation protocol that emphasizes not only mAP but also the Overall F1-score (OF1), which is more representative of real-world performance in privacy-sensitive applications.
  • To solve the problem of failure to model label correlations, we introduce Graph Convolutional Networks (GCNs) as a classifier head to explicitly leverage label correlations for more coherent and accurate predictions, a technique proven effective in general multi-label contexts [].
  • To replace the static user preference models, we design a dynamic and privacy-preserving framework that integrates Federated Learning (FL) [] and Reinforcement Learning (RL) []. FL ensures that user preference models are trained on-device to preserve privacy, while RL allows the system to continuously and dynamically adapt to each user’s evolving privacy feedback, treating them as a unique individual.
Through extensive experimentation on the VISPR dataset, we demonstrate that our framework not only sets a new baseline in predictive performance but also provides a more robust, personalized, and interpretable solution for visual privacy governance.

3. Proposed Methods

To address the limitations of prior work, we propose a framework for visual privacy governance that is performant, robust, and personalized. As illustrated in Figure 1, our framework is composed of three main stages: (1) A core privacy attribute recognition module that leverages a new vision backbone and GCN to model label correlations. (2) A dynamic personalization module that uses FRL to adapt to individual user preferences while preserving privacy. (3) A dynamic risk assessment module that synthesizes the model’s predictions and the user’s preferences into an intuitive risk score.
Figure 1. Overall architecture of our proposed framework.

3.1. Privacy Attribute Recognition

The foundation of our framework is a powerful and balanced recognition model; we improve upon existing methods in two key aspects: the feature extractor and the classifier design. Instead of relying on ResNet-50, we employ a modern vision model, ConvNeXt, as our backbone feature extractor $\phi(\cdot)$. Given an input image $I$, it extracts a high-dimensional feature map $Z = \phi(I)$. This provides a richer and more powerful feature representation, which is crucial for identifying subtle privacy attributes. The entire backbone is fine-tuned on the target dataset using a weighted loss function to mitigate the effects of class imbalance. Recognizing that privacy attributes are not independent, we replace the conventional linear classifier with a GCN head to explicitly model label correlations. We construct a directed graph $G = (V, E)$, where each node $v_i \in V$ corresponds to one of the $C$ privacy attributes. The edges $E$ represent the conditional probabilities between attributes, estimated from the training set. The adjacency matrix $A \in \mathbb{R}^{C \times C}$ is defined as:
$A_{ij} = P(\mathrm{label}_j \mid \mathrm{label}_i)$
This matrix captures the likelihood of attribute $j$ appearing given that attribute $i$ is present. We operate on the assumption that this statistical co-occurrence, derived from the training data, serves as a strong proxy for the underlying label correlations between privacy attributes. For instance, the high co-occurrence of ‘passport’ and ‘face’ labels reflects their inherent semantic link. Therefore, the graph $G$ constructed from these statistical relationships provides the structural priors needed to guide the GCN toward making semantically coherent predictions. The initial node features, $H^{(0)} \in \mathbb{R}^{C \times D}$, are derived from the image features $Z$. The layer-wise propagation rule is:
$H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)})$
where $\hat{A}$ is the normalized adjacency matrix, $W^{(l)}$ is the trainable weight matrix of the $l$-th layer, and $\sigma$ is an activation function. The output of the final GCN layer, $H^{(L)}$, which now contains correlation-aware label representations, is used to produce the final multi-label predictions.
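For concreteness, the following PyTorch sketch shows one possible realization of this GCN head, assuming the conditional-probability adjacency matrix defined above is estimated from a binary training-label matrix and that, following the ML-GCN recipe, the final node representations act as per-class classifiers applied to the image feature $Z$ via a dot product. The symmetric normalization, the LeakyReLU activation, and the layer dimensions (512-dimensional node features, 1024-dimensional image features, taken from Section 4.2) are illustrative choices, not a definitive implementation.

```python
import torch
import torch.nn as nn

def conditional_prob_adjacency(Y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """A[i, j] = P(label_j | label_i), estimated from a binary label matrix Y
    of shape (num_images, C), as in the equation above."""
    co = Y.t() @ Y                               # co[i, j]: images containing both labels i and j
    counts = torch.diagonal(co)                  # counts[i]: images containing label i
    return co / (counts.unsqueeze(1) + eps)

def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization A_hat = D^{-1/2} (A + I) D^{-1/2} (one common choice)."""
    A = A + torch.eye(A.size(0), device=A.device)
    d_inv_sqrt = torch.diag(A.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ A @ d_inv_sqrt

class GCNHead(nn.Module):
    """Two-layer GCN over the label graph; the final node representations serve as
    per-class classifiers applied to the image feature Z (ML-GCN-style assumption)."""
    def __init__(self, A: torch.Tensor, node_dim: int = 512,
                 hidden_dim: int = 1024, feat_dim: int = 1024):
        super().__init__()
        self.register_buffer("A_hat", normalize_adjacency(A))
        self.W1 = nn.Linear(node_dim, hidden_dim, bias=False)   # W^(0)
        self.W2 = nn.Linear(hidden_dim, feat_dim, bias=False)   # W^(1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, H0: torch.Tensor, Z: torch.Tensor) -> torch.Tensor:
        H1 = self.act(self.A_hat @ self.W1(H0))  # H^(1) = sigma(A_hat H^(0) W^(0))
        H2 = self.A_hat @ self.W2(H1)            # H^(L): correlation-aware label representations
        return Z @ H2.t()                        # multi-label logits, shape (batch, C)
```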

3.2. Dynamic and Personalized Governance via FRL

To overcome the static, one-size-fits-all nature of previous user preference models, we introduce a dynamic personalization module based on a synthesis of Federated Learning and Reinforcement Learning. The methodology is composed of three key components: a simulated learning environment, the agent’s architecture, and a two-stage learning framework.

3.2.1. Simulation Environment

To train and evaluate the personalization agents without requiring real user data, we developed a simulation environment based on the OpenAI Gym interface []. The state space is a continuous vector representing the 68 privacy attribute probabilities predicted by our main GCN model for a given image. The action space is discrete, where an action corresponds to selecting one of the 68 attributes to recommend for user attention (e.g., to be blurred or reviewed). The agent’s goal is to select the action that is most relevant to the user’s preferences. Then, to simulate diverse user preferences, we created five distinct user personas: Techie, Social, Family, Financial, and PrivacyGrade. Each persona is defined by a set of keywords (e.g., “financial” keywords include ‘card’, ‘receipt’, ‘bank’). The environment automatically calculates a reward based on the agent’s action. Actions that correctly identify an attribute matching the active persona’s keywords receive a higher positive reward, while incorrect suggestions for these privacy attributes incur a larger penalty, thus guiding the agent to learn the persona’s specific concerns.
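The sketch below outlines such a persona-driven environment using the maintained Gymnasium implementation of the Gym interface. The positive reward magnitude, the single-step episode structure, and the mapping from persona keywords to attribute indices are assumptions for illustration; only the 68-dimensional state and action spaces and the asymmetric penalty for incorrect suggestions (−3, see Section 4.4) come from the text.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

NUM_ATTRIBUTES = 68

class PersonaPrivacyEnv(gym.Env):
    """Minimal persona environment: the state is the 68-dim attribute probability
    vector from the GCN classifier, the action flags one attribute, and the reward
    depends on whether the flagged attribute matches the persona's keywords."""

    def __init__(self, probability_vectors, persona_attribute_ids):
        super().__init__()
        self.observation_space = spaces.Box(0.0, 1.0, shape=(NUM_ATTRIBUTES,), dtype=np.float32)
        self.action_space = spaces.Discrete(NUM_ATTRIBUTES)
        self.probs = probability_vectors            # one probability row per image
        self.persona = set(persona_attribute_ids)   # attribute indices mapped from the persona keywords
        self.idx = 0

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.idx = int(self.np_random.integers(len(self.probs)))
        return self.probs[self.idx].astype(np.float32), {}

    def step(self, action):
        reward = 1.0 if action in self.persona else -3.0   # +1 is an illustrative assumption
        terminated = True                                  # one recommendation per image
        return self.probs[self.idx].astype(np.float32), reward, terminated, False, {}
```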

3.2.2. Agent Architecture

The core of our personalization module is a Deep Q-Network (DQN) agent []. The agent’s neural network architecture is a Multi-Layer Perceptron (MLP) with two hidden layers (512 and 256 neurons, respectively), LeakyReLU activation functions, and Dropout layers with a rate of 0.5 for regularization. This network takes the state vector as input and outputs a Q-value for each possible action.
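A minimal PyTorch sketch of this Q-network is given below; the input and output dimensions (68) follow the state and action spaces above, and the rest mirrors the stated architecture.

```python
import torch.nn as nn

class DQNAgentNet(nn.Module):
    """MLP Q-network matching the description: two hidden layers (512 and 256 neurons),
    LeakyReLU activations, and Dropout(p=0.5); input is the 68-dim state, output is
    one Q-value per action."""
    def __init__(self, state_dim: int = 68, num_actions: int = 68):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 512), nn.LeakyReLU(), nn.Dropout(0.5),
            nn.Linear(512, 256), nn.LeakyReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_actions),
        )

    def forward(self, state):
        return self.net(state)   # Q-values, shape (batch, num_actions)
```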

3.2.3. Two-Stage Learning Framework

Our framework employs a two-stage process to effectively train the agents:
Stage 1: Federated Pre-training: We first train a set of initial preference models using the Federated Averaging (fed_avg) algorithm [,]. In this stage, each client’s DQN agent is trained locally in a supervised manner, using a standard BCEWithLogitsLoss function to learn the general preferences of its corresponding user persona. This pre-training provides the agent with a strong baseline understanding of user preferences before any real-time interaction.
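A minimal sketch of the server-side aggregation step in this stage is shown below; uniform client weighting is an assumption, and the function simply averages the clients' locally trained DQN parameters.

```python
import copy
import torch

def fed_avg(client_state_dicts, client_weights=None):
    """Weighted Federated Averaging of client model parameters (minimal sketch).
    `client_state_dicts` holds the state_dicts of the locally trained DQN agents."""
    if client_weights is None:
        client_weights = [1.0 / len(client_state_dicts)] * len(client_state_dicts)
    global_state = copy.deepcopy(client_state_dicts[0])
    for key in global_state:
        global_state[key] = sum(w * sd[key].float()
                                for w, sd in zip(client_weights, client_state_dicts))
    return global_state
```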
Stage 2: Online Fine-tuning from Corrective Feedback: During the interactive evaluation phase, the agent adapts based on specific corrective feedback rather than a simple scalar reward. When a user’s feedback is received (simulated as a list of correctly and incorrectly flagged attributes), the agent undergoes online fine-tuning. This process uses a weighted BCEWithLogitsLoss, where the loss for correcting a mistake is given a significantly higher weight (weight = 10.0) than the weight for reinforcing a correct suggestion (weight = 5.0). This asymmetrical weighting scheme ensures that the agent learns rapidly from its errors. Furthermore, this fine-tuning is only applied to attributes where the agent’s initial prediction confidence exceeded a threshold of 0.1, preventing updates on uncertain predictions. The Online Fine-tuning from Corrective Feedback algorithm is shown in Algorithm 1.
A critical aspect of our framework is the seamless integration of the dynamic personalization module with the core privacy attribute recognition module. The DQN agent acts as an intelligent post-processing, personalized re-ranking module. The workflow is as follows: (1) For a given image, the ConvNeXt-GCN classifier first produces an initial vector of attribute probabilities, representing an objective, user-agnostic analysis of the content. (2) This probability vector is then passed as the input state to the user’s personalized DQN agent. (3) The agent, based on its learned policy reflecting the user’s unique preferences, outputs a vector of Q-values. Each Q-value, Q u ( s , a i ) , quantifies the expected future reward of flagging attribute i for user u, effectively serving as a learned, context-aware importance weight. This ensures that the personalization is built upon a robust visual understanding provided by the core classifier.
Algorithm 1: Online Fine-tuning from Corrective Feedback
Require: Agent π, current state s, user feedback F = {F_correct, F_wrong}, learning rate η, number of episodes E, confidence threshold θ_conf
Ensure: Updated agent π
  1: Initialize optimizer for agent π with learning rate η
  2: Get prediction confidences P from state s
  3: I_valid ← { i | P_i > θ_conf }        // Select indices of high-confidence predictions
  4: if I_valid is empty then return π
  5: Initialize target vector T and weight vector W for indices in I_valid
  6: for each index i ∈ I_valid do
  7:       if i ∈ F_correct then
  8:            T_i ← 1.0, W_i ← 5.0       // Reinforce correct suggestions
  9:       else if i ∈ F_wrong then
 10:            T_i ← 0.0, W_i ← 10.0      // Penalize incorrect suggestions more heavily
 11:       end if
 12: end for
 13: Initialize loss function L = BCEWithLogitsLoss(weight = W)
 14: for e = 1 to E do
 15:       S_pred ← π(s)                   // Get prediction scores
 16:       S_masked ← select scores from S_pred with indices I_valid
 17:       loss ← L(S_masked, T)
 18:       Update agent π by minimizing loss via gradient descent
 19: end for
 20: return updated agent π
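The following PyTorch sketch illustrates how Algorithm 1 can be realized. The optimizer choice (Adam), the learning rate, the episode count, and the handling of high-confidence attributes that received no feedback (zero loss weight) are assumptions not fixed by the text; the confidence threshold of 0.1 and the 5.0/10.0 loss weights follow Section 3.2.3.

```python
import torch
import torch.nn as nn

def online_finetune(agent, state, feedback_correct, feedback_wrong,
                    lr=1e-4, episodes=5, conf_threshold=0.1):
    """Sketch of Algorithm 1: fine-tune only on high-confidence attributes, weighting
    corrections (10.0) more heavily than confirmations (5.0). `state` is a 1-D tensor
    of 68 attribute probabilities; lr and episodes are illustrative values."""
    agent.train()
    optimizer = torch.optim.Adam(agent.parameters(), lr=lr)

    with torch.no_grad():
        confidences = torch.sigmoid(agent(state))                 # prediction confidences P
    valid = (confidences > conf_threshold).nonzero(as_tuple=True)[0]
    if valid.numel() == 0:                                        # no high-confidence predictions
        return agent

    targets, weights = [], []
    for i in valid.tolist():
        if i in feedback_correct:
            targets.append(1.0); weights.append(5.0)              # reinforce correct suggestions
        elif i in feedback_wrong:
            targets.append(0.0); weights.append(10.0)             # penalize mistakes more heavily
        else:                                                     # no feedback: zero weight (assumption)
            targets.append(float(confidences[i] > 0.5)); weights.append(0.0)
    targets = torch.tensor(targets)
    criterion = nn.BCEWithLogitsLoss(weight=torch.tensor(weights))

    for _ in range(episodes):
        optimizer.zero_grad()
        scores = agent(state)[valid]                              # S_masked
        loss = criterion(scores, targets)
        loss.backward()
        optimizer.step()
    return agent
```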

3.3. Dynamic Risk Assessment

To provide users with a clear and actionable privacy evaluation, we introduce a dynamic risk assessment module that leverages the output of the personalized DQN agent. This module moves away from abstract quantification methods in favor of a transparent scoring mechanism rooted in the agent’s learned policy. Instead of learning a static weight vector, the DQN agent for a given user $u$ outputs a Q-value vector, $Q_u(s, \cdot)$, for any given state $s$ (the image’s predicted attribute probabilities). Each element in this vector, $Q_u(s, a_i)$, represents the agent’s learned expectation of future reward for taking the action $a_i$ (i.e., flagging the $i$-th privacy attribute). The final, personalized Risk Score $S_u$ is defined as the maximum Q-value in this output vector:
$S_u = \max_i Q_u(s, a_i)$
This approach ensures the risk score is both intuitive and deeply personal. A high score signifies that the agent has identified at least one attribute that it strongly believes is highly relevant to the user’s specific privacy concerns, providing a meaningful and actionable signal for sharing decisions. To ensure this score is not a black box, the entire Q-value vector is used to generate a ranked list of detected attributes, sorted by personal relevance. Furthermore, a bounding box is used to visually pinpoint this attribute’s location in the image, providing direct visual evidence.
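A minimal sketch of this scoring step is shown below: the risk score is the maximum Q-value, and the same Q-value vector is sorted to produce the ranked attribute list; the `top_k` cutoff is illustrative.

```python
import torch

def personalized_risk(agent, state, attribute_names, top_k=5):
    """Compute the personalized risk score S_u = max_i Q_u(s, a_i) and a ranked
    list of the most personally relevant attributes (minimal sketch)."""
    with torch.no_grad():
        q_values = agent(state)                            # Q_u(s, ·), shape (68,)
    risk_score = q_values.max().item()                     # S_u
    order = torch.argsort(q_values, descending=True)[:top_k]
    ranked = [(attribute_names[i], q_values[i].item()) for i in order.tolist()]
    return risk_score, ranked
```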
To illustrate this mechanism in practice, Figure 2 and Figure 3 provide qualitative examples of the system’s user-facing output. Figure 2 demonstrates the detection of sensitive attributes like faces and semi-nudity in a complex public scene, while Figure 3 showcases its effectiveness in identifying personal data within a passport. It is important to note that our model performs image-level, multi-label classification; consequently, the score shown for a given label (e.g., a12_semi_nudity at 0.4464 in Figure 2) represents the overall confidence for that attribute’s presence anywhere in the image. The combination of a ranked risk list (the “why”) and illustrative visual indicators (the “where”) transforms the system from a black-box detector into a transparent governance partner, delivering the clear, actionable feedback necessary to build user trust.
Figure 2. Qualitative Example of Privacy Detection in a Complex Scene. This figure illustrates the system’s ability to identify multiple privacy attributes in a crowded public environment, such as complete faces (a9_face_complete), partial faces (a10_face_partial), and semi-nudity (a12_semi_nudity).
Figure 3. Qualitative Example of Privacy Detection in a Sensitive Document. This example showcases the framework’s effectiveness in analyzing sensitive documents containing Personally Identifiable Information. The system identifies the document as a passport (a31_passport) and the embedded photograph as a complete face (a9_face_complete) with very high confidence.

4. Experiments

4.1. Experimental Setup

To ensure a fair, head-to-head comparison with prior art and to support reproducibility, our experiments are conducted on the VISPR dataset. All images are resized to 224 × 224 pixels. For training, we apply standard data augmentation techniques, including random resized cropping, horizontal flipping, and color jittering. For validation and testing, we use a single center crop. All images are normalized using the mean and standard deviation of ImageNet. Figure 4 illustrates the distribution of positive samples for each privacy attribute across the entire VISPR dataset and its respective training, validation, and test splits. The visualizations reveal two primary challenges inherent to this dataset:
Figure 4. Distribution of positive samples for each privacy attribute across different data splits in the VISPR dataset. The charts illustrate the severe class imbalance and few-shot learning challenges.
Extreme Class Imbalance: The dataset exhibits a classic long-tailed distribution. Common attributes such as a17_color and a4_gender have over 10,000 positive samples in total, providing enough data for training. In contrast, many critical privacy attributes are extremely rare. For example, the a7_fingerprint attribute has only 46 samples in the entire dataset, and a21_full_nudity has only 51.
Severe Few-Shot Learning Problem: This imbalance directly leads to a few-shot learning scenario for many crucial classes. As shown in the training set distribution, the model has very limited data to learn from for the most sensitive privacy attributes. For instance, there are only 24 training samples for a7_fingerprint, 21 for a21_full_nudity, 21 for a79_address_home_partial, and 33 for a32_drivers_license.
These data characteristics pose a significant challenge, as a model trained on such imbalanced data can become biased towards the majority classes while failing to reliably detect the rare but critical ones. Although our evaluation is conducted only on the VISPR dataset, this choice is deliberate for validating robustness and real-world applicability. VISPR is not only the standard benchmark used in seminal works, enabling a direct and fair comparison, but also a well-designed dataset that encapsulates the core challenges of real-world visual privacy governance. Its 68 fine-grained attributes are highly representative of the diverse privacy risks found in user-generated content on the Internet. More importantly, the severe class imbalance and few-shot learning problems inherent in the dataset serve as a realistic proxy for the natural rarity of sensitive privacy attributes in the real world. Prior work has used VISPR to demonstrate robustness against long-tailed distributions and the capability to detect infrequent yet critical privacy threats, underscoring its utility for the target application. As a single dataset cannot fully capture all aspects of robustness, extending the evaluation to additional datasets and cross-domain scenarios is an important direction for future work. To mitigate the class imbalance, we employ a weighted loss function during training, as detailed in Section 4.2.

4.2. A New SOTA Architecture with the ConvNeXt-Base Model

To validate our first contribution of establishing a new SOTA baseline with modern architectures, we compare our fully optimized model against the previous method, ResNet-50+PSSA. Our model utilizes a ConvNeXt-Base model pretrained on ImageNet as its backbone for feature extraction. The output dimension of the CNN backbone is 1024. A key feature of our architecture is a dual-head classifier design. In addition to a standard linear classifier head, we incorporate a GCN head to model label correlations. The final prediction is a weighted ensemble of the outputs from both heads, where the weights (alpha and beta) are learnable parameters optimized during training. The learned weights from our best model were alpha = 1.1119 and beta = 0.9838.
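The dual-head design can be sketched as follows. Combining the two heads as a weighted sum of their logits, and initializing alpha and beta at 1.0, are assumptions; the feature dimension (1024) and the learned weight values quoted in the comments come from the text, and `GCNHead` refers to the sketch in Section 3.1.

```python
import torch
import torch.nn as nn

class DualHeadClassifier(nn.Module):
    """Weighted ensemble of a linear head and a GCN head with learnable scalars
    alpha and beta (sketch; the fusion as a logit sum is an assumption)."""
    def __init__(self, backbone, gcn_head, feat_dim=1024, num_classes=68):
        super().__init__()
        self.backbone = backbone                       # e.g., ConvNeXt-Base returning pooled features
        self.linear_head = nn.Linear(feat_dim, num_classes)
        self.gcn_head = gcn_head                       # GCNHead from the Section 3.1 sketch
        self.alpha = nn.Parameter(torch.tensor(1.0))   # learned value ~1.1119 in the best model
        self.beta = nn.Parameter(torch.tensor(1.0))    # learned value ~0.9838 in the best model

    def forward(self, images, label_embeddings):
        z = self.backbone(images)                      # (batch, feat_dim)
        logits_linear = self.linear_head(z)
        logits_gcn = self.gcn_head(label_embeddings, z)
        return self.alpha * logits_linear + self.beta * logits_gcn
```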
For the two-layer GCN component:
Label Embeddings: We use pretrained CLIP text embeddings (512-dim) as the initial node features for the attributes, capturing rich semantic meaning [].
Graph Construction: The adjacency matrix, which defines the label graph structure, is constructed using a PMI-KNN approach. We compute the Positive Pointwise Mutual Information (PMI) between all label pairs from the training set and then build a sparse graph by connecting each label to its k = 15 nearest neighbors []. A self-loop weight of p = 0.25 is applied to each node. This method proved more effective than a simpler binary adjacency matrix based on a conditional probability threshold (t = 0.4).
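A sketch of this PMI-KNN construction is given below, assuming a binary training-label matrix. Whether the retained edges carry the PPMI values or are binarized, and whether the sparsified graph is symmetrized, are not specified in the text, so both choices here are assumptions; k = 15 and the self-loop weight p = 0.25 follow the description above.

```python
import numpy as np

def pmi_knn_adjacency(Y: np.ndarray, k: int = 15, self_loop: float = 0.25, eps: float = 1e-8):
    """PMI-KNN graph sketch: positive PMI between label pairs, sparsified by keeping
    each label's k highest-PMI neighbours, plus a self-loop weight on each node."""
    n, c = Y.shape
    p_i = Y.mean(axis=0)                               # marginal label probabilities
    p_ij = (Y.T @ Y) / n                               # joint probabilities
    pmi = np.log((p_ij + eps) / (np.outer(p_i, p_i) + eps))
    ppmi = np.maximum(pmi, 0.0)                        # positive PMI
    np.fill_diagonal(ppmi, 0.0)

    A = np.zeros_like(ppmi)
    for i in range(c):
        nbrs = np.argsort(-ppmi[i])[:k]                # k strongest neighbours of label i
        A[i, nbrs] = ppmi[i, nbrs]                     # keep PPMI weights (assumption)
    A = np.maximum(A, A.T)                             # symmetrize the sparse graph (assumption)
    A = A + self_loop * np.eye(c)                      # self-loop weight p = 0.25
    return A
```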
The model is trained for 15 epochs with a batch size of 36. We use the Adam optimizer with a learning rate of 1 × 10−4 and a weight decay of 1 × 10−4 []. The learning rate is managed by a cosine decay scheduler with a 3-epoch linear warmup. To handle the severe class imbalance inherent in the dataset, we employ the BCEWithLogitsLoss criterion with positive class weighting (pos_weight), where weights are inversely proportional to class frequencies. To accelerate training, we utilize Automatic Mixed Precision (AMP). For evaluation, we report mean Average Precision (mAP), overall precision (OP), overall recall (OR), and overall F1-score (OF1). We use standard terminology: TP = true positives, FP = false positives, and FN = false negatives. mAP is computed by first obtaining, for each class, the average precision (AP) as the area under that class’s precision–recall curve using the ranked prediction scores without thresholding; mAP is the mean of AP over all classes. OP/OR/OF1 are overall metrics, obtained by aggregating TP, FP, and FN over all images and classes before computing precision, recall, and their harmonic mean. Notably, to achieve optimal OF1, we determine a per-class decision threshold on the validation set and apply these thresholds to the test set.
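Two quantitative ingredients of this protocol can be sketched as follows: a class-balanced criterion whose `pos_weight` is the negative-to-positive count ratio (one common way to make weights inversely proportional to class frequency; the exact formula used is not stated, so this is an assumption), and the OP/OR/OF1 computation with per-class thresholds, aggregating TP, FP, and FN over all images and classes.

```python
import numpy as np
import torch
import torch.nn as nn

def make_criterion(label_matrix: np.ndarray) -> nn.BCEWithLogitsLoss:
    """Class-balanced BCE: pos_weight = negatives / positives per class (assumed formula)."""
    pos_counts = label_matrix.sum(axis=0)
    neg_counts = label_matrix.shape[0] - pos_counts
    pos_weight = torch.tensor(neg_counts / np.maximum(pos_counts, 1), dtype=torch.float32)
    return nn.BCEWithLogitsLoss(pos_weight=pos_weight)

def overall_metrics(scores: np.ndarray, labels: np.ndarray, thresholds: np.ndarray):
    """OP/OR/OF1: binarize with per-class thresholds, then aggregate TP/FP/FN over
    all images and classes before computing precision, recall, and their harmonic mean."""
    preds = (scores >= thresholds[None, :]).astype(np.int32)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    op = tp / max(tp + fp, 1)
    orec = tp / max(tp + fn, 1)
    of1 = 2 * op * orec / max(op + orec, 1e-8)
    return op, orec, of1
```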
As shown in Table 1, our method outperforms [] by 6 percentage points in mAP. Furthermore, it demonstrates a deliberate and favorable performance trade-off: it achieves a substantial 14.7 percentage point increase in OR at the cost of only a 0.007 drop in OP. This design, which errs on the side of caution, is critical for privacy applications where failing to detect a risk is significantly more costly than a false alarm. Consequently, the OF1 is the most pertinent metric, as it reflects this essential balance between risk sensitivity and practical usability.
Table 1. Comparison with the previous method on the VISPR dataset.

4.3. Structured Label Correlations with GCN

To demonstrate our second contribution, the effectiveness of using GCN to model label correlations, we compared the performance of the linear head (CNN only), the GCN head (GCN only), and the final weighted ensemble model. A powerful backbone is fundamental to extracting rich visual features. To select the best-performing backbone, we evaluated several mainstream pretrained models on our task. The Configuration column specifies the classification head used (Linear or GCN) and, where applicable, the training protocol (e.g., Oversampling + Dropout). As shown in Table 2, models based on the ConvNeXt architecture consistently demonstrated the highest performance ceiling, particularly in mAP and OR metrics. Its modern convolutional design proved highly effective for this fine-grained recognition task. Therefore, we selected ConvNeXt-Base as the backbone for all subsequent optimization experiments.
Table 2. Performance comparison of different backbones.
After selecting ConvNeXt-Base as the backbone, we conducted further studies to determine the optimal configuration for our GCN head, focusing on the label embedding method and the graph construction strategy. The results shown in Table 3 indicate that while using GloVe embeddings with a binary adjacency matrix yielded the highest mAP, the combination of CLIP embeddings and a PMI-KNN graph achieved the best OF1 score (0.770). We attribute this to CLIP’s richer semantic representations and PMI-KNN’s more precise modeling of label correlations. Given our emphasis on building a balanced and reliable system, we selected this combination for our final model.
Table 3. Performance comparison of different GCN component configurations.
We validate our core architectural design: a dual-head classifier that combines a standard linear classifier with our optimized GCN classifier. The results shown in Table 4 clearly indicate the complementary nature of the two heads. The CNN-only head is a high-recall, low-precision detector, while the GCN-only head is a high-precision, low-recall classifier. Our model successfully synergizes these strengths, achieving the best performance across both mAP and, crucially, the OF1. We posit that the GCN head functions not as a standalone predictor, but as a semantic consistency regulator. The CNN-only head, operating directly on rich visual features, excels at proposing a wide range of potential privacy attributes but is prone to making uncorrelated errors. Conversely, the GCN-only head lacks direct visual grounding and primarily makes decisions based on the relational structure of the label graph, which explains its high-precision but low-recall characteristics.
Table 4. Head ablation on the VISPR dataset.
Therefore, the abysmal standalone performance of the GCN head is expected. Its contribution materializes in the ensemble, where it modulates the initial, high-recall predictions from the CNN. By leveraging the learned label correlations, the GCN can suppress semantically implausible predictions and amplify combinations of attributes that are logically coherent, thereby correcting the CNN’s errors and enabling the final model to achieve the best performance on both mAP and, crucially, the OF1 score.
To qualitatively demonstrate the effectiveness of our GCN head in modeling label correlations, we visualize and compare the prediction correlation matrices generated by the linear head versus the GCN head, as shown in Figure 5, Figure 6 and Figure 7. Figure 5 shows the prediction correlation from the standard linear head. As observed, outside of the main diagonal, the heatmap is largely unstructured and pale. This indicates that the predictions for different attributes are largely uncorrelated, as the model treats each label as an independent classification task. For instance, the model might predict a high probability for a31_passport without necessarily increasing its confidence for a9_face_complete. Figure 6 displays the label adjacency matrix, which encodes the co-occurrence relationships between attributes learned from the training data. This matrix serves as the structural prior for our GCN, with brighter spots indicating stronger correlations between attributes. Figure 7 shows the prediction correlation from our GCN head. In contrast to the linear head’s heatmap, it exhibits strong, block-like structures. These red blocks (positive correlation) clearly mirror the relationships defined in the adjacency matrix. For example, attributes related to personal identity like a31_passport, a9_face_complete, a20_name_first, and a21_name_last now show strong positive correlations in their predictions. This visually confirms that our GCN head has successfully learned to leverage the label graph to produce structured and logically coherent predictions. This visual evidence strongly supports our claim that incorporating a GCN is a crucial step toward building a more intelligent and context-aware privacy detection system.
Figure 5. Unstructured Predictions from the Linear Head. This heatmap visualizes the prediction correlations from the CNN-only classifier. The lack of strong off-diagonal patterns indicates that the model treats each privacy attribute independently, failing to capture their real-world relationships.
Figure 6. The PMI-KNN label adjacency matrix used by the GCN, which visualizes the co-occurrence relationships between attributes and serves as structured prior knowledge.
Figure 7. Structured Predictions from the GCN Head. In contrast to the linear head, the GCN head produces highly structured predictions, as shown by the distinct block-like correlations. These patterns closely mirror the adjacency matrix in Figure 6, visually confirming that the GCN successfully leverages the label graph to enforce semantic consistency.

4.4. Dynamic Personalization with FRL

To evaluate the effectiveness of our FRL module in adapting to diverse user preferences, we designed a simulation-based experiment, as collecting real-world, long-term user feedback is impractical. We simulated five distinct user personas to represent a variety of privacy concerns: Techie, Social, Family, Financial, and PrivacyGrade. Each persona’s preferences are defined by a set of keywords that map to specific privacy attributes, allowing for automated reward generation. The overall training process followed the Federated Averaging paradigm over 5 communication rounds, where each client trained a local Deep Q-Network agent for 50 epochs before model aggregation. The final evaluation was conducted on a set of 100 images over 40 rounds of simulated interaction. In each round, the agent performed inference, received simulated feedback, and then underwent online fine-tuning driven by an asymmetrical reward system where incorrect suggestions were penalized more heavily (reward = −3) to encourage the agent to prioritize user satisfaction.
Figure 8 shows the average reward trajectory over 40 rounds. Despite the asymmetric penalty design pushing the absolute reward downward over time, the agent’s retrieval quality remains stable (and slightly improves) as seen in Figure 9, where Recall@N stays in the ~0.74–0.76 band for most rounds. This decoupling is expected because we weigh corrections more than confirmations, deliberately biasing the scalar reward to prioritize user satisfaction when mistakes occur. Most importantly, the multiplicative rule consistently outperforms additive fusion (Figure 10 vs. Figure 11). The multiplicative method reaches a best Top-N hit rate around 0.77 and stabilizes near 0.74–0.76 after early rounds, whereas the additive rule peaks near 0.71 and hovers around 0.69–0.70 thereafter. This confirms that consensus-seeking fusion amplifies agreements across clients while suppressing noisy local preferences, yielding higher and more stable personalization quality. Taken together, these results demonstrate that our FRL module learns persona-specific preferences online without centralizing user data, maintains strong retrieval quality under asymmetric feedback, and benefits materially from consensus-based (multiplicative) aggregation on the server. In practice, this means the system adapts to individual users while remaining robust to client heterogeneity, precisely the behavior the personalization layer is designed to achieve.
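To make the fusion comparison concrete, the sketch below contrasts the two server-side aggregation rules. The exact quantities being fused are not detailed in the text, so we assume each client contributes a per-attribute preference score in (0, 1]; the multiplicative rule rewards cross-client consensus, whereas the additive rule lets a single enthusiastic client dominate.

```python
import numpy as np

def fuse_preferences(client_scores: np.ndarray, mode: str = "multiplicative") -> np.ndarray:
    """Fuse per-attribute preference scores from several clients (rows) into one
    server-side ranking vector; the operands are illustrative assumptions."""
    if mode == "multiplicative":
        fused = np.prod(client_scores, axis=0)   # consensus-seeking: must score well for every client
    elif mode == "additive":
        fused = np.sum(client_scores, axis=0)    # one high-scoring client can dominate
    else:
        raise ValueError(mode)
    return fused / (fused.max() + 1e-8)          # normalize for ranking / Top-N selection

# Illustrative example: three clients, five attributes
clients = np.array([[0.9, 0.2, 0.7, 0.1, 0.8],
                    [0.8, 0.9, 0.6, 0.1, 0.7],
                    [0.9, 0.1, 0.8, 0.2, 0.9]])
top_n = np.argsort(-fuse_preferences(clients, "multiplicative"))[:3]
```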
Figure 8. Average Reward Trajectory During FRL Personalization. The plot shows the average reward per round for the agent. The gradual downward trend is an expected outcome of our asymmetric feedback model, where incorrect suggestions are penalized more heavily than correct ones are rewarded.
Figure 9. Recall@N across rounds during FRL personalization. This plot demonstrates that despite the reward trend seen in Figure 8, the agent’s practical performance, measured by Recall@N, remains high and stable after the initial learning phase. This confirms that the agent is effectively learning and adapting to the user persona’s preferences throughout the interaction.
Figure 10. Top-N hit rate with multiplicative fusion (40 rounds). This graph shows the Top-N hit rate using the multiplicative aggregation method. The performance quickly stabilizes in a high range (0.74–0.76), demonstrating that this consensus-based approach effectively amplifies shared preferences across clients, leading to robust and superior personalization quality.
Figure 11. Top-N hit rate with additive fusion (40 rounds). In contrast to the multiplicative method in Figure 10, the additive fusion approach shown here achieves a lower and less stable Top-N hit rate, hovering around 0.69–0.70. This comparison highlights the superiority of the consensus-seeking multiplicative approach for handling diverse user preferences in a federated setting.

4.5. Computational Cost

To validate the real-world applicability of our proposed framework, particularly its feasibility for deployment on resource-constrained edge devices within a federated learning setting, we conducted a comprehensive evaluation of its computational cost. This section presents a detailed analysis of three key aspects: model size, training and fine-tuning costs, and inference latency. All performance benchmarks were conducted on a desktop computer with Intel i5-9400F CPU, RTX 3060 12 GB GPU, 32 GB RAM. In a federated learning architecture, the computational load is strategically divided between server-side initial training and client-side continuous fine-tuning. On our experimental platform, training the core model for 15 epochs took 1 h, 8 min, and 53 s, with a peak memory usage of 1905.33 MB. This phase corresponds to the server-side training in a federated setup. As a one-time, offline process, it does not consume resources on the end-user’s device. We simulated the on-device personalization scenario by performing 40 rounds of updates using 100 images. Low inference latency is essential for a seamless user experience. We measured the inference latency of our framework with the following results:
  • Average Classifier Latency: 19.1322 ms/image
  • Average RL Agent Latency: 0.5352 ms/image
  • Average Total Inference Latency: 19.6675 ms/image
The total inference latency of under 20 ms allows the system to operate at over 50 Frames Per Second, which is more than sufficient for real-time applications. Notably, the decision-making overhead of the RL agent constitutes only 2.7% of the total latency, demonstrating that our framework can provide sophisticated personalization with minimal performance impact.
Our analysis shows that the proposed framework achieves an excellent balance between computational performance and deployment feasibility. By design, the most resource-intensive training tasks are offloaded to a central server, while the on-device operations remain lightweight and efficient. The experimental data confirms that our framework possesses a manageable model size, highly efficient personalization fine-tuning, and outstanding real-time inference speed, validating its feasibility for deployment on resource-constrained edge devices.

5. Conclusions and Future Works

In this paper, we proposed a comprehensive visual privacy governance framework to address the limitations of prior work: legacy architectures, failure to model label correlations, and static user preference models. Our framework utilizes a modern ConvNeXt backbone with an innovative dual-head GCN classifier to model label correlations, alongside an FRL module for dynamic, privacy-preserving personalization. Experiments on the VISPR dataset show that our recognition model outperforms the previous method by 6 percentage points in mAP while achieving a 10% higher OF1.
Looking ahead, our primary focus is to build upon this foundation to enhance its real-world applicability. First, to confirm the generalization of our core recognition model, future work will involve evaluation on additional privacy datasets and diverse real-world collections. A second key priority is to validate the FRL module beyond simulated personas by conducting controlled user studies with real-world feedback, a task for which our privacy-preserving federated architecture is inherently well-suited. Furthermore, to improve reliability, future iterations must address the challenges of severe class imbalance and concept drift. We plan to explore advanced few-shot learning methodologies for rare attributes and integrate continual learning frameworks to ensure the system can adapt to new privacy categories and evolving societal norms. In parallel, we will tighten on-device efficiency through model compression and extend the framework’s scope to video privacy by modeling spatiotemporal correlations, further broadening its impact on protecting user privacy in a dynamic digital world.

Author Contributions

Conceptualization, R.-I.C. and C.Y.; Methodology, R.-I.C. and C.Y.; Software, C.Y. and W.-X.L.; Validation, C.Y. and W.-X.L.; Formal Analysis, R.-I.C. and C.Y.; Investigation, R.-I.C. and C.Y.; Data Curation, C.Y. and W.-X.L.; Writing—Original Draft Preparation, R.-I.C. and C.Y.; Writing—Review and Editing, R.-I.C. and C.Y.; Visualization, R.-I.C. and C.Y.; Supervision, R.-I.C.; Project Administration, R.-I.C.; Funding Acquisition, R.-I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The research presented in this study is based on the publicly available VISPR dataset, which can be accessed at https://tribhuvanesh.github.io/vpa/ (accessed on 15 August 2025). No new data were created during this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
mAP: mean Average Precision
OP: Overall Precision
OR: Overall Recall
OF1: Overall F1-score
GCN: Graph Convolutional Network
FL: Federated Learning
RL: Reinforcement Learning
FRL: Federated Reinforcement Learning
CNN: Convolutional Neural Network
ViT: Vision Transformer
ResNet: Residual Network
GNN: Graph Neural Network
MLR: Multi-Label Image Recognition
VISPR: Visual Privacy Dataset
PSSA: Privacy-Specific Spatial Attention
CLIP: Contrastive Language–Image Pre-training
GloVe: Global Vectors for Word Representation
PMI: Positive Pointwise Mutual Information
KNN: k-Nearest Neighbors
PMI-KNN: PMI-based kNN graph
MLP: Multi-Layer Perceptron
DQN: Deep Q-Network
AMP: Automatic Mixed Precision
BCEWithLogitsLoss: Binary Cross-Entropy with Logits Loss
pos_weight: positive class weighting
Adam: Adaptive Moment Estimation
AdamW: Adam with decoupled Weight Decay
GELU: Gaussian Error Linear Unit
LeakyReLU: Leaky Rectified Linear Unit
SVM: Support Vector Machines
NLP: Natural Language Processing

References

  1. Gross, R.; Acquisti, A. Information revelation and privacy in online social networks. In Proceedings of the 2005 ACM Workshop on Privacy in the Electronic Society, Alexandria, VA, USA, 7 November 2005; pp. 71–80. [Google Scholar]
  2. Kokolakis, S. Privacy attitudes and privacy behaviour: A review of current research on the privacy paradox phenomenon. Comput. Secur. 2017, 64, 122–134. [Google Scholar] [CrossRef]
  3. Jiang, H.; Zuo, J.; Lu, Y. Connecting Visual Data to Privacy: Predicting and Measuring Privacy Risks in Images. Electronics 2025, 14, 811. [Google Scholar] [CrossRef]
  4. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  6. Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  7. Chen, Z.-M.; Wei, X.-S.; Wang, P.; Guo, Y. Multi-label image recognition with graph convolutional networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–17 June 2019; pp. 5177–5186. [Google Scholar]
  8. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 1–210. [Google Scholar] [CrossRef]
  9. Qi, J.; Zhou, Q.; Lei, L.; Zheng, K. Federated reinforcement learning: Techniques, applications, and open challenges. arXiv 2021, arXiv:2108.11887. [Google Scholar] [CrossRef]
  10. Orekondy, T.; Schiele, B.; Fritz, M. Towards a visual privacy advisor: Understanding and predicting privacy risks in images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3686–3695. [Google Scholar]
  11. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  12. Chen, T.; Xu, M.; Hui, X.; Wu, H.; Lin, L. Learning semantic-specific graph representation for multi-label image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 522–531. [Google Scholar]
  13. Guo, H.; Zheng, K.; Fan, X.; Yu, H.; Wang, S. Visual attention consistency under image transforms for multi-label image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 729–739. [Google Scholar]
  14. Lanchantin, J.; Wang, T.; Ordonez, V.; Qi, Y. General multi-label image classification with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16478–16488. [Google Scholar]
  15. Zhao, J.; Yan, K.; Zhao, Y.; Guo, X.; Huang, F.; Li, J. Transformer-based dual relation graph for multi-label image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 163–172. [Google Scholar]
  16. Pati, S.; Kumar, S.; Varma, A.; Edwards, B.; Lu, C.; Qu, L.; Wang, J.J.; Lakshminarayanan, A.; Wang, S.-h.; Sheller, M.J. Privacy preservation for federated learning in health care. Patterns 2024, 5, 100974. [Google Scholar] [CrossRef] [PubMed]
  17. Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. Openai gym. arXiv 2016, arXiv:1606.01540. [Google Scholar] [CrossRef]
  18. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  19. Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized federated learning: A meta-learning approach. arXiv 2020, arXiv:2002.07948. [Google Scholar] [CrossRef]
  20. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282. [Google Scholar]
  21. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  22. Yao, L.; Mao, C.; Luo, Y. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; pp. 7370–7377. [Google Scholar]
  23. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  24. Fang, Y.; Wang, W.; Xie, B.; Sun, Q.; Wu, L.; Wang, X.; Huang, T.; Wang, X.; Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 19358–19369. [Google Scholar]
