1. Introduction
Workplace safety is a crucial factor in construction and industrial contexts. Data provided by the International Labour Organisation (ILO) show that every year 2.78 million workers lose their lives due to accidents at work and occupational diseases [1]. There are several causes for this, such as failure to use personal protective equipment (PPE) or its incorrect use, lack of adequate governance and regulations, and misinformation about the importance of health prevention in the workplace. Over the last few years, safety in industrial settings has also been addressed with Artificial Intelligence (AI) techniques in several studies. For instance, Isalovic et al. employed Convolutional Neural Networks (CNNs) on a dataset comprising twelve different types of PPE. Several key architectures were tested, including Faster R-CNN, MobileNetV2-SSD and YOLOv5; the results show that YOLOv5 achieves the best performance, with a recall of 0.61 [2]. Han and Zeng conducted a binary analysis of whether workers on construction sites wear safety helmets and reported a mean average precision (mAP) of 0.92 [3]. Furthermore, anomaly detection in industrial settings has also been addressed through the use of Vision Transformers (ViT), architectures that adapt self-attention mechanisms to computer vision tasks: each image is broken down into patches, and each patch is transformed into an embedding that preserves the original spatial information. The core of the architecture enables the model to assess the relative importance of each patch compared to the others across the entire image, so that global relationships are captured from the very first layer. For these reasons, the work by Han et al. [4] is particularly relevant: they proposed an integrated ViT-based method capable of processing spatio-temporal data for monitoring human activity in surveillance contexts.
Among emerging AI techniques, Federated Machine Learning (FML) has proven useful for analysing images of workers. In this context, FML enables classification while protecting the privacy of sensitive information through data decentralisation: training takes place on the local nodes, and the central server only updates the global model from the weights they send. This reduces the risk of sensitive data being exposed. For example, the study conducted by Di Renzo et al. [5] proposes a binary classification of images of safe and unsafe workers. In particular, the adoption of FML with a ViT architecture achieved an accuracy of 0.77.
Starting from these considerations, in this study we propose a multi-class analysis (where the classes are helmet, reflective jacket, unsafe worker and safe worker) using an FML approach in which the explainability of the models is provided through heatmaps. The aim of the analysis is to compare the different heatmaps created, namely Grad-CAM, attention, Rollout and CLS, and to demonstrate their reliability through similarity indices such as the Structural Similarity Index Measure (SSIM), Visual Information Fidelity (VIF) and Spatial Correlation Coefficient (SCC). The article thus aims to provide tools capable of validating the consistency of the activation maps needed to understand which areas of the images have most influenced the model’s classification. In particular, an intra-model comparison (different heatmaps on a single client model) and an inter-model comparison (the same heatmap on different client models) are performed. The inter- and intra-client comparisons demonstrate the consistency of the internal reasoning of each model. The rest of the article is structured as follows:
Section 2 describes the method adopted for the experimental analysis. Section 3 describes the experimental analysis and the results relating to the similarity indices obtained by evaluating the different activation maps in the intra-client and inter-client study. Section 4 discusses the state of the art in automatic PPE detection and, finally, Section 5 presents the conclusions and future work connected to the explainability techniques and the quantitative analysis of PPE detection.
2. The Method
Figure 1 shows the proposed Federated Learning-based method employed for privacy-preserving PPE detection. The process begins with a central server that initializes a global base model and disseminates it to all participating client nodes. Each client operates within a privacy-sensitive workplace environment and acquires images from heterogeneous on-site surveillance systems, which may vary in resolution and device characteristics. To ensure consistency across these decentralized data sources, all images undergo a standardized pre-processing stage that resizes the inputs to $224 \times 224$ pixels prior to training. In the proposed configuration, the data is divided among the clients according to an independent and identically distributed (IID) scheme, ensuring that each node analyses a uniform subset. The dataset used for the analysis consists of 1600 images; consequently, each subset comprises 160 images during the local training phase. With regard to the federated model, in this paper we consider the pretrained WinKawaks/vit-tiny-patch16-224 ViT model, freely available for research purposes on the HuggingFace repository (https://huggingface.co/WinKawaks/vit-tiny-patch16-224 (accessed on 26 April 2026)). This model processes each image by dividing it into fixed-size $16 \times 16$ patches, embedding them into a sequence of tokens, and propagating them through multiple layers of multi-head self-attention. The entire ViT backbone is frozen to preserve its robust, pre-learned visual representations and to prevent overfitting in client environments where training samples may be limited. A lightweight task-specific classification head is appended to the frozen backbone: it consists of a fully connected layer that projects the ViT output into a 128-dimensional latent space, followed by a ReLU activation, a dropout layer with a rate of 0.3 for regularization, and a final linear layer that outputs logits corresponding to the “safe”, “unsafe”, “vest” and “helmet” PPE classes. In particular, “safe” refers to the group of workers who wear both a helmet and a reflective jacket, whilst “unsafe” refers to workers who do not use any PPE. The “helmet” class consists solely of workers who wear only a helmet, without any other PPE; similarly, the “vest” class includes only workers who wear only a reflective jacket. Therefore, each group is associated with a single label. This hybrid design leverages the general-purpose feature extraction of the pretrained transformer while enabling the head to be trained for the downstream classification task.
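The IID split described above can be illustrated with a short NumPy sketch. It is only an illustration under stated assumptions: the number of clients (10) is derived here from 1600 images and 160 images per subset and is not stated explicitly in the text, and `iid_partition` is a hypothetical helper name.

```python
import numpy as np


def iid_partition(num_samples, num_clients, seed=0):
    """Shuffle sample indices and split them into equally sized client shards."""
    if num_samples % num_clients != 0:
        raise ValueError("num_samples must be divisible by num_clients")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_samples)
    # np.split with an integer returns num_clients arrays of equal length
    return np.split(shuffled, num_clients)


# Assumption: 1600 images / 160 images per client = 10 clients
shards = iid_partition(num_samples=1600, num_clients=10)
print(len(shards), len(shards[0]))
```

Because the indices are shuffled uniformly before splitting, each shard is a random sample of the whole dataset, which is what makes the client distributions (approximately) identical.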
In Listing 1, we provide a code snippet, developed in the Python programming language (version 3.9.0), which shows the ViT model implementation.
Listing 1. Code snippet developed by the authors related to the exploited ViT model.
As shown in Listing 1, we employ a Vision Transformer (ViT), a transformer-based architecture specifically designed for image classification tasks. In this work, we make use of the pretrained WinKawaks/vit-tiny-patch16-224 model provided through the Hugging Face transformers library (https://huggingface.co/WinKawaks/vit-tiny-patch16-224 (accessed on 26 April 2026)), freely available for research purposes. A pretrained model, i.e., a model whose weights have been initialized through large-scale image classification tasks, is chosen to ensure that generalizable visual features are already encoded in the proposed model.
Below, we summarize the instructions shown in the Python code snippet reported in Listing 1:
Feature Extraction with ViT: The pretrained ViT model is first loaded and its parameters are frozen, preventing any weight updates during training. In this configuration, the ViT backbone is used strictly as a feature extractor. This approach, known as transfer learning, is particularly effective when the available dataset is limited or when pretrained visual representations are deemed sufficient for the downstream task.
Custom Classification Head: A custom fully connected (FC) classifier is appended on top of the frozen ViT encoder. This classifier consists of the following components:
A linear layer that maps the ViT hidden representation to a 128-dimensional feature vector.
A ReLU activation function to introduce non-linearity.
A dropout layer with a dropout probability of 0.3 to reduce overfitting.
A final linear layer that projects the 128-dimensional vector to the number of target classes (four in our case).
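The frozen-backbone design with the four-component head summarized above can be sketched as follows. This is a minimal sketch, not the authors' code, and it assumes PyTorch. The hidden size of 192 is an assumption about the ViT-Tiny embedding width, and `StandInBackbone` is a hypothetical placeholder so that the example runs without downloading weights. In practice the backbone would be the pretrained ViT (for example loaded with `ViTModel.from_pretrained` from the transformers library, using the CLS embedding `last_hidden_state[:, 0]` as the feature vector).

```python
import torch
from torch import nn

HIDDEN_SIZE = 192  # assumption: embedding width of ViT-Tiny
NUM_CLASSES = 4    # safe, unsafe, vest, helmet


class StandInBackbone(nn.Module):
    """Hypothetical placeholder that maps an image batch to (batch, HIDDEN_SIZE) features."""

    def __init__(self, hidden_size=HIDDEN_SIZE):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(3, hidden_size)

    def forward(self, pixel_values):
        return self.proj(self.pool(pixel_values).flatten(1))


class FrozenBackboneClassifier(nn.Module):
    """Frozen feature extractor followed by a small trainable classification head."""

    def __init__(self, backbone, hidden_size=HIDDEN_SIZE, num_classes=NUM_CLASSES):
        super().__init__()
        self.backbone = backbone
        for param in self.backbone.parameters():
            param.requires_grad = False  # freeze: no weight updates during training
        self.head = nn.Sequential(
            nn.Linear(hidden_size, 128),  # project features to 128 dimensions
            nn.ReLU(),                    # non-linearity
            nn.Dropout(p=0.3),            # regularization
            nn.Linear(128, num_classes),  # logits for the four classes
        )

    def forward(self, pixel_values):
        with torch.no_grad():  # the backbone is used strictly as a feature extractor
            features = self.backbone(pixel_values)
        return self.head(features)


model = FrozenBackboneClassifier(StandInBackbone())
model.eval()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)
```

Only the parameters in `model.head` require gradients, so an optimizer built from `model.head.parameters()` trains the head alone, and only those weights need to be exchanged with the server.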
Once the global model has been broadcast to the clients, each client trains the classification head on its own locally stored dataset, which is kept entirely on-device to maintain data confidentiality. Instead of sharing raw images, clients return only their updated model parameters to the central server. The server then aggregates these updates into an improved global model using either the classical Federated Averaging (FedAvg) algorithm or the Federated Proximal (FedProx) method. FedAvg computes a weighted average of client parameters according to local dataset sizes, while FedProx adds a proximal term to each client's local objective, penalizing excessive deviation from the global model in order to stabilize training under non-IID client distributions.
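The two aggregation ideas can be made concrete with a small NumPy sketch. It is an illustration under assumptions rather than the implementation used in this work: parameters are represented as plain dictionaries of arrays, and the names `fedavg` and `fedprox_penalty`, as well as the value of `mu`, are hypothetical.

```python
import numpy as np


def fedavg(client_params, client_sizes):
    """Average client parameters, weighting each client by its local dataset size."""
    total = float(sum(client_sizes))
    return {
        key: sum((n / total) * params[key] for params, n in zip(client_params, client_sizes))
        for key in client_params[0]
    }


def fedprox_penalty(local_params, global_params, mu):
    """Proximal term (mu / 2) * ||w_local - w_global||^2 added to a client's local loss."""
    squared_distance = sum(
        np.sum((local_params[key] - global_params[key]) ** 2) for key in global_params
    )
    return 0.5 * mu * squared_distance


# Two clients with 160 images each: equal weights, so FedAvg reduces to a plain mean
client_a = {"head.weight": np.array([1.0, 1.0])}
client_b = {"head.weight": np.array([3.0, 3.0])}
global_params = fedavg([client_a, client_b], [160, 160])
penalty = fedprox_penalty(client_a, global_params, mu=0.1)
print(global_params["head.weight"], penalty)
```

With IID shards of equal size the FedAvg weights are identical for all clients; the size weighting only matters when clients hold different amounts of data.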
After aggregation, the refined global model is redistributed to the clients, and this cycle of local training and global updating continues for multiple federated rounds, progressively enhancing predictive performance and generalization across the distributed environment. Upon convergence, the final global model proceeds to an explainability stage in which a set of explainability techniques, i.e., attention, Grad-CAM, CLS and Rollout, is applied to a separate test set. These complementary methods generate heatmaps that highlight the image regions most influential to the model’s decisions, allowing verification that attention is correctly directed toward semantically meaningful PPE components such as helmets and reflective vests. This post-training explainability analysis ensures that the final model remains transparent, auditable, and consistent with safety-critical requirements while fully preserving the privacy of all client-side data throughout the entire workflow.
In the following, we provide a brief explanation related to the explainability techniques we considered.
The first one is the attention mechanism that, in ViTs, allows the model to focus on specific areas of an input image when making predictions. It works by calculating attention scores for each image patch in relation to others, essentially highlighting key regions of the image that influence the decision. These attention scores can then be visualized as attention maps, offering insight into which parts of the image the model prioritizes when making its prediction.
The second one is the Grad-CAM (Gradient-weighted Class Activation Mapping), which, in contrast, helps visualize which regions of an image are most influential in the model’s output. By computing the gradient of the output with respect to the final convolutional layer, Grad-CAM generates a weighted combination of feature maps that can be used to create a heatmap. This heatmap can be overlaid on the original image, visually showing which areas are most significant for the model’s decision.
The third one is the CLS (class token), which plays a critical role in ViT decision-making. This special token, added to the input image’s patch embeddings, captures global information about the image. After passing through the transformer layers, the final representation of this class token is used for classification, and by examining how the class token evolves across layers, we gain a clearer understanding of how the model processes the image and refines its predictions.
The fourth one is the Rollout, i.e., an advanced technique that helps visualize the self-attention maps within a transformer model. It builds on the concept of self-attention but extends it across multiple layers, providing a more comprehensive view of how attention is distributed and how different patches interact with one another. This iterative “rollout” process shows how information flows through the network, helping us to understand the relationships between patches and how the model integrates these dependencies to make its final decision.
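To make the rollout idea more concrete, the following NumPy sketch multiplies head-averaged attention matrices across layers and then reads the class-token row as a patch-level heatmap. It is a sketch under assumptions: the attention tensors are random stand-ins rather than outputs of the trained model, the sizes (12 layers, 3 heads, 196 patches plus one class token) are assumed values for a ViT-Tiny with $16 \times 16$ patches on $224 \times 224$ inputs, and the function names are hypothetical.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention of shape (heads, tokens, tokens) into one token-to-token map."""
    num_tokens = attentions[0].shape[-1]
    rollout = np.eye(num_tokens)
    for layer_attention in attentions:
        fused = layer_attention.mean(axis=0)                # average over heads
        fused = fused + np.eye(num_tokens)                  # account for the residual connection
        fused = fused / fused.sum(axis=-1, keepdims=True)   # keep each row summing to 1
        rollout = fused @ rollout                           # propagate through this layer
    return rollout


def cls_heatmap(rollout, grid_size):
    """Read the class-token row (token 0), drop the class token itself, scale to [0, 1]."""
    heatmap = rollout[0, 1:].reshape(grid_size, grid_size)
    span = heatmap.max() - heatmap.min()
    return (heatmap - heatmap.min()) / span if span > 0 else np.zeros_like(heatmap)


# Random stand-in attention: 12 layers, 3 heads, 197 tokens (assumed sizes)
rng = np.random.default_rng(0)
scores = rng.normal(size=(12, 3, 197, 197))
attn = np.exp(scores)
attn /= attn.sum(axis=-1, keepdims=True)  # softmax over the last axis

rollout = attention_rollout(list(attn))
heatmap = cls_heatmap(rollout, grid_size=14)
print(heatmap.shape)
```

The resulting $14 \times 14$ map would normally be upsampled to the input resolution before being overlaid on the image.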
The four techniques above provide a visual understanding of the model decision, i.e., a qualitative analysis of explainability. To complement it with a quantitative analysis, we compute metrics in two categories.
The first category is the intra-model one, i.e., the quantitative analysis of a single client model, which evaluates how far the different explainability techniques agree on the image areas identified as relevant for that client's model.
The second category is the inter-model one, i.e., the quantitative analysis across different clients, which evaluates how far two clients (and, thus, two different models) agree on the image areas identified as relevant.
Once the two categories are defined, Figure 2 and Figure 3 show the workflow of the quantitative analysis for the intra-model and inter-model categories, respectively. After each client produces its explainability outputs using the final global model, we first perform an intra-model evaluation, shown in Figure 2. In this stage, the explainability techniques are applied to the same server-trained model and the same input image. This allows us to compare how different explainability algorithms highlight relevant regions. High spatial overlap between the resulting heatmaps provides evidence of internal consistency in the model’s reasoning, confirming that distinct CAM algorithms converge on similar PPE-related features.
Figure 3 illustrates the inter-model explainability analysis. Here, we apply a single explainability technique to two models obtained from two different clients. This comparison focuses on whether the two federated models attend to the same image regions when making PPE compliance predictions.
In both categories, i.e., the intra-model and the inter-model one, we compute three different metrics (ranging from 0 to 1, where 1 is the best result) between the heatmaps: the Structural Similarity Index Measure (SSIM), the Visual Information Fidelity (VIF) and the Spatial Correlation Coefficient (SCC). SSIM evaluates perceptual similarity between two heatmaps in terms of luminance, contrast, and structural alignment, and it is defined as

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$

where $\mu_x$ and $\mu_y$ denote the mean intensities, $\sigma_x^2$ and $\sigma_y^2$ the variances, and $\sigma_{xy}$ the cross-covariance of heatmaps $x$ and $y$, with $C_1$ and $C_2$ stabilizing constants. VIF, computed using the implementation vifp from sewar.full_ref, measures the amount of visual information preserved between two heatmaps using a wavelet-based natural scene statistics model:

$$\mathrm{VIF} = \frac{\sum \log\left(1 + \dfrac{g^2 \sigma_s^2}{\sigma_v^2 + \sigma_n^2}\right)}{\sum \log\left(1 + \dfrac{\sigma_s^2}{\sigma_n^2}\right)}$$

where $g$ is the gain factor between reference and distorted signals, $\sigma_s^2$ and $\sigma_v^2$ represent the signal and visual noise variances, and $\sigma_n^2$ denotes the natural noise variance. Finally, the Spatial Correlation Coefficient (SCC) was introduced, which measures the local spatial correlation between activation maps. This coefficient provides an assessment of the degree of alignment of spatial structures and the distribution of gradients between two heatmaps:

$$\mathrm{SCC} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

where $x_i$ represents the pixel value in the first reference heatmap, $y_i$ the pixel value in the second reference heatmap, $\bar{x}$ stands for the mean pixel value of the first heatmap, $\bar{y}$ stands for the mean pixel value of the second activation map, and finally $n$ represents the total number of pixels.
High SSIM, VIF and SCC values indicate strong agreement between CAM methods on the same model.
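As a worked illustration of two of these indices, the sketch below evaluates the standard single-window SSIM expression (means, variances and cross-covariance with two stabilizing constants) and a correlation coefficient over all pixels of two heatmaps. It is a simplified sketch under assumptions: heatmaps are taken to be normalised to $[0, 1]$, the constants $(0.01)^2$ and $(0.03)^2$ are the commonly used defaults rather than values reported here, and library implementations typically average the statistics over local windows, so their outputs will differ from these global versions. The function names are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM computed from global statistics of two heatmaps."""
    c1 = (0.01 * data_range) ** 2  # assumed default stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x**2 + mu_y**2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)


def correlation_coefficient(x, y):
    """Correlation between two heatmaps over all n pixels."""
    dx, dy = x - x.mean(), y - y.mean()
    denominator = np.sqrt((dx**2).sum() * (dy**2).sum())
    return float((dx * dy).sum() / denominator) if denominator > 0 else 0.0


rng = np.random.default_rng(0)
heatmap_a = rng.random((14, 14))
heatmap_b = 1.0 - heatmap_a  # inverted map: perfectly anti-correlated

ssim_same = global_ssim(heatmap_a, heatmap_a)
scc_same = correlation_coefficient(heatmap_a, heatmap_a)
scc_inverted = correlation_coefficient(heatmap_a, heatmap_b)
print(ssim_same, scc_same, scc_inverted)
```

Identical heatmaps give the maximum score for both indices, whereas the inverted map shows that the correlation form can also take negative values when two maps highlight opposite regions.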
By combining intra-model and inter-model comparisons, the idea is to ensure that both the internal reasoning of each model and the consistency across models remain aligned with meaningful and safety-critical PPE features.
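The pairing protocol behind the two categories can be sketched as follows. This is an illustrative sketch only: the heatmaps are random stand-ins, the dictionary keys are hypothetical labels for the four techniques, and a plain Pearson correlation is used as a placeholder for whichever similarity index is being evaluated.

```python
from itertools import combinations

import numpy as np

TECHNIQUES = ["attention", "cls", "grad_cam", "rollout"]  # hypothetical labels


def pearson(a, b):
    """Placeholder similarity metric between two heatmaps."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


def intra_model_scores(heatmaps, metric):
    """Compare every pair of techniques on one client model and one image."""
    return {
        (first, second): metric(heatmaps[first], heatmaps[second])
        for first, second in combinations(sorted(heatmaps), 2)
    }


def inter_model_scores(client_a, client_b, metric):
    """Compare the same technique across two client models on one image."""
    shared = sorted(client_a.keys() & client_b.keys())
    return {name: metric(client_a[name], client_b[name]) for name in shared}


rng = np.random.default_rng(0)
client_1 = {name: rng.random((14, 14)) for name in TECHNIQUES}
client_2 = {name: client_1[name] + 0.05 * rng.random((14, 14)) for name in TECHNIQUES}

intra = intra_model_scores(client_1, pearson)
inter = inter_model_scores(client_1, client_2, pearson)
print(len(intra), len(inter))
```

Four techniques yield six intra-model pairs per client and image, while the inter-model analysis yields one score per technique for each pair of clients.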
In this context, it should be noted that the aim of our work is to analyse the consistency between different explainability techniques and between federated models, rather than merely assessing semantic correctness against an annotated gold standard. In this regard, the SSIM, VIF and SCC indices have been chosen as metrics for the quantitative assessment of similarity between explainability techniques on different clients and between different explainability techniques on the same client.
4. Related Work
The progressive increase in accidents in the workplace and the consequent need to monitor the correct use of PPE have led to the development of AI tools applied to this context. Several studies have investigated PPE detection using deep learning techniques, highlighting their applicability in real-world industrial environments [1,9,10,11,12,13].
The surveillance and monitoring of human actions remain a significant challenge for anomaly detection in industrial settings. For this reason, Han et al. [4] proposed an integrated ViT-based method that processes spatio-temporal data, where the position-encoding module organises non-sequential information, whilst the transformer encoder efficiently compresses the features of sequential data to improve computational speed. Human activity recognition is achieved using a multi-layer perceptron (MLP) classifier.
In this context, the authors of [1] proposed a real-time PPE detection system. Their study supports the effectiveness and importance of DL models in improving safety on construction sites through automated verification of the correct use of safety equipment. Additional contributions have shown the benefit of large-scale curated datasets for robust PPE recognition in diverse occupational environments [6,7,8,14].
Han and Zeng also addressed the issue of detecting safety helmets on construction sites using Deep Learning techniques. In this case, a dataset consisting of images of construction sites sourced from the internet was used; the study employed YOLOv5 as the main algorithm and incorporated a four-level detection method. The main evaluation metric was the mean average precision (mAP) [3]. The study also focused on detecting whether workers were wearing helmets. The proposed approach achieved an mAP of 0.92, compared with 0.86 for YOLOv5 alone. However, the work described does not guarantee the privacy and security of sensitive data contained in the image datasets, nor does it provide an explainability technique capable of demonstrating the system’s robustness through visual representations of the areas that are critical for classification.
The use of CNNs in PPE classification was addressed by Isalovic et al. [2], who tested various architectures including Faster R-CNN, MobileNetV2-SSD and YOLOv5. The proposed pipeline integrates the estimation of the area of interest on the head with a PPE detection system, using a dataset containing 12 different types of PPE. The results show that YOLOv5 achieves superior performance, with a recall of 0.61.
A safety analysis, based mainly on the classification of images of workers on construction sites, was also conducted using FML, which proved necessary for privacy protection. For example, the FedSH framework enables collaboration across multiple worksites without centralising sensitive data, ensuring both scalability and privacy [15,16,17,18].
The superiority of FL over traditional learning techniques has also been demonstrated by Makris et al., who developed an innovative approach integrating FL with the YOLOv8 architecture for advanced PPE object detection [19]. Their results indicate that FL offers strong performance while complying with strict requirements for privacy and data protection.
Although FML techniques are effective for the task at hand, model explainability remains crucial for understanding automated decisions. In this context, explainability techniques such as Grad-CAM [20,21] and transformer attention-based methods [22,23,24] play a central role in ensuring transparency in monitoring systems.
For instance, Di Renzo et al. propose a federated, privacy-preserving PPE detection system enriched with Grad-CAM visual explanations [5,25]. This integration helps reveal the discriminative image regions that most influence the model’s decisions, improving trust and safety in workplace environments.