1. Introduction
Federated learning [1] has emerged as a pivotal paradigm for distributed machine learning, enabling collaborative training of models without the need to centralize data. This approach is particularly advantageous in scenarios where data privacy [2], limited bandwidth, or constrained infrastructure hinder centralized data collection. In ecological monitoring, these challenges are common, especially in remote natural areas where edge devices, such as Raspberry Pi units equipped with lightweight machine learning models, can autonomously process and analyze data on-site.
To address the ongoing global decline in biodiversity, it is critical to adopt accurate and scalable monitoring strategies [3]. Within this context, birds serve as essential indicators of ecosystem health, migratory behavior, and biodiversity shifts [4,5]. Federated learning offers an opportunity to develop species classification models without the need to transmit sensitive or voluminous data, thereby preserving privacy and reducing bandwidth requirements. This is particularly useful in protected natural parks, where internet connectivity is limited and devices collect unbalanced, heterogeneous video data across diverse contexts. To evaluate this paradigm, we developed a tailored federated learning configuration (comprising model selection, training hyperparameters, and deployment strategy) specifically optimized for real-world wetland bird classification using a newly constructed dataset of annotated videos recorded in the Valencian Community, Spain.
Our experiments leverage the Flower framework to simulate federated training of lightweight convolutional neural networks (CNNs) on edge devices. We investigate various architectures and hyperparameter settings (number of clients, communication rounds, and local epochs), analyzing their performance under non-IID data conditions. This study further compares aggregation strategies such as FedAvg and FedProx to assess their robustness in heterogeneous settings. Although our focus is on a specific Spanish wetland region, the approach generalizes to other ecological monitoring scenarios involving different species, climates, or edge device constraints.
This work highlights the trade-offs between accuracy, model complexity, and inference time in federated systems and provides actionable insights into the deployment of scalable and privacy-aware classification systems in the field. The key contributions of this study are as follows:
Empirical design and evaluation of a federated learning configuration tailored for ecological monitoring, including optimal model and hyperparameter combinations under non-IID data and resource constraints.
Benchmarking of lightweight neural network models (e.g., WideResNet50_2, EfficientNetV2, MnasNet) for bird species detection, considering performance, inference time, and deployment feasibility on Raspberry Pi devices.
Analysis of aggregation strategies (FedAvg vs. FedProx) and their implications for accuracy, model drift, and robustness in real-world federated training scenarios.
By keeping data localized and only exchanging model parameters, our federated learning setup reduces transmission overhead and enables distributed collaboration among decentralized devices. This makes it a practical and extensible solution for biodiversity monitoring across varied ecological and geographic conditions. The complete source code used in our simulations is publicly available (https://github.com/3dperceptionlab/Federated_Learning_WetlandBirds) (accessed on 28 June 2025).
The remainder of the paper is organized as follows:
Section 2 reviews the state of the art in federated learning and bird species recognition.
Section 3 explains the methodology followed to design the experiments of this study.
Section 4 presents the results obtained from the experiments.
Section 6 summarizes the work and the conclusions drawn.
2. Related Works
2.1. Federated Learning
Federated learning is a distributed machine learning paradigm in which multiple computational nodes, located across diverse geographic regions, collaboratively develop a shared predictive model without transmitting local datasets to a centralized location, so that all training data remain localized. This capability is pivotal in domains characterized by stringent data sensitivity and privacy requirements, including healthcare [6], finance [7], and IoT-based applications [8,9].
Federated learning systems typically consist of two principal components: clients and a central server. The server initializes a global model, which is distributed to client devices, often resource-constrained IoT devices or smartphones. Each client trains the model locally using its private dataset and transmits updates (model parameters) back to the server. These updates are aggregated to refine the global model in an iterative process known as federated optimization [10,11]. This iterative training enables the system to converge towards a robust, generalized global model without centralizing sensitive data [12].
Figure 1 illustrates the training phase of a typical federated learning system, emphasizing the interactions between clients and the central server. This distributed paradigm has proven instrumental in enabling collaborative learning across sectors that demand high standards of data confidentiality while addressing challenges related to computational resource constraints and communication efficiency [11,13]. Building on these principles, we leverage a similar client–server architecture, deploying multiple Raspberry Pi devices as clients, to train lightweight species detection models.
To understand the foundation of federated learning, it is essential to outline its defining characteristics and underlying assumptions, as these significantly influence its research and practical implementation. The following are the primary features and challenges of a typical federated learning setup:
Non-IID Data: Unlike traditional machine learning, where training data is often assumed to be independent and identically distributed (IID), federated learning operates on data generated by diverse devices and users, which results in non-IID data. Such heterogeneity complicates the aggregation of model updates, introducing challenges in achieving optimal convergence and generalization [14,15]. In our case, this issue is particularly relevant because our dataset is relatively small and the videos can differ substantially in terms of species and the actions performed by the birds.
Optimization Algorithms: Federated learning relies on iterative optimization techniques involving multiple rounds of communication between clients (devices) and the central server. Each training round combines local model updates to refine a global model. Efficient design and selection of optimization algorithms are fundamental to the scalability and accuracy of federated learning [16].
Privacy and Security: A cornerstone of federated learning is its ability to preserve privacy by ensuring that raw data remains on local devices. Instead of sharing data, federated learning facilitates the exchange of model parameters or updates, reducing exposure to potential breaches. Nevertheless, federated learning is still vulnerable to attacks such as gradient inversion or poisoning, necessitating robust privacy-preserving and security-enhancing measures [17,18].
Heterogeneous Devices: Federated learning involves a decentralized network of devices with varying computational power, memory, and communication capabilities. This heterogeneity poses challenges in coordinating devices, accommodating differences in hardware performance, and managing connectivity limitations. Additionally, asynchronous updates caused by device variability can degrade the overall system performance, demanding innovative solutions to ensure effective collaboration [19].
Federated learning (FL) is a promising technique within the IoT domain due to its ability to protect data privacy and optimize resource use [20]. By allowing model training directly on IoT devices without the need to transfer sensitive data to a central server, FL ensures data protection. Moreover, FL reduces dependence on a centralized server for model creation by enabling local computations on devices that only transmit model updates [21]. This approach addresses the challenges posed by distributed, private data in IoT applications, which make traditional centralized learning impractical. Additionally, FL minimizes communication resource consumption, a critical factor in scenarios with limited bandwidth [22].
On the other hand, FL promotes the development of sophisticated models by enabling collaborative learning from multiple devices, taking advantage of the data heterogeneity generated by a vast IoT network. This is especially relevant for resource-constrained devices like robots and drones, which introduce new challenges in performance optimization on platforms like Raspberry Pi. Consequently, FL research focuses on enhancing distributed training in constrained environments, using techniques such as high-capacity device optimization and model parameter compression. Transfer learning is also explored to improve performance in scenarios involving diverse clients. Furthermore, FL is considered a promising technology for future edge intelligence frameworks, particularly in the context of 6G standards [23].
A key aspect of FL’s success in IoT is the evaluation of its performance. This evaluation is crucial for understanding its effectiveness and optimizing its implementation. The evaluation metrics include communication efficiency, participant contribution, and overall model performance [24]. As FL gains traction, several frameworks and tools have been developed to aid its implementation, including TensorFlow Federated, PyTorch Federated, Flower, and FedML, which provide environments for building and deploying FL algorithms [25]. Future research directions focus on developing more efficient aggregation algorithms for complex and heterogeneous data, improving security and privacy mechanisms, creating more scalable and robust solutions, and optimizing communication protocols for the IoT. Additionally, establishing benchmarks and standardized datasets to evaluate FL in IoT environments is essential to advance both research and practical deployment of this technology [26].
2.2. Federated Learning Challenges
As previously noted, a primary challenge facing federated learning (FL) arises from heterogeneity across three domains: data, model, and system. Data may not be independent and identically distributed (non-IID), meaning local datasets can have different distributions, hindering the aggregation and generalization of the global model. Model heterogeneity can result in structural differences between local models, while system heterogeneity refers to the varying capabilities of participating devices in terms of processing, storage, and bandwidth [17]. To address these challenges, strategies such as personalized learning, transfer learning, and clustering techniques have been proposed, which adapt models to the local characteristics of each device. Regarding model aggregation, algorithms like FedAvg, FedMA, and FedProx have been developed to combine local models into a global model [27]. The effectiveness of FL depends on the choice of aggregation algorithm. Proper alignment of weights between local models is crucial for correct knowledge fusion, and mechanisms have been developed to manage gradients and improve the robustness of aggregation against non-IID data [15].
Another challenge in IoT settings is that devices often face issues such as connectivity variability, energy efficiency, and system scalability. These challenges require optimization and client selection techniques [28]. Client selection is essential for improving FL efficiency, especially when a large number of devices have limited resources. In this context, devices are evaluated based on factors such as convergence speed, communication costs, and model performance, and a subset is chosen to participate in each training round. The selection strategy (whether random or greedy) affects system performance and efficiency, with each approach having its own advantages and disadvantages [15].
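The contrast between random and greedy client selection can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the client identifiers and the per-client utility scores are hypothetical, and the cited works may define and combine the selection criteria differently.

```python
import random
from typing import Dict, List, Optional, Sequence


def select_random(client_ids: Sequence[str], k: int,
                  seed: Optional[int] = None) -> List[str]:
    """Pick k distinct clients uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(client_ids), k))


def select_greedy(scores: Dict[str, float], k: int) -> List[str]:
    """Pick the k clients with the highest score (ties broken by client id)."""
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return sorted(ranked[:k])


# Hypothetical utility scores, e.g. a blend of link quality and recent
# loss reduction; these numbers are illustrative only.
scores = {"pi-0": 0.2, "pi-1": 0.9, "pi-2": 0.5, "pi-3": 0.7}

greedy_pick = select_greedy(scores, k=2)                  # always the top scorers
random_pick = select_random(list(scores), k=2, seed=0)    # any two clients
```

Greedy selection repeatedly favors the same high-scoring devices, which can speed up convergence but may bias the global model towards their data, whereas random selection spreads participation at the cost of occasionally picking slow or uninformative clients.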
Beyond technical challenges related to heterogeneity and client selection optimization, privacy and security are critical concerns in FL. Techniques such as homomorphic encryption (HE) and secure multiparty computation (SMPC) are used to perform computations on encrypted data, avoiding the need for decryption [29]. Despite these safeguards, FL remains vulnerable to adversarial attacks, such as model poisoning, which can compromise model integrity. Ongoing research aims to develop robust mechanisms to mitigate these risks and ensure security and privacy throughout the learning process [30].
2.3. Bird Species Detection
Automatic bird species detection is a critical application of artificial intelligence in biodiversity monitoring, allowing ecologists to track species distribution, behavior, and ecosystem health. Traditional methods rely heavily on human expertise, making them labor-intensive, time-consuming, and prone to errors. Machine learning and deep learning approaches offer efficient alternatives to species identification using images or audio data.
Several studies have demonstrated the effectiveness of deep learning in wildlife monitoring. For example, Gomez et al. [31] showed the potential of very deep convolutional neural networks to identify species in camera trap images, achieving up to 88.9% Top-1 accuracy on curated datasets. A comprehensive review by Zhang et al. [32] highlighted how convolutional neural networks have been successfully applied to species identification tasks, surpassing traditional methods in accuracy and scalability. However, challenges [31] such as imbalanced datasets, rare species detection, and computational limitations on edge devices [33] persist.
Recent advancements in edge computing have enabled the deployment of lightweight deep learning models on devices like Raspberry Pi. For instance, Mwangi et al. [34] developed an IoT system integrating camera traps and deep learning models to classify wildlife images directly on edge devices, achieving real-time performance with minimal latency. Despite promising results, these approaches face limitations related to unbalanced data distributions and the need for optimization to enhance edge device performance.
A recent study [35,36] evaluated the performance of several state-of-the-art convolutional neural network architectures, including MobileNet [37], AlexNet [38], InceptionResNet V2 [39], Inception V3 [40], and EfficientNet [41], on bird species datasets. Using a Kaggle dataset of 11,488 images of 200 bird species, augmented to 40,000 images, the study identified EfficientNet as the top-performing model, with a test accuracy of 87.13%. MobileNet and EfficientNet had the shortest training times, making them suitable for resource-constrained environments such as edge devices.
The integration of federated learning into bird species detection systems offers significant potential. By enabling collaborative training across distributed devices deployed in remote areas, federated learning can leverage local data while preserving privacy. This approach not only reduces the reliance on centralized data collection but also enhances the scalability and adaptability of biodiversity monitoring systems.
3. Methodology
This study leverages the Flower framework, a versatile and scalable platform to implement federated learning in simulated environments. Flower allows for flexible experimentation with federated learning parameters, including model architecture, communication frequency, and aggregation strategies, providing a robust foundation for testing lightweight deep learning models in distributed scenarios.
A federated learning simulation was performed using PyTorch 2.2 within the Flower framework to evaluate the performance of different deep learning architectures for bird species recognition. The primary objective of this study was to evaluate the effectiveness of federated learning in collaboratively training lightweight models optimized for deployment on resource-constrained edge devices, specifically for the task of bird species classification.
The simulation environment emulated a network of client devices, each representing an edge device collecting and processing local data. Clients trained the models locally on their private datasets and periodically shared updates with a centralized server. The server aggregated these updates to refine the global model using federated averaging (FedAvg) [1]. In FedAvg, the server combines the updates from clients using a weighted average to create a new global model, which is shared back with the devices (see Figure 2). This iterative process was repeated for a predefined number of communication rounds, simulating the collaborative training process characteristic of federated learning. The client network used for the simulation comprised Raspberry Pi units. Specifically, each unit was a Raspberry Pi 4 Model B equipped with 8 GB of memory, providing sufficient computational resources to run the experiments effectively.
Aggregation strategies. Two aggregation strategies were implemented in this study: Federated Averaging (FedAvg) and Federated Proximal (FedProx). In FedAvg, the global model at round $t$ is updated using a weighted average of the client models,
$$w^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, w_k^{t+1},$$
where $w_k^{t+1}$ is the model from client $k$, $n_k$ is the number of local data samples, and $n = \sum_{k=1}^{K} n_k$ is the total number of samples across all $K$ clients.
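As a concrete illustration of the sample-weighted average described above, the following minimal sketch aggregates flat parameter vectors in plain Python. The function name `fedavg` and the list-of-floats representation are assumptions made for readability; in the actual simulations the aggregation is handled by the Flower framework on full model tensors.

```python
from typing import List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average client parameter vectors, weighting client k by n_k / n."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client model")
    total = sum(client_sizes)  # n: total number of samples across clients
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    global_weights = [0.0] * dim
    for weights, n_k in zip(client_weights, client_sizes):
        if len(weights) != dim:
            raise ValueError("all client models must have the same shape")
        for i, w in enumerate(weights):
            global_weights[i] += (n_k / total) * w
    return global_weights


# Two toy clients: the second holds three times as much data,
# so its parameters dominate the average.
new_global = fedavg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
```

With these toy inputs the result is `[2.5, 3.5]`, i.e. each coordinate lies three quarters of the way towards the larger client's value.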
FedProx modifies this approach by incorporating a proximal term into the local objective function to limit the deviation of each client’s model from the global model. The local optimization problem solved by each client is
$$\min_{w} \; F_k(w) + \frac{\mu}{2}\, \lVert w - w^{t} \rVert^{2},$$
where $F_k(w)$ is the local loss function, $w^{t}$ is the global model at round $t$, and $\mu$ is a hyperparameter that controls the influence of the proximal term. This regularization helps mitigate client drift in heterogeneous data settings by anchoring local updates closer to the global model. In our experiments, we set $\mu$ to a value that is commonly recommended in the literature as a robust default for non-IID federated scenarios.
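To make the effect of the proximal term concrete, the sketch below evaluates the FedProx local objective and its gradient for flat parameter vectors. The function names and the value of `mu` used in the example are illustrative assumptions, not the setting used in our experiments.

```python
from typing import List, Sequence


def fedprox_objective(local_loss: float, weights: Sequence[float],
                      global_weights: Sequence[float], mu: float) -> float:
    """Local loss plus (mu / 2) * squared distance to the global model."""
    sq_dist = sum((w - g) ** 2 for w, g in zip(weights, global_weights))
    return local_loss + 0.5 * mu * sq_dist


def fedprox_gradient(local_grad: Sequence[float], weights: Sequence[float],
                     global_weights: Sequence[float], mu: float) -> List[float]:
    """Gradient of the objective: local gradient plus mu * (w - w_global)."""
    return [g_loc + mu * (w - g)
            for g_loc, w, g in zip(local_grad, weights, global_weights)]


# Illustrative numbers only (mu = 0.5 is NOT the paper's setting).
w_local, w_global = [1.0, 2.0], [0.0, 0.0]
objective = fedprox_objective(1.0, w_local, w_global, mu=0.5)
gradient = fedprox_gradient([0.0, 0.0], w_local, w_global, mu=0.5)
```

Even when the local gradient is zero, the extra term `mu * (w - w_global)` pulls the client's parameters back towards the global model, which is the mechanism that limits client drift. Setting `mu = 0` recovers the plain local objective used with FedAvg.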
3.1. Datasets
To perform the bird species classification task, different approaches can be employed. As discussed in Section 2.3, species classification can be addressed using computer vision techniques; however, another common approach involves analyzing bird vocalizations [42]. The choice of modality (visual or acoustic) determines the type of dataset required.
Table 1 presents a comparison of datasets across different modalities. While BirdCLEF2024 [43] is an ideal choice for studies focused on acoustic classification through bird vocalizations, datasets such as CUB-200-2011 [44] and Visual WetlandBirds [45] are designed for species classification using visual information from images and videos, respectively.
Since the goal of this study is to classify bird species based on visual data, BirdCLEF2024 was excluded, as it is composed exclusively of audio recordings. The CUB-200-2011 dataset contains images of 200 bird species from around the world. However, in this dataset, birds are typically centered, primarily displayed in the foreground, and clearly distinguishable from the background. These controlled conditions differ significantly from the types of images captured in real-world deployments, and therefore this dataset was not deemed suitable for our use case. In contrast, the Visual WetlandBirds dataset comprises video footage recorded in natural, uncontrolled environments, presenting more realistic and challenging conditions. This makes it better suited for training models intended for real-world applications. To utilize this dataset, the videos were segmented into individual frames, which were then used to train the models evaluated in this study.
3.2. Visual WetlandBirds
The dataset employed in this study is the Visual WetlandBirds dataset, which comprises videos of native bird species from the Valencian Community in Spain. These videos have been manually annotated and processed to generate a collection of labeled frames capturing various bird species engaged in a range of behaviors. The dataset stands out for its fine-grained, per-frame annotations, which include bird species, behaviors, and bounding boxes. It covers 13 bird species commonly found in the wetlands of the Valencian region and seven behaviors: Alert, Feeding, Flying, Preening, Resting, Swimming, and Walking. The dataset contains a total of 858 behavior clips (a behavior clip is a bird in a video performing a specific action over a given period of time), with a mean duration of approximately 20 s. The full dataset consists of 178 videos, amounting to 58 min of video recordings. It includes a variety of lighting conditions, backgrounds, and viewpoints, which contribute to the development of robust models capable of handling a wide range of contexts.
Figure 3 shows some frame samples extracted from the dataset.
After extracting frames from the videos, the data were organized according to both species and activity type, resulting in a total of 77,604 frames. To ensure balanced evaluation and reliable performance metrics, the dataset was divided into 51,017 training samples, 13,077 validation samples, and 13,510 testing samples. Before inputting the data into the neural networks, several preprocessing techniques were applied to enhance the quality and robustness of the training process. These preprocessing steps included a normalization procedure to standardize the data and a series of data augmentation methods. The augmentation techniques employed included rotation, scaling, and flipping, which were used to mitigate overfitting and enhance the model’s generalization capability. These methods expanded the diversity of the training data, enabling the model to better handle variations in bird species appearance.
To enable a realistic federated training simulation, the training samples were further split into subsets corresponding to the number of clients involved in each experimental configuration. In this partitioning process, all frames originating from the same video were allocated to a single client’s subset. Additionally, bird species were distributed across clients using heterogeneous sampling strategies, such that certain species were deliberately excluded from some clients. This reflects the real-world deployment scenario, where each Raspberry Pi device operates in a distinct location with varying local fauna and environmental conditions. This setup ensures both temporal and semantic coherence within each client’s dataset while introducing inter-client variability that mimics natural ecological heterogeneity. By preserving the local data distribution in this manner, we can more accurately approximate the operational conditions of a federated learning environment for biodiversity monitoring.
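The partitioning rule described above (all frames of a video go to one client, and some species are withheld from some clients) can be sketched as follows. This is a simplified illustration, not the exact procedure used in our code: the `(video_id, species, frame_index)` record layout, the one-species-per-video simplification, and the greedy size-balancing rule are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical frame record: (video_id, species, frame_index).
Frame = Tuple[str, str, int]


def partition_by_video(frames: List[Frame],
                       client_species: Dict[str, Set[str]]) -> Dict[str, List[Frame]]:
    """Assign each video, with all of its frames, to exactly one client.

    A video may only go to a client whose allowed-species set contains
    the video's species; among eligible clients, the one currently
    holding the fewest frames is chosen.
    """
    videos: Dict[str, List[Frame]] = defaultdict(list)
    for frame in frames:
        videos[frame[0]].append(frame)

    shards: Dict[str, List[Frame]] = {cid: [] for cid in client_species}
    # Place the largest videos first so shard sizes stay roughly balanced.
    for video_id in sorted(videos, key=lambda v: (-len(videos[v]), v)):
        species = videos[video_id][0][1]
        eligible = [c for c, allowed in client_species.items() if species in allowed]
        if not eligible:
            raise ValueError(f"no client accepts species {species!r}")
        target = min(eligible, key=lambda c: (len(shards[c]), c))
        shards[target].extend(videos[video_id])
    return shards


# Toy example: species "B" is deliberately excluded from client "c0".
frames = ([("v1", "A", i) for i in range(3)]
          + [("v2", "A", i) for i in range(2)]
          + [("v3", "B", i) for i in range(2)])
shards = partition_by_video(frames, {"c0": {"A"}, "c1": {"A", "B"}})
```

Keeping videos intact avoids leaking near-duplicate consecutive frames across clients, while the per-client species sets produce the label skew that characterizes the non-IID setting.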
3.3. Models
To address the computational constraints of edge devices such as Raspberry Pi units, this study focused on lightweight deep learning models optimized for object detection and classification tasks. While state-of-the-art architectures like YOLO (You Only Look Once) [46] provide high accuracy, their computational demands render them unsuitable for resource-constrained environments. Instead, simpler yet effective architectures were selected to balance accuracy and efficiency. MobileNet, known for its use of depth-wise separable convolutions, was chosen for its ability to significantly reduce computational complexity while maintaining competitive performance. EfficientNet, another model considered in this study, uses a compound scaling strategy that uniformly adjusts the network’s depth (number of layers), width (number of channels), and input resolution in a balanced way. This approach allows the model to achieve high accuracy while remaining computationally efficient, making it well suited for deployment in resource-constrained environments. Additionally, simplified YOLO-tiny [47] variants were included to provide a balance between object detection capabilities and computational efficiency, adapting the powerful detection principles of YOLO to meet the hardware limitations of edge devices. These models were configured with minimal parameter sizes and specifically tailored to handle the challenges of non-IID data distributions that are typical in federated learning environments, making them suitable for the real-world deployment scenarios simulated in this study.
The federated learning simulation began with the initialization of a global model by the central server, which was configured with pre-trained parameters. Each client device, representing a simulated edge device, was assigned a unique subset of the dataset to replicate the non-IID data environment encountered in practical federated learning scenarios. During each communication round, clients trained the global model locally on their private datasets using stochastic gradient descent (SGD) with predefined learning rates and batch sizes, generating updated model weights. These updates were then transmitted to the server, which employed the FedAvg algorithm to aggregate the model weights, computing a weighted average based on the dataset sizes of the participating clients. The aggregated global model was subsequently redistributed to all clients, marking the start of a new training round. This iterative process was repeated for a predefined number of communication rounds or until the global model achieved a satisfactory level of performance. By simulating the client–server interactions, the study effectively replicated a collaborative training process characteristic of real-world federated learning systems, enabling the evaluation of lightweight models in a distributed and resource-constrained environment.
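The round structure described above (broadcast, local SGD, weighted aggregation, redistribution) can be illustrated with a deliberately tiny toy problem. The sketch below fits a single scalar parameter to data generated from y = 2x; the data, learning rate, and function names are illustrative assumptions and bear no relation to the CNN models or hyperparameters of the actual experiments.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def local_sgd(w: float, data: List[Sample], lr: float, epochs: int) -> float:
    """Run per-sample SGD on the squared error (w * x - y)^2."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2.0 * (w * x - y) * x
            w -= lr * grad
    return w


def run_federated(client_data: List[List[Sample]], rounds: int,
                  lr: float = 0.05, local_epochs: int = 2,
                  w0: float = 0.0) -> float:
    """Simulate FedAvg: broadcast, train locally, average by dataset size."""
    total = sum(len(d) for d in client_data)
    w_global = w0
    for _ in range(rounds):
        # 1) Every client starts from the current global model.
        local_models = [local_sgd(w_global, d, lr, local_epochs)
                        for d in client_data]
        # 2) The server aggregates with weights n_k / n.
        w_global = sum(len(d) / total * w
                       for d, w in zip(client_data, local_models))
    return w_global


# Two clients observing different input ranges of the same relation y = 2x.
clients = [
    [(0.5, 1.0), (1.0, 2.0)],
    [(1.5, 3.0), (2.0, 4.0), (2.5, 5.0)],
]
w_final = run_federated(clients, rounds=20)
```

Because both toy clients share the same underlying relation, the global parameter converges to 2 within a few rounds; with genuinely conflicting local distributions, as in our non-IID setting, the averaged model can end up far from any client's optimum.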
The effectiveness of the federated learning system was assessed using a comprehensive set of evaluation metrics designed to capture the multidimensional performance of the system. Classification accuracy was measured to quantify the percentage of correctly identified bird species in the test dataset, serving as the primary indicator of the models’ prediction capabilities. To account for the constraints of edge device deployment, model size and inference time were measured, providing insights into the computational footprint and real-time inference capabilities of the selected architectures. As can be seen in Table 2, these metrics offered a holistic view of the trade-offs between model performance, computational efficiency, and communication costs, ensuring that the models evaluated were not only accurate but also practical for deployment in real-world ecological monitoring scenarios.
An additional and persistent challenge in federated learning, particularly relevant to ecological monitoring, is the presence of non-IID (non-independent and identically distributed) data across clients. In our study, each Raspberry Pi device was assigned frames originating from different video sources and locations, leading to significant heterogeneity in species, backgrounds, and behaviors. This distribution realistically reflects real-world deployments, where localized environments result in inherently biased data partitions.
While our experimental design explicitly preserved this heterogeneity to evaluate realistic deployment conditions, we acknowledge that it also limited the global model’s performance due to difficulties in aggregation. To address this, emerging approaches such as personalized federated learning, where models are adapted locally to each client, or clustered federated learning, where clients are grouped based on data similarity, offer promising directions. These strategies aim to mitigate the impact of non-IID data by promoting partial specialization or improved aggregation alignment. Although implementing these methods is beyond the scope of the present work, future research will explore their integration into biodiversity monitoring systems, particularly to enhance generalization under diverse and fragmented data conditions.
4. Results
In this study, we conducted extensive experiments using various lightweight deep learning models to evaluate their performance in a federated learning setup. The models tested included WideResNet50_2 [48], EfficientNetV2 [41], MnasNet [49], GoogLeNet [50], and ResNet [51]. Each model was trained and evaluated under different configurations, varying the number of clients (between 5 and 15), the number of local epochs (10, 20, and 30), and the number of communication rounds (up to 30). The goal was to identify the optimal combination of parameters for our dataset, which consists of labeled images of bird species exhibiting diverse behaviors.
Implementation details. To ensure transparency and reproducibility, we provide the main implementation settings used in all experiments. All models were trained using the PyTorch framework integrated with Flower for federated learning simulations. The optimizer used was stochastic gradient descent (SGD) with momentum set to 0.9, a fixed learning rate of 0.001, and a batch size of 32. We conducted experiments across up to 30 communication rounds, with 15 rounds selected as the optimal value based on convergence behavior. All images were resized and normalized before training, and the following data augmentation techniques were applied to the training set: random rotation (±15°), scaling between 0.8 and 1.2, and horizontal flipping. These preprocessing and optimization parameters were held constant across all configurations to enable fair comparison between models and to support reproducibility of the results.
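For convenience, the settings listed above can be gathered in a single immutable configuration object. The sketch below is one possible way to do so; the class and field names are hypothetical and do not correspond to identifiers in our released code, while the values are those stated in the text.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainConfig:
    """Shared training settings (field names are illustrative)."""
    optimizer: str = "sgd"
    momentum: float = 0.9
    learning_rate: float = 0.001
    batch_size: int = 32
    max_rounds: int = 30           # upper bound explored
    selected_rounds: int = 15      # value chosen from convergence behavior
    rotation_degrees: float = 15.0                  # random rotation of +/-15 degrees
    scale_range: Tuple[float, float] = (0.8, 1.2)   # random scaling factors
    horizontal_flip: bool = True


cfg = TrainConfig()
settings = asdict(cfg)  # plain dict, e.g. for logging alongside results
```

Freezing the dataclass guards against accidental changes between model runs, which is what keeps the comparison across architectures fair.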
Evaluation metrics. We report three performance measures: (1) Best Client Accuracy (BCA)—the highest accuracy reached by any single client; (2) Centralized Accuracy (CA)—the accuracy of the aggregated global model; (3) Centralized F1-score (CF1)—the macro-averaged F1-score of the global model, capturing balance across classes.
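The three measures can be computed as in the following minimal sketch, written in plain Python for transparency. The function names and the toy labels are illustrative assumptions; an established library routine for the macro-averaged F1-score would serve equally well.

```python
from typing import Dict, List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores: List[float] = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def best_client_accuracy(client_accuracies: Dict[str, float]) -> float:
    """BCA: the highest accuracy reached by any single client."""
    return max(client_accuracies.values())


# Toy example: a model that always predicts the majority class "a".
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a"]
ca = accuracy(y_true, y_pred)    # 0.75
cf1 = macro_f1(y_true, y_pred)   # (6/7 + 0) / 2 = 3/7
bca = best_client_accuracy({"pi-0": 0.40, "pi-1": 0.95})
```

The toy example shows why CF1 is reported alongside CA: ignoring the minority class still yields 75% accuracy, but the macro-averaged F1-score drops to about 0.43 because every class counts equally.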
The experimental results shown in Table 3 highlight the effectiveness of WideResNet50_2, which achieved the highest centralized F1-score (0.243) using 10 clients and FedAvg. However, a notable observation across all configurations is the significant discrepancy between Best Client Accuracy (BCA) and Centralized Accuracy (CA), with gaps as large as 99.89% (BCA) vs. 18.01% (CA) in the case of EfficientNetV2.
This gap is a direct consequence of the non-IID nature of the client data: each client trains exclusively on a limited set of bird species, often from a single location, resulting in narrow and highly homogeneous local distributions. Under these conditions, models can easily overfit to their local task, inflating BCA. Conversely, the centralized model must generalize across the union of all client distributions—which include diverse species, backgrounds, and video conditions—greatly increasing task complexity. This results in lower CA values and illustrates the fundamental challenge of model aggregation in realistic federated setups.
To provide insight into the consistency of global model performance, we additionally report the standard deviation of CA values across configurations with different client counts. For instance, in the case of WideResNet50_2 (FedAvg), the centralized accuracy varies from 21.87% to 24.33%, with a standard deviation of 1.25. Similarly, MnasNet exhibits a CA range from 5.11% to 18.00%, highlighting how both architecture choice and client configuration significantly impact robustness.
While a full statistical treatment (e.g., confidence intervals or repeated trials) is outside the scope of this initial exploration, these variations already illustrate the sensitivity of global FL performance under heterogeneous client conditions.
For comparison, Table 4 presents the results of training the same models using a conventional centralized setup, where all training data are pooled together on a single node. As expected, this setup yields substantially higher performance across all centralized metrics. EfficientNetV2, for instance, reaches an F1-score of 0.9041 and an accuracy of 91.42%, far exceeding the performance achieved in any federated configuration. This result reinforces a key point: training with full access to balanced and comprehensive data distributions is significantly easier than federated training in real-world, non-IID environments.
Our study intentionally reproduces the deployment scenario where each edge client (Raspberry Pi) operates in a distinct location, observing different species and environmental conditions. These variations lead to strong inter-client data heterogeneity, making it extremely challenging to train a global model that generalizes well across all devices. The lower centralized accuracies observed in the federated setting are not a flaw of the approach but rather a reflection of this complexity. Consequently, the value of our work lies not in competing with centralized baselines, which are infeasible in practice due to privacy and bandwidth constraints, but in highlighting and analyzing the robustness of lightweight architectures and training configurations under realistic field conditions.
The results also demonstrate how aggregation strategy affects model convergence. In multiple cases, FedProx yielded slightly better centralized accuracies and F1-scores than FedAvg, particularly under lower client counts, suggesting that its proximal term helps stabilize updates when local data distributions vary. The computational cost of adaptive server-side optimizers such as FedAdam exceeded our available hardware capacity, so their evaluation remains part of future work.
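The difference between the two strategies can be made concrete with a small sketch on plain Python lists. It follows the standard formulations of FedAvg (a sample-size-weighted average of client parameters) and FedProx (a local objective with an added proximal penalty); the function names are hypothetical, and this is neither Flower's implementation nor the authors' code.

```python
def fedavg(client_weights, client_sizes):
    """Sample-size-weighted average of client parameter vectors (FedAvg)."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def fedprox_local_step(w, w_global, grad, lr, mu):
    """One local gradient step on loss(w) + (mu / 2) * ||w - w_global||^2.

    The proximal term contributes mu * (w - w_global) to the gradient, which
    pulls the local model back towards the last global model.
    """
    return [
        wi - lr * (gi + mu * (wi - wg))
        for wi, gi, wg in zip(w, grad, w_global)
    ]


# Two hypothetical clients holding 1 and 3 samples respectively.
global_w = fedavg([[1.0, 2.0], [3.0, 4.0]], [1, 3])  # [2.5, 3.5]

# With a zero task gradient, only the proximal pull moves the local weights.
stepped = fedprox_local_step([1.0], [0.0], [0.0], lr=0.1, mu=1.0)
```

Setting `mu` to 0 recovers a plain local gradient step, so FedProx reduces to FedAvg-style local training in that case; larger values of `mu` limit how far a client can drift on a narrow local distribution.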
We observed that 20 local epochs per round provided the best trade-off between convergence speed and overfitting. Runs with 30 epochs yielded only marginal gains and were therefore excluded from final reporting. Regarding the number of clients, configurations with 10 clients consistently outperformed those with 15, indicating that excessive data fragmentation degrades aggregation. This highlights the importance of balancing parallelism against data volume per client. Lastly, accuracy improvements plateaued after 15 communication rounds, with no significant changes in global metrics beyond that point.
The differences among models further reveal performance trade-offs. WideResNet50_2 showed strong generalization, likely due to its higher representational capacity, despite being relatively large. On the other hand, MnasNet and GoogLeNet achieved faster local convergence but struggled under aggregation, indicating they may lack sufficient complexity to capture visual diversity in ecological datasets. EfficientNetV2 offered high local performance but also suffered from inconsistent global alignment, possibly due to its architectural sensitivity to distribution shift.
Altogether, these results underscore the need to align architecture, training strategy, and aggregation method with the nature of the dataset and deployment environment. In federated settings where clients collect unbalanced and context-specific data, over-fragmentation and naive aggregation can undermine global model quality. Addressing this through careful configuration, along with future adoption of personalized or clustered FL approaches, will be key to successful deployment in biodiversity monitoring.
5. Discussion
Deploying and evaluating deep learning models within a federated learning environment presents numerous practical challenges, particularly when dealing with constrained computational resources and heterogeneous devices. In our case, achieving a balance between model complexity, inference speed, and accuracy was paramount. After carefully examining the performance of various architectures, WideResNet50_2 emerged as especially promising. Although it possesses a relatively large parameter count (68.8M) and file size (132 MB), it consistently delivered efficient inference times of around 0.37 s on a Raspberry Pi 4. These results confirm that, despite its size, WideResNet50_2 is practical for real-time or near-real-time biodiversity monitoring applications on edge hardware. Notably, no pruning or compression techniques were required, as the model ran smoothly within the device’s memory and computational constraints.
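A back-of-the-envelope check helps relate the reported parameter count, file size, and latency. The sketch below is only arithmetic on the figures quoted above; the helper name is hypothetical, and the bytes-per-parameter values are assumptions, since the text does not state the precision in which the weights are stored.

```python
PARAMS = 68.8e6            # parameter count reported in the text
INFERENCE_SECONDS = 0.37   # per-image latency reported for the Raspberry Pi 4


def weights_size_mib(num_params, bytes_per_param):
    """Size of the raw parameter data in MiB (file metadata is ignored)."""
    return num_params * bytes_per_param / 2**20


size_fp32 = weights_size_mib(PARAMS, 4)  # assuming 32-bit floats
size_fp16 = weights_size_mib(PARAMS, 2)  # assuming 16-bit floats
images_per_second = 1.0 / INFERENCE_SECONDS

print(f"fp32: {size_fp32:.1f} MiB, fp16: {size_fp16:.1f} MiB")
print(f"throughput: {images_per_second:.2f} images/s")
```

Under these assumptions the estimates are roughly 262 MiB at 32 bits and 131 MiB at 16 bits per parameter, so the reported 132 MB is closer to the 16-bit figure; this is a consistency check rather than a statement about how the checkpoint was saved. A latency of 0.37 s corresponds to about 2.7 images per second.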
As shown in Table 2, comparing WideResNet50_2 with other architectures such as EfficientNetV2, MnasNet, GoogLeNet, and ResNet makes the trade-offs evident. While some models offer lighter footprints or lower latency, they often fail to provide the level of predictive performance necessary for reliable species detection. Conversely, more complex models may achieve competitive accuracy but impose longer inference times and greater computational overhead, making them less suitable for distributed scenarios with limited processing capabilities. Although EfficientNetV2 has a considerably smaller parameter count, its inference time was significantly higher, likely because its compound-scaling structure and optimizations are less well suited to ARM-based processors.
Taken together, our findings indicate that selecting the WideResNet50_2 model can strike an effective balance between high accuracy and efficient inference, thereby meeting the demands of federated learning in ecological monitoring contexts. By tuning local training epochs, the number of clients, and communication rounds, we further refined performance, ensuring that the federated approach remains both scalable and robust despite data variability and resource constraints. Ultimately, these insights underscore the need for strategic model selection and parameter optimization when leveraging federated learning to support timely and data-driven decisions in biodiversity conservation efforts.
Regarding real-world deployment, the use of an FL architecture enables the training of more robust models by allowing data from different natural parks to be used locally. This approach ensures that the privacy constraints of each park are respected. Additionally, this architecture accelerates the training process, as it eliminates the need to transmit large volumes of images to a central server. In terms of performance, WideResNet50_2 demonstrated strong adaptability for bird species classification in the wild, making it appropriate for deployment in real-world scenarios. As mentioned in Section 1, the dataset used in the experiments includes a wide range of imaging conditions, which supports the effectiveness of WideResNet50_2 as a robust model capable of handling variations in lighting, viewpoints, and distances from the camera. It is important to note, however, that birds located at a considerable distance from the camera (approximately 10 to 15 m for average-sized species), as well as those that move rapidly through the camera’s field of view (such as birds flying past), are more susceptible to misclassification. These remain open challenges within the object detection research field [52].
Despite the strong performance, we recommend fine-tuning the model on data collected from the specific environment in which it will be deployed. For real-world deployments, we also recommend cameras with a minimum resolution of 1920 × 1080 and a fixed camera perspective. A fixed setup helps prevent the blurred images that result when both the camera and the bird being classified move at the same time.
6. Conclusions
In this study, a federated learning architecture was developed to perform the bird species detection task. Furthermore, the Visual WetlandBirds dataset was introduced to train the architecture to detect species present in a natural park in Valencia, Spain. An experiment was carried out using different models and hyperparameters, such as number of clients, rounds, and epochs, in order to select the best setting for the purpose of the study.
The results of the experiment showed that the WideResNet50_2 model achieves the best centralized accuracy, whereas the EfficientNetV2 model achieves the highest best client accuracy. These results illustrate the open challenge that weight aggregation poses in federated learning when training data are limited, especially when the limited data are distributed over a large number of clients. Moreover, the study revealed the importance of hyperparameter tuning in federated learning architectures: 20 local epochs gave the best time/accuracy trade-off, and 10 clients yielded higher scores than 15.
As future work, the primary objective is to expand the training dataset by collecting and annotating new videos while maintaining a diverse representation of species and behaviors. This effort aims to address one of the main limitations hindering the application of the proposed approach in real-world scenarios: the distributed and limited availability of data. Additionally, future research will explore the use of different aggregation methods, such as FedAdam [53], FedProx [54], and FedTrimmedAvg [55], to evaluate their impact on the training process. Developing novel aggregation methods tailored specifically for data-constrained environments is also identified as a promising avenue for further investigation. Finally, the integration of active learning techniques [56,57] is planned to facilitate the deployment of the system in real-world scenarios, enabling incremental data collection and training. This approach is expected to enhance model accuracy over time as additional data becomes available and is incorporated into the federated architecture.
Author Contributions
Conceptualization, J.R.-J. and D.M.-P.; methodology, J.R.-J., D.M.-P. and T.R.-G.; software, J.R.-J., D.M.-P., D.O.-P., M.B.-L. and P.R.-P.; validation, D.M.-P., J.R.-J., D.O.-P., M.B.-L., H.H.-L., T.R.-G. and A.I.; formal analysis, J.R.-J., D.M.-P., D.O.-P. and M.B.-L.; investigation, J.R.-J., D.O.-P. and J.G.-R.; resources, J.G.-R., E.S.-G., T.R.-G. and B.A.; data curation, J.R.-J., D.M.-P., H.H.-L. and A.I.; writing – original draft preparation, D.M.-P., J.R.-J., T.R.-G., D.O.-P. and M.B.-L.; writing – review and editing, J.R.-J., D.M.-P., D.O.-P., M.B.-L., E.S.-G., J.G.-R. and S.I.P.; visualization, D.M.-P., J.R.-J., D.O.-P., M.B.-L., H.H.-L. and A.I.; supervision, J.G.-R., E.S.-G., O.J., S.I.P. and B.A.; project administration, J.G.-R., E.S.-G., O.J., S.I.P. and B.A.; funding acquisition, J.G.-R., E.S.-G., O.J., S.I.P. and B.A. All authors have read and agreed to the published version of the manuscript.
Funding
We would like to thank the “A way of making Europe” European Regional Development Fund (ERDF) and MCIN/AEI/10.13039/501100011033 for supporting this work under the “CHAN-TWIN” project (grant TED2021-130890B-C21) and the HORIZON-MSCA-2021-SE-0 action (number 101086387, REMARKABLE, Rural Environmental Monitoring via ultra wide-ARea networKs And distriButed federated Learning). This work is part of the HELEADE project (TSI-100121-2024-24), funded by the Spanish Ministry of Digital Processing and by the European Union NextGeneration EU. This work has also been supported by two Spanish national and three regional grants for PhD studies, FPU21/00414, FPU22/04200, FPU23/00532, CIACIF/2021/430 and CIACIF/2022/175.
Data Availability Statement
Conflicts of Interest
The authors declare no conflicts of interest.
References
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; Singh, A., Zhu, J., Eds.; PMLR (Proceedings of Machine Learning Research). 2017; Volume 54, pp. 1273–1282. Available online: https://proceedings.mlr.press/v54/mcmahan17a.html (accessed on 28 June 2025).
- Siddique, A.A.; Alasbali, N.; Driss, M.; Boulila, W.; Alshehri, M.S.; Ahmad, J. Sustainable collaboration: Federated learning for environmentally conscious forest fire classification in green internet of things (IoT). Internet Things 2024, 25, 101013.
- O’Riordan, T. Environmental Science for Environmental Management; Longman: London, UK, 1995.
- Nichols, J.D.; Williams, B.K. Monitoring for conservation. Trends Ecol. Evol. 2006, 21, 668–673.
- Hays, G.C.; Bailey, H.; Bograd, S.J.; Bowen, W.D.; Campagna, C.; Carmichael, R.H.; Casale, P.; Chiaradia, A.; Costa, D.P.; Cuevas, E.; et al. Translating marine animal tracking data into conservation policy and management. Trends Ecol. Evol. 2019, 34, 459–473.
- Xu, J.; Glicksberg, B.S.; Su, C.; Walker, P.; Bian, J.; Wang, F. Federated learning for healthcare informatics. J. Healthc. Inform. Res. 2021, 5, 1–19.
- Byrd, D.; Polychroniadou, A. Differentially private secure multi-party computation for federated learning in financial applications. In Proceedings of the First ACM International Conference on AI in Finance, New York, NY, USA, 15–16 October 2020; pp. 1–9.
- Nguyen, D.C.; Ding, M.; Pathirana, P.N.; Seneviratne, A.; Li, J.; Poor, H.V. Federated learning for internet of things: A comprehensive survey. IEEE Commun. Surv. Tutorials 2021, 23, 1622–1658.
- Xia, Q.; Dong, S.; Peng, T. An Abnormal Traffic Detection Method for IoT Devices Based on Federated Learning and Depthwise Separable Convolutional Neural Networks. In Proceedings of the 2022 IEEE International Performance, Computing, and Communications Conference (IPCCC), Austin, TX, USA, 11–13 November 2022; pp. 352–359.
- Elouali, A.; Mora Mora, H.; Mora-Gimeno, F.J. Data transmission reduction formalization for cloud offloading-based IoT systems. J. Cloud Comput. 2023, 12, 1–12.
- Ramírez, T.; Calabuig-Barbero, E.; Mora, H.; Pujol, F.A.; Amador, S. Federated Learning for Industry 5.0: A State-of-the-Art Review. In Proceedings of the 15th International Conference on Ubiquitous Computing & Ambient Intelligence (UCAmI 2023); Bravo, J., Urzáiz, G., Eds.; Springer: Cham, Switzerland, 2023; pp. 60–66.
- Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
- Ramírez, T.; Mora, H.; Pujol, F.A.; Maciá-Lillo, A.; Jimeno-Morenilla, A. Management of heterogeneous AI-based industrial environments by means of federated adaptive-robot learning. Eur. J. Innov. Manag. 2025, 28, 50–64.
- Zhu, H.; Xu, J.; Liu, S.; Jin, Y. Federated learning on non-IID data: A survey. Neurocomputing 2021, 465, 371–390.
- Ma, X.; Zhu, J.; Lin, Z.; Chen, S.; Qin, Y. A state-of-the-art survey on solving non-IID data in Federated Learning. Future Gener. Comput. Syst. 2022, 135, 244–258.
- Kumar, D.; Shanthakumar, S.; Banerjee, M.; Hanspal, M.S. IoT Based Models in Healthy Natural Resource Management: Healthy Soils for Healthy Food Productions. In IoT-Based Models for Sustainable Environmental Management: Sustainable Environmental Management; Springer: Berlin/Heidelberg, Germany, 2024; pp. 211–242.
- Ye, M.; Fang, X.; Du, B.; Yuen, P.C.; Tao, D. Heterogeneous Federated Learning: State-of-the-art and Research Challenges. ACM Comput. Surv. 2023, 56, 79.
- Zhang, K.; Song, X.; Zhang, C.; Yu, S. Challenges and future directions of secure federated learning: A survey. Front. Comput. Sci. 2022, 16, 165817.
- Qi, P.; Chiaro, D.; Piccialli, F. Small models, big impact: A review on the power of lightweight Federated Learning. Future Gener. Comput. Syst. 2025, 162, 107484.
- Imteaj, A.; Thakker, U.; Wang, S.; Li, J.; Amini, M.H. A Survey on Federated Learning for Resource-Constrained IoT Devices. IEEE Internet Things J. 2022, 9, 1–24.
- Gao, Y.; Kim, M.; Thapa, C.; Abuadbba, A.; Zhang, Z.; Camtepe, S.; Kim, H.; Nepal, S. Evaluation and Optimization of Distributed Machine Learning Techniques for Internet of Things. IEEE Trans. Comput. 2022, 71, 2538–2552.
- Khan, L.U.; Saad, W.; Han, Z.; Hossain, E.; Hong, C.S. Federated Learning for Internet of Things: Recent Advances, Taxonomy, and Open Challenges. IEEE Commun. Surv. Tutorials 2021, 23, 1759–1799.
- Ridolfi, L.; Naseh, D.; Shinde, S.S.; Tarchi, D. Implementation and Evaluation of a Federated Learning Framework on Raspberry PI Platforms for IoT 6G Applications. Future Internet 2023, 15, 358.
- Dritsas, E.; Trigka, M. Federated Learning for IoT: A Survey of Techniques, Challenges, and Applications. J. Sens. Actuator Netw. 2025, 14, 9.
- Riedel, P.; Schick, L.; von Schwerin, R.; Reichert, M.; Schaudt, D.; Hafner, A. Comparative analysis of open-source federated learning frameworks-a literature-based survey and review. Int. J. Mach. Learn. Cybern. 2024, 15, 5257–5278.
- Qi, P.; Chiaro, D.; Guzzo, A.; Ianni, M.; Fortino, G.; Piccialli, F. Model aggregation techniques in federated learning: A comprehensive survey. Future Gener. Comput. Syst. 2024, 150, 272–293.
- Adam, M.; Baroudi, U. Federated Learning for IoT: Applications, Trends, Taxonomy, Challenges, Current Solutions, and Future Directions. IEEE Open J. Commun. Soc. 2024, 5, 7842–7877.
- Ghadi, Y.Y.; Mazhar, T.; Shah, S.F.A.; Haq, I.; Ahmad, W.; Ouahada, K.; Hamam, H. Integration of federated learning with IoT for smart cities applications, challenges, and solutions. PeerJ Comput. Sci. 2023, 9, e1657.
- Xie, Q.; Jiang, S.; Jiang, L.; Huang, Y.; Zhao, Z.; Khan, S.; Dai, W.; Liu, Z.; Wu, K. Efficiency Optimization Techniques in Privacy-Preserving Federated Learning With Homomorphic Encryption: A Brief Survey. IEEE Internet Things J. 2024, 11, 24569–24580.
- Guendouzi, B.S.; Ouchani, S.; EL Assaad, H.; EL Zaher, M. A systematic review of federated learning: Challenges, aggregation methods, and development tools. J. Netw. Comput. Appl. 2023, 220, 103714.
- Gomez Villa, A.; Salazar, A.; Vargas, F. Towards automatic wild animal monitoring: Identification of animal species in camera-trap images using very deep convolutional neural networks. Ecol. Inform. 2017, 41, 24–32.
- Petso, T.; Jamisola Jr, R.S.; Mpoeleng, D. Review on methods used for wildlife species and individual identification. Eur. J. Wildl. Res. 2022, 68, 3.
- Varghese, B.; Wang, N.; Barbhuiya, S.; Kilpatrick, P.; Nikolopoulos, D.S. Challenges and Opportunities in Edge Computing. In Proceedings of the 2016 IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, USA, 18–20 November 2016; pp. 20–26.
- Zualkernan, I.; Dhou, S.; Judas, J.; Sajun, A.R.; Gomez, B.R.; Hussain, L.A. An IoT System Using Deep Learning to Classify Camera Trap Images on the Edge. Computers 2022, 11, 13.
- Kumar, S.V.S.; Kondaveerti, H.K. A Comparative Study on Deep Learning Techniques for Bird Species Recognition. In Proceedings of the 2023 3rd International Conference on Intelligent Communication and Computational Techniques (ICCT), Jaipur, India, 19–20 January 2023; pp. 1–6.
- Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Comput. Sci. Rev. 2021, 40, 100379.
- Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 125–144.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-ResNet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, San Francisco, CA, USA, 4–9 February 2017; AAAI Press: Washington, DC, USA, 2017; pp. 4278–4284.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Tan, M.; Le, Q. Efficientnetv2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 18–24 July 2021; pp. 10096–10106.
- Stastny, J.; Munk, M.; Juranek, L. Automatic bird species recognition based on birds vocalization. EURASIP J. Audio Speech Music. Process. 2018, 2018, 19.
- Miyaguchi, A.; Cheung, A.; Gustineli, M.; Kim, A. Transfer Learning with Pseudo Multi-Label Birdcall Classification for DS@ GT BirdCLEF 2024. arXiv 2024, arXiv:2407.06291.
- Wah, C.; Branson, S.; Welinder, P.; Perona, P.; Belongie, S. The Caltech-Ucsd Birds-200-2011 Dataset. 2011. Available online: https://authors.library.caltech.edu/records/cvm3y-5hh21 (accessed on 28 June 2025).
- Rodriguez-Juan, J.; Ortiz-Perez, D.; Benavent-Lledo, M.; Mulero-Pérez, D.; Ruiz-Ponce, P.; Orihuela-Torres, A.; Garcia-Rodriguez, J.; Sebastián-González, E. Visual WetlandBirds Dataset: Bird Species Identification and Behavior Recognition in Videos. arXiv 2025, arXiv:2501.08931.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
- Terven, J.; Córdova-Esparza, D.M.; Romero-González, J.A. A comprehensive review of yolo architectures in computer vision: From yolov1 to yolov8 and yolo-nas. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716.
- Zagoruyko, S.; Komodakis, N. Wide Residual Networks. arXiv 2017, arXiv:1605.07146.
- Tan, M.; Chen, B.; Pang, R.; Vasudevan, V.; Sandler, M.; Howard, A.; Le, Q.V. MnasNet: Platform-Aware Neural Architecture Search for Mobile. arXiv 2019, arXiv:1807.11626.
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385.
- Diwan, T.; Anirudh, G.; Tembhurne, J.V. Object detection using YOLO: Challenges, architectural successors, datasets and applications. Multimed. Tools Appl. 2023, 82, 9243–9275.
- Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive Federated Optimization. arXiv 2021, arXiv:2003.00295.
- Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. arXiv 2020, arXiv:1812.06127.
- Yin, D.; Chen, Y.; Ramchandran, K.; Bartlett, P. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. arXiv 2021, arXiv:1803.01498.
- Settles, B. Active Learning Literature Survey. 2009. Available online: https://burrsettles.com/pub/settles.activelearning.pdf (accessed on 28 June 2025).
- Goetz, J.; Malik, K.; Bui, D.; Moon, S.; Liu, H.; Kumar, A. Active Federated Learning. arXiv 2019, arXiv:1909.12641.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).