FedAAR: A Novel Federated Learning Framework for Animal Activity Recognition with Wearable Sensors

Simple Summary Automated animal activity recognition has achieved great success due to the recent advances in deep learning, allowing staff to identify variations in the animal behavioural repertoire in real-time. The high performance of deep learning largely relies on the availability of big data, which inevitably brings data privacy issues when collecting a centralised dataset from different farms. Federated learning provides a promising solution to train a shared model by coordinating multiple farms (clients) without sharing their private data, in which a global server periodically aggregates local (client) gradients to update the global model. The study develops a novel federated learning framework called FedAAR to achieve automated animal activity recognition using decentralised sensor data and to address, in particular, two major challenges resulting from data heterogeneity when applying federated learning in this task. The experiments demonstrate the performance advantages of FedAAR compared to the state-of-the-art, proving the promising capability of our framework for enhancing animal activity recognition performance. This research opens new opportunities for developing animal monitoring systems using decentralised data from multiple farms without privacy leakage. Abstract Deep learning dominates automated animal activity recognition (AAR) tasks due to high performance on large-scale datasets. However, constructing centralised data across diverse farms raises data privacy issues. Federated learning (FL) provides a distributed learning solution to train a shared model by coordinating multiple farms (clients) without sharing their private data, whereas directly applying FL to AAR tasks often faces two challenges: client-drift during local training and local gradient conflicts during global aggregation. In this study, we develop a novel FL framework called FedAAR to achieve AAR with wearable sensors. Specifically, we devise a prototype-guided local update module to alleviate the client-drift issue, which introduces a global prototype as shared knowledge to force clients to learn consistent features. To reduce gradient conflicts between clients, we design a gradient-refinement-based aggregation module to eliminate conflicting components between local gradients during global aggregation, thereby improving agreement between clients. Experiments are conducted on a public dataset to verify FedAAR’s effectiveness, which consists of 87,621 two-second accelerometer and gyroscope data. The results demonstrate that FedAAR outperforms the state-of-the-art, on precision (75.23%), recall (75.17%), F1-score (74.70%), and accuracy (88.88%), respectively. The ablation experiments show FedAAR’s robustness against various factors (i.e., data sizes, communication frequency, and client numbers).


Introduction
Monitoring and assessing animal activities provide rich insights into their physical status and circumstances, as activity is one of the most critical indicators of animal health and welfare [1]. Traditionally, animal activity monitoring largely relies on direct visual and behavioural observation, which is time-consuming and labour-intensive [2]. Over the past decade, automated animal activity recognition (AAR) with wearable sensors, which allows staff to identify variations in the animal behavioural repertoire in real-time, has attracted increasing attention and achieved great success. In the sensor-based AAR systems, wearable sensors are attached to a certain part of the animal body (e.g., ear, neck, halter, back, and leg) to collect motion data (e.g., acceleration and angular velocity), which are then used for classifying animal activities (e.g., feeding, drinking, and resting) with suitable classification algorithms.
In recent years, deep learning has dominated the tasks in AAR due to the high performance achievable with the help of large-scale training datasets [3,4]. For instance, convolutional neural networks (CNNs) are widely used to automatically classify various animal activities, such as the walking and ruminating of cattle [4], the trotting and cantering of equines [2], and the eating and petting of canines [5]. However, collecting a large corpus of centralised datasets from different sources (e.g., farms) often raises data privacy issues. Federated learning (FL) has recently emerged as a distributed learning paradigm, providing an attractive solution to this problem [6][7][8]. A standard FL system iterates two steps periodically, i.e., local training in farms (clients) and global aggregation in a trustworthy centre (server), to train a global model. Specifically, during local training, each client downloads the parameters of the global model from the server-side to initialise its local model and then exploits local data to calculate local (client) gradients, which are sent to the server in turn. The server collects these local gradients and aggregates them to update the global model. Such a training mechanism enables data owners to build a shared model collaboratively without exchanging their private data, effectively promoting privacy preservation between clients [9][10][11].
Despite remarkable benefits provided by FL, directly applying FL to AAR tasks often faces two major challenges: client-drift during local training and local gradient conflicts during global aggregation, which easily increase the difficulty in model convergence and cause extreme degradation of performance [12]. First, the movement patterns of individual animals are often drawn from distinct distributions, which inevitably results in data heterogeneity between clients. Such data heterogeneity enlarges the inconsistency of learned features across clients, easily raising drift concerns between client updates since each client model is optimised towards its local objective instead of global optima during local training [13][14][15]. To address this issue, some existing methods [14][15][16][17] impose constraints on the local optimisation by exploiting a model-level regularisation term, which aims to facilitate all local models to approach consistent views. For instance, FedProx restricted local model parameters to be close to global parameters by adding a proximal term in the local training [16]. SCAFFOLD corrected for client-drift by using control variates to overcome gradient dissimilarity in local training [15]. Both FedLSD and FedCAD regarded the distributed global model as a teacher and distilled its predictions on local data to guide local optimisation [13,17]. However, these methods only emphasised constraints on local models instead of directly forcing clients to learn consistent features, consequently yielding sub-optimal performance. Second, gradients among clients often possess inconsistent directions and even have conflicting components due to the inconsistency between local optimisation objectives in the context of data heterogeneity. Directly aggregating all local gradients in standard FL methods easily leads to mutual interferences among clients' knowledge, further hampering the process of model convergence and exacerbating the risk of model divergence.
To alleviate this problem, most existing works [18][19][20][21] attempted to modify the global aggregation mechanism by dynamically adjusting aggregation weights to local gradients under different criteria. For example, IDA assigned each client weight based on the inverse distance of its gradient to the averaged gradient across all clients [21]. Precision-weighted FL aggregated local gradients by averaging gradients by the inverse of their estimated variance [19]. ABAVG combined gradients over clients by the accuracies of local models on the validation set at the server-side [18]. However, aggregating local gradients merely by changing their weights still cannot drastically remove conflicting components in gradients and may even lose important information in gradients.
In this study, we propose a novel FL framework, namely FedAAR, to achieve automated AAR by uniting decentralised data while tackling the abovementioned issues. To alleviate the client-drift issue, we design a prototype-guided local update (PLU) module, in which we introduce a globally shared prototype (classwise feature representation) as shared knowledge to constrain local optimisation. To reduce conflicts between local gradients, we devise a new gradient-refinement-based aggregation (GRA) module to constantly recalibrate local gradients throughout the training process. The proposed FedAAR is trained based on a public dataset [22], and its generalisation performance is compared with that of the state-of-the-art FL strategies and with the centralised learning algorithm. In summary, the main contributions of this study are as follows:

•
We propose a novel FL framework called FedAAR to automatically recognise animal activities based on distributed data. This gives it the considerable potential that multiple farms jointly train a shared model using their private dataset while protecting data privacy and ownership. To the best of our knowledge, this is the first time the FL method has been explored in automated AAR across multiple data sources.

•
To alleviate client-drift issues, we devise a PLU module to replace the traditional local training process. The PLU module forces all clients to learn consistent feature knowledge by imposing a global prototype guidance constraint to local optimisation, further reducing the divergence between client updates. • Different from existing aggregation mechanisms, we design a GRA module to reduce conflicts among local gradients. The GRA module eliminates conflicting components between local gradients during global aggregation, effectively guaranteeing that all refined local gradients point in a positive direction to improve the agreement among clients.

•
Our experimental results demonstrate that our proposed FedAAR outperforms the state-of-the-art FL strategies and exhibits performance close to that of the centralised learning algorithm. This proves the promising capability of our method to enhance AAR performance without privacy leakage. We also validate the performance advantages of our approach compared to the baseline in different practical scenarios (i.e., local data sizes, communication frequency, and client numbers), providing rich insights into the appropriate future applications of our method.

Data Description
The dataset used in this study is a public dataset created by [22]. This dataset is a centralised dataset comprising 87,621 two-second samples that were collected from six horses with neck-attached IMUs. The sampling rate was set to 100 Hz for both the tri-axial accelerometer and gyroscope and 12 Hz for the tri-axial magnetometer. Table 1 illustrates the data distribution of the dataset. Six activities (i.e., eating, galloping, standing, trotting, walking-natural, and walking-rider) were registered, and the number of activity samples of the six individuals (i.e., Happy, Zafir, Driekus, Galoway, Patron, and Bacardi) was 23,625, 11,071, 10,127, 24,602, 12,849, and 5347, respectively. As in our previous study set [23], we exploited the tri-axial motion data from the accelerometer and gyroscope as our input samples, which were normalised before being input into the network.  Happy  5063  696  1186  7038  746  8896  23,625  Zafir  1091  835  347  3559  161  5078  11,071  Driekus  2496  323  341  2673  270  4024  10,127  Galoway  4331  1043  1750  6423  1402  9653  24,602  Patron  1951  714  1244  3402  388  5150  12,849  Bacardi  1116  328  245  1981  360 1317 5347

Preliminaries for Federated Learning
A federated learning (FL) system coordinates K clients to collectively train a global model while keeping their data stored locally, effectively reducing the potential for violating data privacy. In each client k, the local dataset is sampled from a distribution D k , where y k i ∈ {1, · · · , C} corresponds to the ground-truth label of the data instance x k i , C is the number of label categories, and N k is the data number of the k-th client. The training of the FL system mainly consists of T communication rounds between a global server and K clients, with the detailed procedure of each communication round divided into the following three steps: Step 1. All clients synchronously download a global model w global from a global server.
Step 2. Each client k uses the global model w global to initialise its local model w k (i.e., w k initial = w global ) and then conducts local training for E epochs, i.e., minimising the local optimisation objective by using a gradient descent algorithm: where L k CE denotes the cross-entropy loss function of the k-th client. After local training, we can obtain the updated local model w k updated and local gradient g k (i.e., the difference between the updated local model w k updated and the initial local model w k initial ).
Step 3. Each client k then uploads its local gradient g k to the global server. These local from all clients are aggregated by directly averaging to generate global gradients g global : Afterwards, the global gradient g global is used to further update the original global model w global : The updated global model w global will be sent to all clients again in the next communication round (Step 1). The above steps are performed repeatedly until the global model achieves convergence.
The implementation of standard FL (e.g., FedAvg) heavily relies on the assumption that no data heterogeneity issues occur across clients (i.e., data between clients follow a uniform distribution). However, this assumption does not hold in AAR tasks because the discrepancy of movement patterns among individual animals often results in data heterogeneity, thus giving rise to detrimental effects on the training of FL. Specifically, data heterogeneity across clients inevitably enlarges the inconsistency of learned features between clients, thereby inducing drift between client updates as each client model is optimised toward its local objective instead of global optima [17]. In addition, local gradients In this study, we propose a novel FL framework called FedAAR to achieve automated AAR in the context of data heterogeneity between clients, as illustrated in Figure 1 Hence, there is a need to reformulate the above optimisation process to adapt to AAR tasks.

Overview
In this study, we propose a novel FL framework called FedAAR to achieve automated AAR in the context of data heterogeneity between clients, as illustrated in Figure  1   communication round includes three steps as follows. 1 Each client k first downloads the same global model w global and global prototype P global from a global server simultaneously. 2 Based on the local dataset D k in each client k, the local model w k is initialised and then trained in a prototype-guided local update (PLU) module ( Figure 1a). After local training, the updated local model w k updated and local prototype P k updated can be obtained. The local gradient g k can be calculated as the difference between the updated local model w k updated and the initial local model (i.e., the downloaded global model w global ). 3 All local gradients g k K k=1 and local prototypes P k

Prototype-Guided Local Update
To alleviate client-drift issues in the local training, we devised a PLU module (Figure 1a), which introduces a global prototype to serve as the shared feature knowledge to guide local optimization.
First, each client k simultaneously downloads the same global model w global and global prototype P global from the server. Based on the local dataset D k in each client k, the local model w k is initialised as the global model w global and then trained in the proposed PLU module. Specifically, given batchwise samples we first adopted the feature extractor to extract feature representation e k c,i , where P k c is the mean value of feature representations of samples in class c, i.e., Herein, we empirically used the network that removed the last fully connected layer as the feature extractor [24].
Inspired by prototype learning, in which gathering the prototypes across heterogeneous datasets enables the incorporation of feature representations over various data distributions [24,25], we brought a global prototype P global c C c=1 (P global ) aggregated across clients as consistent feature-level views to guide local training. We propose a new prototype guidance regularisation (PGReg) loss L k PGReg as follows: , effectively keeping all clients as having a consistent direction of feature learning. Thus, the total loss function can be reformulated as the linear combination of the original classification loss L k CE and PGReg loss L k PGReg : where λ is the weight coefficient of PGReg loss and its value equals 0 in the initial state. Then, we conducted the local training (see Equation (1)) to obtain the updated local model where γ c is an adaptively balanced coefficient controlling the updating degree of the global prototype and P c = ∑ K k=1 n k c P k c,updated / ∑ K k=1 n k c , where n k c represents the number of correctly classified samples belonging to the c-th category in client k. Considering that the updated global prototype may be transferred close to each other due to noise components, inducing similarity increases of inter-class feature vectors [26], we modulated γ c according to the intra-class and inter-class distance between local prototypes and the original global prototype: ), the contribution of P c on the update process (Equation (7)) should be lower than that of P global c . Note that when the value of P global c is empty at the early training phase, we directly put the updated averaged prototype P c into global prototype P global c .

Gradient-Refinement-Based Aggregation
To reduce conflicts among local gradients during global aggregation, we designed a new GRA module (Figure 1b), which eliminates the conflicting components between local gradients, ensuring all refined local gradients point in a positive direction to improve the agreement across clients.
Given a set of local gradients g k K k=1 that are uploaded to the global server, we first characterised any two of these gradients as conflicting when their directions point away from one another (i.e., having a negative cosine similarity). Herein, we aimed to reconstruct consensus vectors by refining the conflicting local gradients and keeping the non-conflicting local gradients invariant. Figure 2 visualises the main step of the local gradient refinement process. Specifically, suppose g i is the local gradient at the i-th client and g j is selected in a random order from the rest of the local gradients, where i ∈ {1, 2, . . . , K} and j ∈ {1, 2, . . . , i − 1, i + 1, . . . , K}. The cosine similarity between g i and g j can be denoted as cos θ i,j , where θ i,j is the angle between g i and g j . As shown in Figure 2a, if g i conflicts with g j (i.e., cosine similarity cos θ i,j < 0), we remove the component of g i in the direction fully opposite that of g j and alter g i by its projection g i onto the normal plane of g j : If g i and g j are not in conflict (i.e., cosine similarity cos θ i,j > 0), we retain the original local gradient g i as unchanged (i.e., g i = g i ), as shown in Figure 2b. Afterward, the updated g i is further selectively updated according to the condition of whether there are conflicting components compared to other local gradients. This process is repeated until all of the local gradients are compared.
is a collection of refined local gradients, we then aggregate them using Equations (2) and (3)  Supposing is a collection of refined local gradients, we then aggregate them using Equations (2) and (3) to further update the global model .

Evaluation Methods
The precision, recall, F1-score, and accuracy were measured to indicate the comprehensive performance of the classification model. They are defined as follows: where TP, FP, TN, and FN are the number of true positives, false positives, true negatives, and false negatives, respectively.

Implementation Details
In our previous work, we established a cross-modality interaction network (CMI-Net) for equine activity recognition based on accelerometer and gyroscope data [23]. The CMI-Net consists of a dual CNN trunk architecture and a joint cross-modality interaction module, effectively improving the classification performance for equine activities. Herein, we used CMI-Net as the global classification model to achieve our proposed FedAAR.
During training, we used softmax cross-entropy loss with L2 regularisation (a weight decay of 0. 15). An Adam optimiser with an initial learning rate of 5 × 10 −5 was used, and Here, θ i,j represents the angle between two local gradients g i and g j and g i represents the refined local gradients in client i. (a) When gradients g i and g j are conflicting (i.e., cos θ i,j < 0), the gradient g i is replaced with its projection g i onto the normal plane of g j . (b) When gradients g i and g j are non-conflicting (i.e., cos θ i,j > 0), gradient g i is unchanged (i.e., g i = g i ).

Evaluation Methods
The precision, recall, F1-score, and accuracy were measured to indicate the comprehensive performance of the classification model. They are defined as follows: where TP, FP, TN, and FN are the number of true positives, false positives, true negatives, and false negatives, respectively.

Implementation Details
In our previous work, we established a cross-modality interaction network (CMI-Net) for equine activity recognition based on accelerometer and gyroscope data [23]. The CMI-Net consists of a dual CNN trunk architecture and a joint cross-modality interaction module, effectively improving the classification performance for equine activities. Herein, we used CMI-Net as the global classification model to achieve our proposed FedAAR.
During training, we used softmax cross-entropy loss with L2 regularisation (a weight decay of 0. 15). An Adam optimiser with an initial learning rate of 5 × 10 −5 was used, and the learning rate decreased by 0.1 times every 20 epochs. The communication rounds T and batch size were set to 100 and 256, respectively. If not specified, the value of local training epochs E was set to 1, and the weighting coefficient λ in PLU was set to 0.05 by default. The global model was initialised randomly before being downloaded to clients in the first round. Over the communication rounds, the best global model with the highest testing accuracy was saved as the optimal model. To verify the model's generalisation abilities, we performed the leave-one-out-based validation method. Specifically, we separately ran three times in each experiment. In each run (time), we randomly selected five horses from the original six horses and individually assigned these five horses to five farms (clients). Each of these five horses' data serves as each client's data, and all client data were used as training data to train a shared global model collectively. The data from the remaining horse (the sixth horse) were then used as the test data to verify the performance of the trained global model. The final test result of the model performance is presented in the format mean ± std from the three runs. This kind of data allocation can well simulate practical scenarios, i.e., data heterogeneity across farms, since the movement patterns of individual animals are often drawn from distinct distributions. All experiments were executed using the PyTorch framework on an NVIDIA Tesla V100 GPU. The source code is available at https://github.com/Max-1234-hub/FedAAR (accessed on 20 June 2022).

Results and Discussion
Overall, the experimental results demonstrate that our proposed FedAAR outperforms the state-of-the-art FL strategies from both quantitative and qualitative perspectives while exhibiting performance close to that of the centralised learning algorithm. Ablation studies were then carried out to evaluate the effectiveness of the PLU and GRA module on the classification capability. In addition, a comprehensive investigation of FedAAR's performance was conducted at different levels of three practical conditions (i.e., dataset sizes on local clients, communication frequency between local clients and the global server, and client numbers). The experimental results of FedAAR were compared with those of its corresponding baseline, further validating the performance advantages of our method. In the end, some possible future research directions are proposed. The details are described as follows.

Comparisons with State-of-the-Art Methods
Quantitative comparison: We compared the performance of our proposed FedAAR with the state-of-the-art FL approaches (i.e., FedAvg, FedPorx, IDA, SiloBN, FedBN, and precision-weighted FL) and with centralised learning. As illustrated in Table 2, our proposed framework outperformed all of the selected state-of-the-art FL methods, with the highest average values of 75.23%, 75.17%, 74.70%, and 88.88% in precision, recall, F1-score, and accuracy, respectively. These results demonstrate the promising capabilities of FedAAR for animal behavioural classification. In particular, compared with the precision-weighted FL [19], which obtains relatively good performance among the selected state-of-the-art FL methods, our proposed approach achieves remarkable increments of 3.75%, 9.39%, 8.34%, and 4.22% in the average values of the precision, recall, F1-score, and accuracy, respectively. This can be ascribed to the ability of our architecture to effectively alleviate client-drift concerns in local training and conflicts of local gradients during global aggregation. Compared with centralised learning, which provides the upper bounds, the performance of our method is close, with 3.64%, 3.26%, and 3.49% lower average values of the recall, F1-score, and accuracy, respectively. This further reveals the favourable performance of our method. It is also worth noting that the proposed approach demonstrates smaller variances than the state-of-the-art works, with 1.01%, 3.92%, 2.49%, and 1.36% variance in the precision, recall, F1-score, and accuracy, respectively, indicating the good stability and robustness of our approach. Table 2. Comparative results (mean ± std) of our proposed FedAAR with state-of-the-art federated learning (FL) methods. The best two results for each metric are highlighted in bold.

Method Precision (%) Recall (%) F1-Score (%) Accuracy (%)
Centralised learning [23] 83. 34  Qualitative comparison: To qualitatively verify our proposed approach, we visualise the feature vectors of the test set before the last fully connected layer within FedAAR and the state-of-the-art FL models, with the help of t-distributed stochastic neighbour embedding (t-SNE) [29]. As illustrated in Figure 3, the two-dimensional embeddings can reflect the distribution of the network features in the feature space and indicate the generalisation ability of models, in which each point corresponds to a sample and different colours represent different category labels (ground-truth). Better generalisation means that the feature points of samples belonging to the same class cluster closer to each other, whereas the points between different classes are located far from each other. From the embedding visualisation, we can observe that the proposed FedAAR displays more compact clusters within the same categories and larger distances between different categories compared to the selected state-of-the-art methods. This reflects the success of our approach in improving the consistency of update directions across clients from both the local optimisation and global aggregation perspectives, which is beneficial to the generalisation performance promotion of the global model.

Method Precision (%)
Recall (%) F1-Score (%) Accuracy (%) Centralised learning [23] 83. 34  Qualitative comparison: To qualitatively verify our proposed approach, we visualise the feature vectors of the test set before the last fully connected layer within FedAAR and the state-of-the-art FL models, with the help of t-distributed stochastic neighbour embedding (t-SNE) [29]. As illustrated in Figure 3, the two-dimensional embeddings can reflect the distribution of the network features in the feature space and indicate the generalisation ability of models, in which each point corresponds to a sample and different colours represent different category labels (ground-truth). Better generalisation means that the feature points of samples belonging to the same class cluster closer to each other, whereas the points between different classes are located far from each other. From the embedding visualisation, we can observe that the proposed FedAAR displays more compact clusters within the same categories and larger distances between different categories compared to the selected state-of-the-art methods. This reflects the success of our approach in improving the consistency of update directions across clients from both the local optimisation and global aggregation perspectives, which is beneficial to the generalisation performance promotion of the global model.

Evaluation of PLU and GRA Module
Quantitative analysis: To investigate the effectiveness of PLU and GRA on classification performance, we designed four different experimental settings as follows. (1) Baseline: we applied FedAvg as our baseline for the ablative comparison; (2) Baseline + PLU: we used the PLU module instead of the original optimisation process in the baseline during local training; (3) Baseline + GRA: we replaced the original weighted average mechanism in the baseline with our proposed GRA module during global aggregation; (4) FedAAR: we used our proposed framework involving both PLU and GRA modules simultaneously. The quantitative results are shown in Table 3. It is remarkable that the two modules individually yield desirable performance improvements over the baseline, which proves that each of the PLU and GRA modules plays an important role in AAR tasks with data heterogeneity issues. In particular, the GRA module contributes to the tremendous performance improvements, with increases of 3.07%, 9.23%, 8.34%, and 3.47% in the average values of the precision, recall, F1-score, and accuracy, respectively. The variances also significantly decline, by 3.47%, 6.85%, 7.63%, and 2.27% in the precision, recall, F1-score, and accuracy, respectively, when the GRA module is used separately. These experimental results validate the significance of the GRA module in the model's performance and robustness improvements. The inclusion of the PLU module in addition to the GRA module enables all clients to possess consistent guidance directions of feature learning and further obtain relative gains of 1.06%, 0.98%, 0.96%, and 1.10% in the average values of the precision, recall, F1-score, and accuracy, respectively. Table 3. Evaluation results (mean ± std) of the GRA and PLU modules on classification performance. The best results for each metric are highlighted in bold.

Method
Precision (%) Recall (%) F1-Score (%) Accuracy (%) Analysis of the hyper-parameter λ in PLU: The hyper-parameter λ in Equation (6) represents the weight of newly added PGReg loss, corresponding to the constraint degree of global knowledge on local training. We conducted experiments to evaluate the performance of our approach with different λ values (i.e., 0.01, 0.03, 0.05, 0.07, and 0.09), with the results shown in Table 4. The FedAAR achieves clearly the best average values in the recall and F1-score when λ was set to 0.05, while obtaining a performance in the precision and accuracy comparable to the model with λ set to 0.09. Although the precision and accuracy arrive at the highest average values when λ was set to 0.09, the model exhibits a poor recall and F1-score. In addition, the average recall and F1-score values first increase and then decrease as λ varied from 0.01 to 0.09, which illuminates the likely benefit of properly choosing the value of λ for the improvement of overall classification performance. Visualisation of refinements in GRA: To provide further insights into the enormous contributions of the GRA module, we visualise the counts of gradient refinement operations during the training process over three runs in Figure 4. It can be observed that various numbers of gradient modulation operations occur in each communication round, which confirms that conflicts among local gradients arise continuously during the training process. In addition, this observation implies that our framework can constantly and steadily recalibrate the local gradients across clients, effectively enhancing the model's performance. This finding also reinforces the suitability of the proposed GRA module for AAR tasks in the context of data heterogeneity. To provide further insights into the enormous contributions of the GRA module, we visualise the counts of gradient refinement operations during the training process over three runs in Figure 4. It can be observed that various numbers of gradient modulation operations occur in each communication round, which confirms that conflicts among local gradients arise continuously during the training process. In addition, this observation implies that our framework can constantly and steadily recalibrate the local gradients across clients, effectively enhancing the model's performance. This finding also reinforces the suitability of the proposed GRA module for AAR tasks in the context of data heterogeneity.
(a) (b) (c) Figure 4. Counts of refinement operations during the training process over three runs from (a-c).

Analysis of Local Dataset Size
To observe the behaviour of our proposed method over different data capacities, we present in Figure 5a the average testing accuracy of FedAAR and the baseline under various local dataset percentages (i.e., 20%, 40%, 60%, 80%, and 100%). The test accuracies of both FedAAR and the baseline decrease gradually as the number of training samples reduces, but the accuracy of FedAAR continues to exceed that of the baseline, validating the performance advantages of our method under scenarios with a small amount of data. In addition, FedAAR conducted on 60% local data still obtains higher accuracy than the baseline performed on full-sized local data, which reveals that our approach can effectively mitigate the performance degradation due to the reduced data amount.

Analysis of Communication Frequency
The communication frequency between local clients and the global server may influence learning behaviour. We decreased the communication frequency by increasing the local updating epochs and present in Figure 5b the average testing accuracy of FedAAR and the baseline on various local updating epochs (i.e., E = 1, 2, 4, 8, 16). As expected [28], both FedAAR and the baseline achieve higher testing accuracy under smaller local updating epochs, because aggregating at lower frequencies (i.e., larger local updating epochs) easily results in the models' divergence, especially in the early training stages [6]. Notably, FedAAR with a local epoch of 16 still exhibits higher accuracy than the baseline with one local epoch, demonstrating the superiority and reliability of our approach.

Analysis of Local Dataset Size
To observe the behaviour of our proposed method over different data capacities, we present in Figure 5a the average testing accuracy of FedAAR and the baseline under various local dataset percentages (i.e., 20%, 40%, 60%, 80%, and 100%). The test accuracies of both FedAAR and the baseline decrease gradually as the number of training samples reduces, but the accuracy of FedAAR continues to exceed that of the baseline, validating the performance advantages of our method under scenarios with a small amount of data. In addition, FedAAR conducted on 60% local data still obtains higher accuracy than the baseline performed on full-sized local data, which reveals that our approach can effectively mitigate the performance degradation due to the reduced data amount.

Analysis of Client Numbers
A larger number of clients may bring more conflicts among local gradients, which poses a great challenge to the practical application of FL. To further verify the performance advantages of FedAAR compared to the baseline under scenarios with more clients, we simulated the situations with varying client numbers by redistributing the original data based on the basic setting with five clients (see Section 2.5). Specifically, we separately parcelled each of five horse datasets into 2, 3, 4, 5, and 6 smaller ones, each serving as the training data of a single client, thus forming five settings with total client numbers

Analysis of Communication Frequency
The communication frequency between local clients and the global server may influence learning behaviour. We decreased the communication frequency by increasing the local updating epochs and present in Figure 5b the average testing accuracy of FedAAR and the baseline on various local updating epochs (i.e., E = 1, 2, 4, 8, 16). As expected [28], both FedAAR and the baseline achieve higher testing accuracy under smaller local updating epochs, because aggregating at lower frequencies (i.e., larger local updating epochs) easily results in the models' divergence, especially in the early training stages [6]. Notably, FedAAR with a local epoch of 16 still exhibits higher accuracy than the baseline with one local epoch, demonstrating the superiority and reliability of our approach.

Analysis of Client Numbers
A larger number of clients may bring more conflicts among local gradients, which poses a great challenge to the practical application of FL. To further verify the performance advantages of FedAAR compared to the baseline under scenarios with more clients, we simulated the situations with varying client numbers by redistributing the original data based on the basic setting with five clients (see Section 2.5). Specifically, we separately parcelled each of five horse datasets into 2, 3, 4, 5, and 6 smaller ones, each serving as the training data of a single client, thus forming five settings with total client numbers of 10, 15, 20, 25, and 30, respectively. The data from the remaining horse were still used as the test data to validate our method's performance. As shown in Figure 6, the testing accuracy consistently decreases as client numbers increase, but FedAAR drops more slowly than the baseline, indicating the robustness and scalability of our approach under scenarios with more clients.

Analysis of Client Numbers
A larger number of clients may bring more conflicts among local gradients, which poses a great challenge to the practical application of FL. To further verify the performance advantages of FedAAR compared to the baseline under scenarios with more clients, we simulated the situations with varying client numbers by redistributing the original data based on the basic setting with five clients (see Section 2.5). Specifically, we separately parcelled each of five horse datasets into 2, 3, 4, 5, and 6 smaller ones, each serving as the training data of a single client, thus forming five settings with total client numbers of 10, 15, 20, 25, and 30, respectively. The data from the remaining horse were still used as the test data to validate our method's performance. As shown in Figure 6, the testing accuracy consistently decreases as client numbers increase, but FedAAR drops more slowly than the baseline, indicating the robustness and scalability of our approach under scenarios with more clients.

Limitations and Implications
Due to its privacy-preserving nature, FL has been increasingly adapted to various applications, including mobile edge devices, industrial engineering, and health care [30][31][32]. As far as we know, no previous work has developed FL frameworks tailored for automated AAR applications based on decentralised data. Our method is the first to exploit FL to achieve animal behavioural recognition using distributed data without privacy leakage, opening up new opportunities for developing animal monitoring systems with strong robustness and generalisation capabilities.
Our method additionally requires the transmission of prototypes between clients and the server, increasing the dissemination of client information and communication overhead. However, this does not raise privacy concerns because prototypes only represent

Limitations and Implications
Due to its privacy-preserving nature, FL has been increasingly adapted to various applications, including mobile edge devices, industrial engineering, and health care [30][31][32]. As far as we know, no previous work has developed FL frameworks tailored for automated AAR applications based on decentralised data. Our method is the first to exploit FL to achieve animal behavioural recognition using distributed data without privacy leakage, opening up new opportunities for developing animal monitoring systems with strong robustness and generalisation capabilities.
Our method additionally requires the transmission of prototypes between clients and the server, increasing the dissemination of client information and communication overhead. However, this does not raise privacy concerns because prototypes only represent the average statistics across all already compressed local features [33]. In addition, the transmitted prototypes between the server and each client occupied 3 KB, which is only 2.35% of the communication cost of model parameters/gradients. This further reinforces the suitability of our proposed algorithm in practice.
Our method achieves automated AAR based on distributed data in the context of data heterogeneity. The major limitation is that our approach remains at the experimental level. To simulate practical scenarios, i.e., data heterogeneity across farms (clients), we assigned the horse dataset to multiple clients according to individuals since the movement patterns of individual animals are often drawn from distinct distributions. This kind of data allocation refers to some previous works [6,15,34], which generally divide a single centralised dataset into small ones. In addition, we separately selected one horse dataset (excluded in the training set) as the test set to test the trained model's performance in each run of the experiment, which effectively promotes the generalisation capabilities of the model. In the future, we will collect more data from different farms and further verify the effectiveness of our proposed FL strategy.
Considering real-world applications, it is impossible for the client data to be fully labelled due to the time-consuming and costly process of data annotation. Thus, promising solutions are required for integrating FL with potential techniques (e.g., semi-supervised learning) for exploiting the unlabelled data sufficiently. In addition, our method can be extended to AAR tasks with other forms of data (e.g., visual) and different kinds of production management systems that need to use cross-silo data, further promoting the development of agriculture.

Conclusions
In this study, we developed a novel FL framework called FedAAR involving a PLU module and a GRA module to achieve automated AAR by uniting decentralised sensor data while avoiding privacy leakage. The PLU module forces all clients to learn consistent classwise feature representations in local training, effectively reducing drift among client updates. The GRA module eliminates the conflicting components between local gradients during global aggregation, which ensures that all refined local gradients point in a positive direction to improve the agreement among clients. The experimental results reveal that our approach outperforms the state-of-the-art FL methods and achieves performance that is close to that of the centralised learning algorithm. Ablation studies further illuminated the effectiveness of the PLU and GRA modules. In addition, comparative analyses of the performance of FedAAR and the baseline at different levels of three practical conditions (i.e., local data sizes, communication frequency, and client numbers) confirm the performance advantages of our algorithm. These analyses also provide rich insights into the appropriate future applications of our method.

Data Availability Statement:
No new data were created nor analysed in this study. Data sharing is not applicable to this article.