Federated Learning Attacks Revisited: A Critical Discussion of Gaps, Assumptions, and Evaluation Setups

Deep learning pervades heavy data-driven disciplines in research and development. The Internet of Things and sensor systems, which enable smart environments and services, are settings where deep learning can provide invaluable utility. However, the data in these systems are very often directly or indirectly related to people, which raises privacy concerns. Federated learning (FL) mitigates some of these concerns and empowers deep learning in sensor-driven environments by enabling multiple entities to collaboratively train a machine learning model without sharing their data. Nevertheless, a number of works in the literature propose attacks that can manipulate the model and disclose information about the training data in FL. As a result, there has been a growing belief that FL is highly vulnerable to severe attacks. Although these attacks do indeed highlight security and privacy risks in FL, some of them may not be as effective in production deployment because they are feasible only given special—sometimes impractical—assumptions. In this paper, we investigate this issue by conducting a quantitative analysis of the attacks against FL and their evaluation settings in 48 papers. This analysis is the first of its kind to reveal several research gaps with regard to the types and architectures of target models. Additionally, the quantitative analysis allows us to highlight unrealistic assumptions in some attacks related to the hyper-parameters of the model and data distribution. Furthermore, we identify fallacies in the evaluation of attacks which raise questions about the generalizability of the conclusions. As a remedy, we propose a set of recommendations to promote adequate evaluations.


Introduction
Machine Learning (ML) is an approach to imitate the human way of learning. With the help of training data, an ML instance is able to learn and recognise patterns in the data with improved accuracy over time. This so-called model can later be applied to other unknown sets of data to make classifications or predictions without requiring human interaction. ML is increasingly used to improve services in many domains, e.g., natural language processing and image processing [7]. The conventional training approach of ML models is centralized, where large datasets are collected from users and processed by central service providers. These datasets can contain sensitive user information (e.g., health metrics, geographic locations). Therefore, this conventional approach raises users' concerns about their privacy [90]. Federated Learning (FL) is an emerging ML setting that enables multiple entities (clients) to train a joint model while keeping their data locally on their devices. This setting still involves a central server that coordinates the training process by collecting and aggregating model updates from clients to obtain one global model. As the raw data of the clients do not leave their devices, FL is believed to provide privacy benefits [87]. However, the distributed nature of the training process in FL has created a new attack surface, where potentially malicious clients can actively participate and adversely affect the training process. The attacks might target either the integrity of the model (e.g., poisoning attacks [9], [13]) or the confidentiality of the client training data (e.g., model inversion attacks [54], [161]). Recently, researchers have been extensively investigating vulnerabilities in FL and proposing potential attacks. Consequently, the number of publications on attacks against FL has increased remarkably; this increase raises serious concerns about the robustness and privacy in FL.
Some strong claims even exacerbate these concerns by stating that "federated learning is fundamentally broken" [54], or that some attacks are reaching 100% accuracy [9].
However, a number of these attacks are applicable only under specific conditions and assumptions. For example, some attacks are effective only when a batch size of 1 is used for training a Neural Network (NN) model (e.g., [159]), or when a special distribution of data among clients is applied (e.g., [54]). In many cases, such assumptions do not hold in real-world deployments. Thus, the applicability of such attacks is questionable. In addition, several attacks are evaluated with limited or impractical setups. For instance, some attacks are evaluated using oversimplified datasets (e.g., [157]) or simplified NN models (e.g., [33]). This in turn affects the generalizability of the experiments, results, and conclusions. A recent work by Shejwalkar et al. [115] fueled this discussion by demonstrating that, contrary to the common belief, FL is highly robust against several attacks in the literature, even without applying any defenses, under practical considerations. Considering the aforementioned issues, the severity of the vulnerabilities discussed in the literature needs further study. In particular, the exploitability of these vulnerabilities needs to be considered to assess their severity under realistic setups. Therefore, it is essential to closely and systematically investigate such issues and show their implications. As a result, a more comprehensive and realistic view of the severity of threats in FL can be obtained.
In this paper, we conduct a quantitative analysis of attacks against FL via a systematic mapping study (SMS). We first identify research trends that indicate the growth of the field, the properties of the research community such as affiliations, and targeted publication venues. Then, we provide a structured overview of the attacks with two classification schemes that are based on: (1) the properties of the attacks, (2) the choice of experimental setups used to evaluate the attacks. We analyze the distribution of publications among the defined attack classes and derive the foci and gaps in the research landscape. Next, we highlight several special assumptions made in some of the works and their implications on the applicability of the attacks. Finally, we identify common fallacies in the evaluation setups and the impact of these fallacies on the generalizability of the results. Our work shows that each of the studied papers makes at least one of the special assumptions or suffers from one fallacy. Notably, several fallacies affect the majority of the papers. Our contributions can be summarized as follows.
• Providing a comprehensive quantitative analysis of all the 48 publications on attacks against FL.
• Identifying three research gaps that raise questions about the effectiveness of attacks against specific ML functions (e.g., clustering, ranking) and models (e.g., recurrent neural networks, autoencoders).
• Highlighting three recurring assumptions that limit the applicability of the proposed attacks to real-world deployments. These assumptions are related to the hyper-parameters of the ML model, the fraction of malicious clients, and data distribution.
• Identifying six fallacies in the evaluation practices that can cause overestimation of the attacks' effectiveness. The main fallacies stem from the choice of: datasets, models, and the size of the client population.
In addition, we also propose a set of recommendations to mitigate these fallacies.
Remark. Our work does not negate the existence of vulnerabilities in FL, and thus the security and privacy risks. It also does not undermine the importance of the research on attacks in FL, rather it is an attempt to help researchers clarify the scope of these attacks by reflecting on the assumptions and evaluation practices.
The remainder of this paper is organized as follows. In Section 2, we present a background on NN and FL, a summary of related review studies, and an introduction to SMSs. Then, in Section 3 we elaborate on the methodology of our study. In Section 4, we present the results of the mapping process. Subsequently, we further analyze our results and discuss their implications in Section 5. Finally, we conclude the paper in Section 6.

Background
In this section, we provide the foundations of Neural Networks (NNs) as one of the dominant models in the literature of our field. Then, we introduce FL along with some core insights about FL's security and privacy-related issues. Next, we present a summary of multiple review studies that elaborate on these issues. Lastly, we introduce the objectives and methodology of systematic mapping studies (SMSs).

Neural networks
NNs are a subset of ML algorithms. An NN comprises layers of nodes (neurons), including an input layer, one or more hidden layers, and an output layer. The neurons are connected by links associated with weights W. The NN model can be used for a variety of tasks, e.g., regression analysis, classification, and clustering. In the case of classification, for example, the task of the model f is to approximate the function f(x) = y, where y is the class label of the data sample x. To fulfill this task, the model is trained by optimizing the weights W using a loss function L and training data consisting of input data x_i and corresponding labels y_i in order to solve [41]

$$\min_{W} \sum_{i} L\big(f(x_i; W),\, y_i\big).$$

Minimizing the loss function can be achieved by applying one of the optimization algorithms. Gradient descent is one of the basic optimization algorithms for finding a local minimum of a differentiable function. This algorithm is based on the gradients ∇W, which are the derivatives of the loss function w.r.t. the model weights W. The core idea is to update the weights through repeated steps in the opposite direction of the gradient, because this is the direction of steepest descent:

$$W \leftarrow W - \eta \nabla W,$$

where η is the learning rate, which defines how quickly a model updates its weights.
Multilayer Perceptron (MLP). Known also as feedforward network. This is a general-purpose network that contains fully connected layers of neurons.
Convolutional Neural Network (CNN). This architecture is mainly used to detect patterns in the data. It contains convolutional, pooling, and fully connected layers. In the convolutional and pooling layers, a neuron receives input only from a limited number of the neurons in the previous layer.
Recurrent Neural Network (RNN). This network operates on sequential data and time series data. It is distinguished by its recurrent structure, which memorizes information across layers. That is achieved by neurons connected in a loop, i.e., using input from prior neurons to influence the current input and output of the neurons.
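To make the gradient descent update described earlier in this subsection concrete, the following is a minimal sketch in plain Python. It is not taken from any of the surveyed papers: the function names (`gradient_descent`, `toy_grad`), the toy quadratic loss, and the chosen learning rate and step count are assumptions made purely for illustration.

```python
def gradient_descent(grad_fn, w0, lr=0.1, steps=100):
    """Repeatedly step against the gradient: W <- W - lr * grad(W)."""
    w = list(w0)
    for _ in range(steps):
        g = grad_fn(w)
        # Move each weight in the direction opposite to its gradient.
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def toy_grad(w):
    """Gradient of the assumed toy loss L(W) = (w0 - 3)^2 + (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


# Starting from the origin, the iterates approach the minimizer (3, -1).
w_star = gradient_descent(toy_grad, [0.0, 0.0], lr=0.1, steps=200)
print(w_star)
```

In this toy setting, a larger learning rate takes bigger steps per iteration, and a value that is too large would make the iterates diverge rather than converge.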

Federated learning
Federated Learning (FL) is a distributed machine learning setting where N entities (clients) train a joint model under the coordination of a central server [61]. The training process starts with the server initializing a model, then going through several rounds of training (aka communication rounds); each training round t consists of the following steps: 1) The server samples a subset of $K^t \le N$ clients to participate in the training round, 2) The server disseminates the model and training algorithm to the selected clients, 3) Clients train the model locally on their own data, 4) Clients share (only) the resulting model updates with the server, and 5) The server aggregates the updates to derive a new updated global model as follows

$$W^{t+1} = \sum_{k=1}^{K^t} \frac{n_k}{n} W_k^{t+1},$$

where $W_k^{t+1}$ is the update of client k, $n_k$ is the number of data samples of client k, and n is the total number of data samples. FL is mainly employed for large-scale applications, where a massive number of clients participate in training the joint model. As the clients typically have varying amounts of data, the training data is unbalanced. Moreover, the data of a specific client is typically generated based on that client's activity, which does not necessarily represent the distribution of the data of all clients [87]. In the literature, it was demonstrated that FL is robust to these characteristics and can effectively lead to model convergence [87].
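The client sampling in step 1 and the weighted aggregation in step 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in any specific FL framework: the helper names (`sample_clients`, `fedavg`) are hypothetical, model weights are represented as flat lists of floats, and n is taken here as the total number of samples held by the clients that participate in the round.

```python
import random


def sample_clients(num_clients, k, rng):
    """Step 1: the server samples k distinct client indices out of num_clients."""
    return rng.sample(range(num_clients), k)


def fedavg(client_weights, client_sizes):
    """Step 5: average the client weights, each weighted by n_k / n."""
    n = sum(client_sizes)  # assumption: total over the participating clients
    dim = len(client_weights[0])
    aggregated = [0.0] * dim
    for w_k, n_k in zip(client_weights, client_sizes):
        for j in range(dim):
            aggregated[j] += (n_k / n) * w_k[j]
    return aggregated


rng = random.Random(0)
selected = sample_clients(num_clients=10, k=2, rng=rng)

# Three hypothetical client updates with 10, 30, and 60 local samples.
updates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sizes = [10, 30, 60]
global_w = fedavg(updates, sizes)
print(selected, global_w)
```

Because the weights are proportional to the local dataset sizes, a client holding more data has a proportionally larger influence on the aggregated model.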
Unlike centralized machine learning, where the clients' data needs to be collected at a central server, FL allows clients to maintain their data locally while training the model in a distributed manner. FL is thus claimed to mitigate some privacy risks. However, the distributed nature of FL increases the attack surface (see Fig. 1). Mainly, there are three potential attack points.
Curious server. A central server coordinates the training process and performs the core functionality of FL (as shown in the aforementioned five-step workflow). Such a scheme concentrates all the control in the hands of a sole entity, the server. From a security and privacy perspective, this centralization of control can be seen as a major weakness. This is because using services based on FL requires clients to trust the server to perform the FL functions correctly and apply best privacy practices to protect their updates. In this regard, if the server is malicious, various attacks can be carried out against the clients (e.g., reconstruction attacks [140], [161]). More details on the privacy implications of the centralized coordination in FL can be found in [134].
Compromised client. Furthermore, the distributed nature of FL and the active involvement of the clients in the training process open the door for attacks launched by malicious clients. In FL, the model is sent to the clients' devices, where the clients are typically granted full access to the model parameters. This access privilege, in turn, amplifies the malicious clients' capabilities and enables them to perform sophisticated attacks. These attacks may target the model (e.g., poisoning attacks [13]) or the privacy of the participating clients (e.g., membership inference attack [94]).
Figure 1. Federated learning overview with three potential adversary access points (in red): curious server, compromised client, and external eavesdropper. Model updates (or gradients) are generated by individual users and shared with a central server that aggregates all the updates into a global model.
External eavesdropper. It was believed in the past that the gradients shared between the server and clients do not leak information about the client training data [161]. However, recent attacks demonstrated that external eavesdroppers who have access to these gradients can reconstruct the client data (e.g., leakage from gradients attacks [159], [161]).

Review studies
Several studies in the literature provide overviews of the privacy and security issues in FL, either as a part of general analysis for ML applications or as a specific analysis for the FL setting.
Privacy and Security in ML. Due to the growing recognition of the threats that ML systems might face, many researchers present surveys and systematizations of these threats, such as Papernot et al. [101], Al-Rubaie et al. [2], De Cristofaro [29], Rigaki et al. [109], and Zhang et al. [156]. Mainly, these studies focus on the privacy aspect in centralized training and only address FL to a limited extent. Other surveys specifically tackle the privacy and security of deep learning models, namely the works of Mirshghallah et al. [90] and Liu et al. [78].
Privacy and Security in FL. Enthoven et al. [32] presented a structured overview of privacy attacks and defense mechanisms in FL, but only for deep learning models. Lyu et al. [83] elaborated additionally on security attacks and pointed out weaknesses in current countermeasures through a qualitative analysis of the literature. A concise taxonomy of attacks on FL was introduced by Jere et al. [60]. Kairouz et al. [61] presented an extensive report of open issues and challenges in FL, including privacy and security issues. However, the extent to which these issues are applicable in real-world scenarios is discussed only briefly. Similar but less detailed surveys were also published [3], [72].
The aforementioned studies provide very valuable insights into the security and privacy in ML and FL by summarizing and systematizing the existing research in this field. However, a quantitative analysis of the publications, which would make it possible to highlight current trends and derive research gaps, is still missing. In addition, little attention is paid in the existing studies to the applicability of the threats. In this paper, we address these two main aspects by conducting an SMS and discussing the applicability of the posed threats.

Introduction to systematic mapping studies
Systematic mapping [102], [103] is a secondary study method that establishes classification schemes and structures in a research field. The analysis of the study focuses on the frequency of publications in each of the defined categories. Such an analysis provides valuable insights about the progress, foci, and gaps of the research field. These insights are not covered by the commonly used systematic literature review (SLR) method, which focuses on surveying primary studies to collect evidence concerning existing solutions [63], while overlooking the frequency of publications. The goals of SMSs include [103]: (1) provide classification(s) and a taxonomy, (2) identify research gaps in the existing literature and possible future research, and (3) identify trends and future research directions. To conduct an SMS, the following steps are taken (see Fig. 2):
1) Define a set of research questions to be answered by the analysis of the study.
2) Conduct a search to find the relevant papers.
3) Refine the selection of papers by following inclusion and exclusion criteria.
4) Define classification schemes to structure the papers into categories.
5) Map the papers to the defined categories.
6) Answer the research questions by analyzing the frequency of papers appearing in the defined categories.
SMSs are widely used in medical research [102] and also gained traction in the field of software engineering (e.g., [24], [43]).

Method
Petersen et al. [102], [103] proposed a guideline for conducting SMSs, which serves as a basis for our study outlined in this section.

Objectives and research questions
In this study, we aim to identify subliminal foci in the existing literature as well as open problems and directions in the research area of attacks against FL. As a basis, the attacks proposed in 48 scientific papers identified as

Search strategy
To set up the main search process, a brief pilot search (pre1 in Fig. 3) is carried out, where an initial set of relevant papers is collected. These papers allow us to determine relevant keywords and search terms as well as suitable venues to target in the main search process, and to identify a set of well-known authors in this research area. For this pilot search, Google Scholar is used since it is one of the indexing databases that cover a large number of publishers. Several search queries are applied using combinations of keywords that have been shown to be important in the research field at hand, including "federated learning", "privacy attack", and "security attack". In the following, we elaborate on the main search phase, which consists of two main steps: automatic and manual search.
1) Automatic Search (src2.1 in Fig. 3): We conduct the automatic search by relying on several popular search engines, namely ACM Digital Library, IEEE Xplore, Google Scholar, and arXiv. ACM Digital Library and IEEE Xplore are considered as they cover the key research communities (i.e., ML and security communities) and the most cited publications from ACM and the IEEE Computer Society. In addition, Google Scholar is used to ensure comprehensive results and avoid any bias towards specific publishers. Furthermore, using arXiv helps to cover the most recent advancements, which are not yet accepted for publication. The FL-related keywords are "federated learning" and "collaborative learning". Additionally, precise keywords related to attacks, namely "inference attack", "privacy attack", and "poisoning attack", are also added to the search. Subsequently, a search string was composed using the updated keywords. The results of the automatic search are shown in Table 1.
2) Manual Search: The venues considered in the manual search are listed in Table 2. As a complementary procedure, a number of well-known researchers in the field (e.g., H. Brendan McMahan) are identified and their publications (on Google Scholar, private webpages, university webpages) are tracked. The manual search resulted in identifying 20 potentially relevant articles.
The combined total number of papers gathered from the automatic and manual searches is 756. These papers are found to contain our search strings or to have relevant titles. However, this is not sufficient to consider them in our study. Therefore, it is crucial to specify strict criteria to select the relevant papers among them.

Selection process
As depicted in Fig. 3, after the search, the selection process is conducted. This process consists of two steps: (1) Applying inclusion and exclusion criteria and (2) performing a complementary forward and backward snowballing search.
The following inclusion and exclusion criteria are applied to titles and abstracts. In those cases, where the title and abstract do not provide enough information, the body of the paper is considered. After filtering the papers based on the inclusion and exclusion criteria, forward and backward snowballing techniques are applied. Forward snowballing identifies papers that have cited the papers found in the search phase. As the majority of the selected papers after filtering are quite recent (2019-2020), they have not yet been cited by many other papers. Therefore, the focus was more on backward snowballing (sel2 in Fig. 3). In this technique, the lists of references in the selected papers are reviewed and the relevant papers are added to our list. Applying this technique resulted in adding only a few more new papers, as our automatic and manual searches already covered almost all the relevant sources.
The final number of papers considered for our analysis is 48. It is worth mentioning that after applying the inclusion and exclusion criteria, there was a remarkable drop in the number of papers. That is due to the fact that there is a huge number of papers that refer to FL and its attacks, but do not contribute to this topic.

Information extraction and classification
Each of the final selected papers is examined in detail to extract information about the proposed attacks and the evaluation practices, in addition to metadata about the paper, e.g., venue of publication. This phase of information extraction is combined with proposing classification schemes that help to categorize the papers. In the following section, three main classification schemes are presented that correspond to our research questions.

Results of the Mapping
In this section, we elaborate on the classification schemes and the results of the paper mapping. The frequency of papers in each category is presented as the exact number of papers and the percentage with respect to the total number of papers (48). The results are also provided in detail in Tables 4 and 5. This section is structured along the previously established research questions.
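The frequency reporting used throughout this section (a count plus a percentage of the 48 papers, with non-exclusive categories) can be sketched as below. The paper identifiers and their category assignments in the example are hypothetical placeholders rather than the actual mapping, and the helper names are assumptions; only the final call reuses a count reported in the text (44 of 48 papers).

```python
def category_frequencies(paper_categories):
    """Count papers per category; a paper may fall into several categories."""
    counts = {}
    for categories in paper_categories.values():
        for category in set(categories):  # count each paper once per category
            counts[category] = counts.get(category, 0) + 1
    total = len(paper_categories)
    return {c: (n, round(100 * n / total)) for c, n in counts.items()}


def format_count(count, total):
    """Render a frequency in the 'count (percent%)' style used in the text."""
    return f"{count} ({round(100 * count / total)}%)"


# Hypothetical mapping of four papers to adversary access points.
mapping = {
    "paper_a": ["client"],
    "paper_b": ["client", "server"],
    "paper_c": ["client", "eavesdropper"],
    "paper_d": ["server"],
}
freqs = category_frequencies(mapping)
print(freqs)
print(format_count(44, 48))
```

Since categories are not mutually exclusive, the per-category counts can sum to more than the number of papers, which is why percentages are always computed against the total number of papers rather than the sum of counts.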

Research trends
We investigate the trends in the research field through four aspects, namely: the year of publications, the affiliations of the researchers, the venues they target to publish their works, and lastly, the type of research conducted according to Wieringa et al. [142].
Year of publication. In Fig. 4, we see the publication year for the studied papers. The first attack against FL was published in 2017 by Hitaj et al. [54]. In the following years, the number of attacks increased remarkably. This reflects the growing attention towards FL in general and its privacy and security issues in particular. To an external observer, this considerable number of attacks can also be seen as an indication of the abundance of FL vulnerabilities. Overall, FL is a hot topic and the number of its applications is growing; therefore, investigating its weaknesses becomes crucial, which will likely lead to more studies on attacks in the upcoming years.
Author affiliations. Our mapping study illustrates that most of the attacks (41, i.e., 85%) come from academia. This can be due to the fact that researchers in academia freely explore the possibilities to hack technologies and then propose mitigation measures, while industry tends to focus on making their services more robust and secure. That is evident from the substantial number of papers on defense mechanisms from industry, especially Google [4], [8], [11], [15], [88], whereas only a few attacks (2, i.e., 4%) were proposed by industry. Joint projects between academia and industry also exist, with 5 (10%) papers.
Venue type. In our study, we take into account peer-reviewed venues (journals, conferences, and workshops) as shown in Table 2, in addition to public repositories (arXiv). The distribution of papers among these venues is depicted in Fig. 5. We can see the tendency of the community to push their studies to public repositories, where 30 (62%) of the papers are found. This can be due to the fast pace of publications in this field, which urges researchers to share their ideas and results promptly as preprints. Out of these, 12 (25%) are simultaneously published in a peer-reviewed venue, mainly conferences. After arXiv, the conference papers come first with 23 (48%) papers. The low number of publications in journals might be a result of the novelty of the FL concept and the rapid development of its attacks.
Research type. To identify the research characteristics in this field, we categorize the papers based on the type of the conducted research. We adopt the research types proposed by Wieringa et al. [142].
• Solution: Proposes an approach to solve a problem. The approach can be novel or improves on existing ones. The proposed approach should be supported by good arguments or by other means.
• Validation: Investigates the validity of a novel approach that has not yet been "realized". The validation can be performed through experiments, simulations, mathematical proofs, etc.
• Evaluation: Studies the properties of an existing approach (analyze, assess, and evaluate) to achieve a better understanding of its potentials and limitations.
• Philosophical: Provides new insights, new way of thinking, or a new conceptual view of research.
• Opinion: States the authors' position towards a specific topic without introducing any research results.
• Experience: Describes the personal experience of the authors in conducting "a practice".
Our mapping shows that the studied papers fall into only two categories, namely Solution and Evaluation, with 37 (77%) and 11 (23%), respectively. On the one hand, the novelty of this research field can be a reason for the abundance of papers within the Solution category, since there are many privacy and security aspects that need to be addressed. On the other hand, this novelty may explain the absence of papers from other categories, such as Experience, which typically requires more time to put the research approaches into practice and develop experience in the domain.

Attack types
To identify the types and properties of these attacks, we consider several aspects, namely attack's purpose, mode, observation, and access point [29], [94]. Next, we introduce the common attack categories with respect to each of the aspects. The distribution of publications among these categories is depicted in Fig. 6.
Purpose. The attack's purposes can be classified into two main categories.
1) Privacy attacks (inference attacks): These attacks extract information about the training dataset, i.e., user data [29], and fall into three groups based on the obtained information:
• Membership inference: The adversary aims to determine whether a particular individual (or a data record) belongs to the training dataset [119].
• Property inference: The adversary aims to infer features of the training dataset, where these features are not intended to be used for the main task of the model [39].
• Model inversion (attribute inference): The adversary aims to infer sensitive features used as input to the model [54].
2) Poisoning attacks: The adversary maliciously alters the model to achieve one of the following goals.
• Model corruption (label-flipping): The adversary corrupts the model to reduce its overall accuracy in its main task. This attack can target specific classes or be untargeted [128].
• Backdoor: The adversary implants a backdoor subtask in the model while maintaining a good accuracy of the main task. This backdoor is used later in the production phase to exploit the model, e.g., by forcing misclassification of a specific input [9].
Our mapping shows in Fig. 6 that the majority of the papers, 29 (60%), focus on privacy attacks, while 19 (39%) focus on poisoning attacks. This may be explained by the fact that FL is mainly promoted as a mitigation for several privacy risks [87]. Therefore, many researchers investigate the potentials and limitations of privacy in FL by crafting various attacks. Among the different types of attacks, the ones that dominate the research publications are model inversion with 18 (38%) papers and backdoor with 13 (27%) papers. Model inversion is one of the most severe attacks, since the adversary, in some cases, can fully reconstruct the client data. Backdoors are quite powerful in manipulating the model performance in the production phase, which might leave a long-term impact on the systems.
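As a purely illustrative sketch of the label-flipping idea behind the model corruption category above, the snippet below relabels every sample of one class as another class in a client's local label list. The function name `flip_labels`, the class indices, and the toy labels are assumptions for illustration; the surveyed papers use their own, typically more elaborate, procedures.

```python
def flip_labels(labels, source_class, target_class):
    """Return a copy of labels where source_class is relabeled as target_class."""
    return [target_class if y == source_class else y for y in labels]


# Hypothetical local labels of a single client (class indices 0-9).
local_labels = [0, 1, 7, 7, 3]

# Targeted variant: every sample of class 7 is relabeled as class 1.
flipped = flip_labels(local_labels, source_class=7, target_class=1)
print(flipped)
```

A model trained on such relabeled data is pushed towards confusing the source class with the target class, which is the targeted form of the attack; the untargeted form would instead aim to reduce accuracy across all classes.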
Mode. An adversary might act in two different modes. 1) Passive: The adversary attempts to learn from the observed information, without interrupting or deviating from the regular training process. This mode is widely common in privacy attacks [157], [161]. 2) Active: The adversary acts maliciously in the training process, e.g., they manipulate the training data or model updates. This mode is needed for poisoning attacks [9], [13]. This distribution can be correlated with the capabilities of the adversary in the two modes, i.e., in the active mode, the adversary is more powerful; thus, a wider variety of attacks can be performed.
Observation. The adversary's capability to observe the parameters of the target model might vary among different attacks. We consider two possibilities. 1) Black-box: The adversary can query the model and thus knows the inference result of a particular input. However, they do not observe the model parameters [35]. 2) White-box: The adversary can observe the model parameters [140]. This capability typically enables adversaries to carry out more sophisticated attacks. As the model parameters are typically shared between the server and all the clients in FL, most of the attacks, 44 (92%), assume the white-box scenario. The black-box scenario is considered in only 9 (19%) attacks.
Access point. The adversary might exist at different locations with different roles in the system. In FL, the adversary can be (1) a curious server, (2) a compromised client, or (3) an external eavesdropper (see Section 2.2). Fig. 6 illustrates that 44 (92%) attacks are conducted by clients, while only 15 (31%) attacks assume the server to be malicious or curious, and 10 (21%) papers include attacks that can be carried out by external eavesdroppers. This reflects a keen interest in the attacks from the client side, because these attacks are mainly facilitated by the distributed nature of FL.

Common evaluation setups
The effectiveness of the proposed attacks is mostly demonstrated through an empirical evaluation. This evaluation needs to be extensive and comprehensive to provide sufficient evidence for the attack validity under specific settings. In this section, we examine the experimental settings commonly used for evaluating FL attacks by looking into four aspects, namely target models, datasets, countermeasures, and implementation technologies.
Target models. We refer here to the joint ML model that is trained through the FL process and is thus targeted by the attack. The type of the model can vary, as FL by definition is not restricted to specific types. The attacks might be designed to target one or multiple model types, or they can be completely model-agnostic. On a high level, we can classify the target models in the literature into NN models and non-NN models.
The mapping results reveal that only 3 (6%) attacks target non-NN models. As shown in Fig. 7, these three attacks consider logistic regression (LR) [35], [82], while only one of them targets also decision tree (DT), and random forest (RF) [82].
Figure 6. Attack classification with papers distribution. The percentage is with regard to the total number of papers 48. Most categories are not exclusive, therefore, the papers might sum up to more than 48. More details on the individual papers can be found in Table 4.
Long Short-Term Memory) and Autoencoder (AE) (e.g., Transformers) were the target of attacks only twice in the literature. Furthermore, one attack is claimed to be model-agnostic [84].
Datasets. To train the target model, various datasets were used in the literature. These datasets can be categorized into three groups based on the type of the data: Text, image, and key-value pairs. In total, 46 distinct datasets were used in the attack evaluations, and they are:
1) Text: CLiPS Stylometry Investigation [132], Yelpauthor [58], Reddit [46], Amazon Review [98], Yahoo Answers [108].
2) Image: MNIST [70], Fashion-MNIST [144], LFW [124], CelebA [153], AT&T [113], CIFAR [65], CH-MNIST [105], ChestX-ray8 [138], EndAD [104], EMNIST [23], Fer-2013 [44], HAM10000 [131], ImageNet [30], PIPA [158], SVHN [95], PubFig [66], Omniglot [68], mini-ImageNet [133], VGG2Face [18], fMRI [21], CASIA [73], Face [107], CINIC [26], Breast Histopathology Images [92].
3) Key-value: Purchase [119], BC Wisconsin [69], Adult [31], FourSquare [148], Human Activity Recognition [5], Landmine [125], Texas-100 [119], UNSW-
Fig. 8, where MNIST and CIFAR are the most common ones, used in 27 (56%) and 22 (46%) attacks, respectively. This conforms also with the common datasets in the ML community [47]. The popularity of MNIST can be due to several reasons, e.g., its small size, such that researchers can train their models quickly and report results.
In addition, both MNIST and CIFAR are widely supported by many ML frameworks and can therefore be used easily [144].
Figure 9. Countermeasures classification with the papers distribution. The percentage is with regard to the total number of papers 48. Some categories are not exclusive, therefore, the papers might sum up to more than 48. More details on the individual papers can be found in Table 5. Only 55% of the attacks are evaluated against countermeasures. Noisy updates is the most used technique with 27%.
Countermeasures. One of the main methods to evaluate the proposed attacks is measuring their effectiveness against the state-of-the-art defense mechanisms. We explore the mechanisms used in the examined papers; they can be classified into three main categories.
1) Perturbation: This mechanism reduces the information leakage about the clients in FL by applying one of the following perturbation techniques.
• Noisy updates: A client may add noise to their data [106] or the updates before sending them to the server [161]. The noise can also be added on the server side [88]. The amount of noise can be carefully specified to achieve differential privacy.
• Restricted updates: Before sharing the updates with the server, a client can limit the number of updates [118], or compress the updates, e.g., by applying quantization [64].
• Regularization: While training the model locally on the client's device, the client can apply regularization techniques such as dropout, and batch normalization [130].
2) Cryptographic approaches: Exposing the updates of an individual client can lead to severe information leakage about their training data [161]. Several techniques based on cryptography are proposed to mitigate this risk.
• Homomorphic encryption: A client can encrypt their updates before sending them to the server. Then, the server computes the aggregation of the encrypted updates from all clients [6].
• Secret sharing: Clients can encrypt their updates with keys derived from shared secrets. The server aggregates the encrypted updates, then decrypts the result of the aggregation only after receiving a sufficient number of the shared secrets [15].
• Trusted execution environment (TEE): The aggregation process on the server can be moved into a TEE, such that the executed code can be attested and verified to not leak individual clients' updates [61].
3) Sanitization: This mechanism is proposed to mitigate poisoning attacks. In this respect, two defense mechanisms have been developed in the literature:
• Robust aggregation: To limit the impact of malicious updates on the global model, aggregation methods such as trimming the mean and calculating the median [152] are proposed.
• Anomaly detection: Here, the malicious updates are assumed to be anomalies. To identify the anomalous updates (outliers), various techniques can be used, such as clustering [117] or measuring similarity with a reference set of samples [19].
The mapping results are depicted in Fig. 9, where we see that perturbation and sanitization are commonly used to evaluate attacks in FL, in 16 (33%) and 14 (29%) of the papers. This corresponds with the view that many researchers have about perturbation techniques (particularly, differential privacy) as the de facto standard for privacy-preserving ML [71]. Another reason for the high popularity of perturbation techniques can be that they were subject to intensive research not just in the FL community, but in the ML community in general. In contrast, sanitization is limited only to the FL setting. On the other hand, cryptography-based approaches are discussed in 3 (7%) papers.
This can be due to the fact that the security and privacy guarantees of these approaches are validated through formal proofs, thus, no empirical experiments are required.
Implementation technologies. To ease the reproducibility of the evaluation results, researchers are encouraged to share appropriate descriptions of their implementations along with their source code [25]. In order to learn about the status of the selected papers in this respect, two factors are examined:
• Technologies description: Here, we check whether or not the authors state clearly which technologies they use to implement their experiments, such as programming languages and libraries.
• Source code availability: We check whether or not the source code has been made publicly available.
Table 3 shows the number of papers that reveal information about the technologies used in their implementation. Particularly notable is the popularity of Python as a programming language and of PyTorch as a specific Python package in this field. The large share of Python-based implementations can be due to the fact that Python is easy to use and provides a large number of packages for ML tasks. PyTorch is user-friendly and suitable for creating custom models; for these and other reasons, it is widely used in the ML research community. On the other hand, 23 (48%) papers do not reveal any information about the technologies used in their implementation. Moreover, the mapping shows that the source code of only 6 (14%) papers has been shared publicly.
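To make the robust aggregation idea from the sanitization category above more concrete, the following is a minimal sketch of a coordinate-wise median and a coordinate-wise trimmed mean over client updates. The function names, the toy update vectors, and the `trim` parameter are assumptions chosen for illustration; the sketch is not the implementation of [152] or of any surveyed paper.

```python
from statistics import median


def plain_mean(updates):
    """Coordinate-wise mean, i.e., aggregation without any sanitization."""
    return [sum(column) / len(column) for column in zip(*updates)]


def coordinate_median(updates):
    """Coordinate-wise median of equally long update vectors."""
    return [median(column) for column in zip(*updates)]


def trimmed_mean(updates, trim):
    """Coordinate-wise mean after dropping the `trim` smallest and largest values."""
    if 2 * trim >= len(updates):
        raise ValueError("trim would remove all updates")
    result = []
    for column in zip(*updates):
        kept = sorted(column)[trim:len(column) - trim]
        result.append(sum(kept) / len(kept))
    return result


# Toy example (hypothetical values): four benign updates and one outlier.
updates = [
    [1.0, 2.0],
    [1.0, 2.0],
    [2.0, 3.0],
    [2.0, 3.0],
    [100.0, -100.0],  # malicious update
]
print("mean:        ", plain_mean(updates))
print("median:      ", coordinate_median(updates))
print("trimmed mean:", trimmed_mean(updates, trim=1))
```

In this toy setting the plain mean is pulled far away from the benign values by the single outlier, while the median and the trimmed mean stay within the benign range. Whether this holds in a given FL deployment depends on the fraction of malicious clients and on the data distribution.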

Discussion
In this section, first, we derive gaps in the research field from the mapping results in Section 4. Second, we highlight several special assumptions made in the problem settings of some papers that might reduce the applicability of the attacks in real-world scenarios. Third, we identify fallacies in the evaluation of the attacks and discuss their implications on the generalization of the results.

Main research gaps
We base our discussion here on the results of Section 4. In addition, we look at how the papers are distributed over pairs of categories by means of bubble charts. An example is Figure 10, which shows how attacks with specific purposes are distributed with regard to the access point. It is worth mentioning that the categories in some classification schemes are not disjoint; therefore, the total number of publications may sum up to more than 48.

Gap 1. Little research is conducted about attacks on the server side and by eavesdroppers.
Description. Fig. 10 illustrates that membership, property inference, model corruption, and backdoor attacks are rarely studied on the server side or with an eavesdropper adversary. This might be due to two reasons. First, it is widely assumed in the literature that FL is coordinated by a trusted server. Second, approaches that protect against curious servers and eavesdroppers, such as secure aggregation [15], were proposed and widely used by the research community because of the firm protection guarantees they achieve. However, despite the improvements, applying such approaches still incurs non-negligible overhead [122], which leaves open questions about their efficiency in real-world applications.
Implications. Servers (service providers) are typically assumed to be better equipped to repel attacks than clients. However, numerous events in recent years have shown that providers were subject to many successful attacks in which user data was breached [86]. Therefore, it is of high importance to study how attacks by a curious or compromised server can impact the FL process. We argue that attacks on the server side are becoming even more relevant in FL, especially considering the emergence of FL in different architectures, such as hierarchies in edge networks [55], [77], [116]. In such environments, multiple entities play the role of intermediate servers, i.e., they collect and aggregate the updates from clients, thus introducing more server-type access points. For eavesdroppers, recent model inversion attacks on gradients were shown to successfully reconstruct user training data [159], [161]. This opens the door for more investigations into how gradients or model updates can be exploited to apply other attack types, especially privacy attacks.
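The protection idea behind secure aggregation can be illustrated with pairwise additive masking: every pair of clients agrees on a random mask that one client adds and the other subtracts, so the masks cancel in the sum while each individual masked update looks random to the server. The sketch below is a strongly simplified toy version under stated assumptions: integer-encoded updates, a hypothetical modulus `MODULUS`, and a single seeded generator standing in for the pairwise key agreement. It omits the key exchange, dropout handling, and secret sharing that a protocol such as [15] requires.

```python
import random

MODULUS = 2**32  # assumed modulus for the toy integer encoding


def masked_updates(updates, seed=0):
    """Return every client's update after adding pairwise cancelling masks."""
    # One seeded generator stands in for secrets agreed between client pairs.
    rng = random.Random(seed)
    num_clients, dim = len(updates), len(updates[0])
    masked = [list(update) for update in updates]
    for i in range(num_clients):
        for j in range(i + 1, num_clients):
            mask = [rng.randrange(MODULUS) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] = (masked[i][k] + mask[k]) % MODULUS  # client i adds
                masked[j][k] = (masked[j][k] - mask[k]) % MODULUS  # client j subtracts
    return masked


def aggregate(vectors):
    """Server-side sum of all received vectors, modulo MODULUS."""
    return [sum(column) % MODULUS for column in zip(*vectors)]


# Toy integer updates of three clients (hypothetical values).
updates = [[3, 5], [10, 20], [7, 1]]
masked = masked_updates(updates)
print("masked update of client 0:", masked[0])
print("aggregate of masked updates:", aggregate(masked))
print("aggregate of plain updates: ", aggregate(updates))
```

The two aggregates coincide because every mask is added once and subtracted once. The quadratic number of pairwise masks in this sketch also hints at where the overhead mentioned above comes from.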
Gap 2. Very little effort is devoted to studying attacks on ML functions other than classification.
Description. ML models can be used to fulfill a variety of functions, such as classification, regression, ranking, clustering, and generation. However, our SMS shows that there is a heavy bias towards the classification function with 46 (96%) of the attacks. Other functions, namely regression, generation, and clustering were addressed in only 4 (9%), 1 (2%), and 1 (2%) attacks, respectively.
Implications. This gap introduces a lack of knowledge with respect to a large spectrum of models and applications that have different functions than classification. These functions are of high importance in many domains, e.g., ranking in natural language processing [151] and recommender systems [100]. It is an open question how the existing attacks impact these functions. It is worth mentioning that a similar gap was also observed for adversarial attacks in general ML settings by Papernot et al. [101].
Gap 3. There is a lack of research about attacks on ML models other than CNNs.
Description. Although FL is not restricted to NN models, we have seen in the previous section that only 3 (6%) attacks target non-NN models. Taking a closer look, we depict in Fig. 11 the types of models targeted by the different attacks. We notice that non-NN models were never targeted by membership inference or backdoor attacks. For NN models, we observe that RNNs were not studied under any type of privacy attack or model corruption attack. Additionally, no research has been carried out yet on backdoors for DNNs. AEs have also received very little attention, with only 2 privacy attacks. Overall, this illustrates the limited diversity in the literature considering the targeted model types.
Implications. NN models are the state of the art in several applications, e.g., face recognition [10]. However, other ML models are still of high value and usage in real-world systems, e.g., genome analysis [28], culvert inspection [40], and autocompletion suggestions filtering [150], to name a few.
Within the NN models, there are a variety of network architectures, and as we show above many of these architectures are not well covered in the evaluation of the attacks, even architectures that are widely used in several applications, e.g., RNN, which is used in Gboard [48]. Consequently, the evaluations of the proposed attacks fall short of providing evidence on how the attacks will perform against other network architectures.
Overall, we notice very little effort devoted to studying the influence of using different model architectures on the effectiveness of the proposed attacks. Only in one paper [41], the authors adequately analyzed the effects of the NN architecture on the success of their attack. Covering this aspect in the evaluation of the attacks is essential to improve the generalizability of the findings.

Special assumptions in problem settings
There are a number of attacks that succeed only under special assumptions. These assumptions do not apply in many real-world scenarios; consequently, the applicability of these attacks becomes limited. Here, we highlight the issues of these assumptions and discuss their implications.
Assumption Issue 1. The attacks are effective only under special values of the hyper-parameters of NN models.
Description. As described in Section 2, the hyper-parameters of NN models include, among others, the batch size, learning rate, activation function, and loss function. Tuning the hyper-parameters is crucial to achieving high accuracy in the learning task, especially when comparing different models. Dacrema et al. [25] showed that the lack of optimization for the hyper-parameters of the baselines leads to phantom progress in the field of neural recommender systems. Therefore, the hyper-parameters need to be carefully and fairly optimized to meet the application requirements. On the contrary, we found in the studied papers several assumptions on special values of hyper-parameters that are not commonly used or might contradict the application requirements. The reason is that the effectiveness of some proposed attacks is highly influenced by hyper-parameters, and these attacks are possible only under such special assumptions.
Implications. In some model inversion attacks, the gradients are used to reconstruct the training data. Zhu et al. [161] and Wei et al. [141] show that their attacks perform well only when the gradients are generated from a batch size < 8. Zhao et al. [159] propose an attack to extract the labels of the clients from gradients. However, the attack works only when the batch size is 1, which is an exceptional and uncommon value. Hitaj et al. [54] also used a batch size of 1 to evaluate their attack on the AT&T dataset.
Using small batches leads to inaccurate estimates of the gradient, which in turn causes less stable learning. Additionally, it requires more computation power to perform a large number of iterations, where gradients need to be calculated and applied every time to update the weights. As FL pushes the training to the client device, it is essential to consider the limited resources of the client devices. Therefore, the efficiency of the local training process is an important requirement. That is, very small batch sizes (< 8) increase the computational overhead and are therefore not preferable for FL applications.
Although it is insightful to point out the vulnerabilities that some special hyper-parameters might introduce, it is of high importance to discuss the relevance of these hyperparameters to real-world problems.
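The role of the batch size can be illustrated with a simplified example of label leakage. For softmax cross-entropy, the gradient with respect to the output-layer bias equals the predicted probabilities minus the one-hot label, averaged over the batch. With a batch size of 1, exactly one entry is negative and it sits at the true label; with larger batches, the averaging blurs this signal. The sketch below is an illustration under these assumptions with hypothetical logits and labels; it is not the implementation of the attack in [159].

```python
import math


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bias_gradient(batch_logits, labels):
    """Gradient of the mean cross-entropy loss w.r.t. the output-layer bias."""
    num_classes = len(batch_logits[0])
    grad = [0.0] * num_classes
    for logits, label in zip(batch_logits, labels):
        probs = softmax(logits)
        for c in range(num_classes):
            one_hot = 1.0 if c == label else 0.0
            grad[c] += (probs[c] - one_hot) / len(labels)
    return grad


def negative_entries(grad):
    """Classes whose gradient entry is negative."""
    return [c for c, g in enumerate(grad) if g < 0]


# Batch size 1: the only negative entry reveals the label (here: class 2).
single = bias_gradient([[0.2, 1.5, -0.3, 0.1]], labels=[2])
print(negative_entries(single))

# Batch size 3: negative entries hint at which classes occur in the batch,
# but neither the number of samples per class nor the assignment of labels
# to samples can be read off directly.
batch = bias_gradient(
    [[0.2, 1.5, -0.3, 0.1], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.5, 0.2]],
    labels=[2, 0, 2],
)
print(negative_entries(batch))
```

In general, a class that occurs in a larger batch does not even need to produce a negative entry, because confident predictions of other samples for that class can outweigh it in the average.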
Assumption Issue 2. The attacks succeed only when a considerable fraction of clients are malicious and participate frequently in the training rounds.
Description. In cross-device FL, a massive number of clients (up to $10^{10}$) form the population of the application. Out of these clients, the server selects a subset of clients ($\sim 100$ [150]) randomly for every training round to train the model locally and share their updates [87]. This random sampling is assumed to be uniform (i.e., the probability for a client to participate is $1/\text{client population}$) to achieve certain privacy guarantees for clients, in particular differential privacy [1]. Under these conditions, it is rather unlikely for a specific client to participate in a large number of training rounds ($\gg \frac{\text{total number of rounds}}{\text{client population}}$) or in consecutive ones. However, this was found as an assumption in a number of papers to enable some privacy and poisoning attacks. Furthermore, several attacks require a large number of clients to collude and synchronize in order to launch an attack, which can also be tricky to achieve in some cases.
Implications. Hitaj et al. [54] assume that the adversary participates in more than 50 consecutive training rounds in order to carry out a reconstruction attack successfully. A stronger assumption was made by [154], namely that the adversary participates in all the rounds to poison the model. This requires the adversary to fulfill the FL training requirements [150] and to trick the server into selecting them frequently, which is a challenge per se considering the setting described above.
A backdoor attack by [9] was found to be effective when 1% of clients are malicious [126]. Considering a real-world FL application such as Gboard [48], which has more than 1 billion users [27], this means that the adversary needs to compromise $10^{7}$ user devices to apply this attack successfully. This in turn requires a very high effort and considerable resources, which might render the attack impractical.
It is true that the distributed nature of FL might enable malicious clients to be part of the system. However, the capabilities of these malicious clients to launch successful attacks need to be carefully discussed in the light of applied FL use cases, so that the risk of these attacks is not overestimated.
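The orders of magnitude discussed above can be checked with a short calculation. The sketch assumes that the server draws a fixed number of clients per round uniformly and without replacement, so a specific client is selected in a given round with probability clients-per-round divided by population. The concrete values (a population of one billion, 100 clients per round, 4000 rounds) are illustrative assumptions based on figures mentioned in this paper, not measurements.

```python
def selection_probability(population, clients_per_round):
    """Probability that one specific client is sampled in a given round."""
    return clients_per_round / population


def expected_participations(population, clients_per_round, rounds):
    """Expected number of rounds in which one specific client participates."""
    return rounds * selection_probability(population, clients_per_round)


def streak_probability(population, clients_per_round, streak):
    """Probability of being sampled in `streak` given consecutive rounds."""
    return selection_probability(population, clients_per_round) ** streak


def malicious_devices(population, percent):
    """Number of devices an adversary must control for a given percentage."""
    return population * percent // 100


population = 10**9  # assumed: order of magnitude of the Gboard user base
per_round = 100     # assumed: clients sampled per training round
rounds = 4000       # assumed: total number of training rounds

print("expected participations:", expected_participations(population, per_round, rounds))
print("two consecutive rounds: ", streak_probability(population, per_round, 2))
print("devices for a 1% share: ", malicious_devices(population, 1))
```

Under these assumptions, a single client is expected to participate in far less than one round overall, which illustrates why participation in dozens of consecutive rounds is a strong assumption unless the adversary can influence the client selection.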
Assumption Issue 3. The attacks can be performed when the data is distributed among clients in a specific way.
Description. FL enables clients to keep their data locally on their devices, i.e., the data remains distributed. This usually introduces two data properties. First, the data is non-IID, i.e., the data of an individual client is not representative of the population distribution. Second, the data is unbalanced, as different clients have different amounts of data [87]. In an ML classification task, for example, this may lead to some classes not being equally represented in the dataset. In any FL setting, it is essential to consider these two properties. While the meaning of IID and balanced data is clear, non-IID and unbalanced data distributions can be achieved in many ways [61]. In a number of papers, we found that specific distributions are assumed in order to enable the proposed attacks and to draw general conclusions.
Implications. A backdoor attack on a classification model by Bagdasaryan et al. [9] was claimed to achieve 100% accuracy on the backdoor task by one malicious client participating in one training round. However, in this work, it was assumed that only the adversary has the data of the backdoor label, which is a strong assumption according to [38], [126]. The massive number of clients in FL suggests that the client data covers all the model classes. Therefore, it should be considered that at least one honest client will have additional benign data for the backdoor label.
Another example is found in the model inversion attack of [54], where the authors assumed that all data of one class belongs to one client, and that the adversary is aware of that. Additionally, their attack works only when all the data of one class is similar (e.g., images of one digit in the MNIST dataset). These assumptions do not apply to many real-world scenarios and were therefore found unrealistic by [94]. Moreover, the model corruption attack introduced in [128] was launched under the setting of IID data, which contradicts the main FL assumptions. Similarly, Nasr et al. [94] evaluated their membership inference attack on a target model trained with balanced data. It is worth mentioning that Jayaraman et al. [59] showed that most membership inference attacks [79], [112], [119] for standalone learning also focus only on balanced distribution scenarios.
Overall, the way of implementing non-IID and unbalanced data distribution needs to be (1) discussed and justified in the light of the application to ensure a setup that is as realistic as possible, and (2) reflected clearly in the conclusions of the evaluation.
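As an illustration of how much freedom the experimenter has when simulating non-IID and unbalanced data, the sketch below partitions a labelled dataset among clients with label skew drawn from a Dirichlet distribution. This is one commonly used heuristic and is named here as an assumption; it is not taken from the surveyed papers, and the function name, the concentration parameter `alpha`, and the toy labels are hypothetical. Small values of `alpha` yield strongly skewed and unbalanced clients, large values approach an IID split.

```python
import random
from collections import Counter, defaultdict


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices among clients with label skew controlled by alpha."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # A Dirichlet sample is a normalised vector of Gamma(alpha, 1) draws.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        start, cumulative = 0, 0.0
        for client, weight in enumerate(weights):
            cumulative += weight / total
            if client == num_clients - 1:
                end = len(indices)  # last client takes the remainder
            else:
                end = round(cumulative * len(indices))
            clients[client].extend(indices[start:end])
            start = end
    return clients


# Toy dataset: 1000 samples with 10 evenly represented classes.
labels = [i % 10 for i in range(1000)]
for alpha in (0.1, 100.0):
    parts = dirichlet_partition(labels, num_clients=5, alpha=alpha)
    print("alpha =", alpha)
    for client, indices in enumerate(parts):
        histogram = Counter(labels[i] for i in indices)
        print("  client", client, "size", len(indices), dict(sorted(histogram.items())))
```

Two evaluations that both report a "non-IID" setup can therefore differ substantially, which is why the chosen partitioning scheme and its parameters should be stated and justified.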

Fallacies in evaluation setups
Designing a comprehensive and realistic experimental setup is essential to prove the applicability of the attack and the generalization of the conclusions. Although all the studied papers provide insightful evaluations of their proposed attacks, a number of practices were followed that might introduce fallacies. In this section, we set out to highlight this issue by identifying six fallacies. We discuss the implications of each fallacy on the evaluation results. Then, we propose a set of actionable recommendations to help to avoid it.
Description. The datasets are used to train and test the FL model, and also to evaluate the attack. These datasets need to be representative of the population targeted by the model. As we highlighted in Section 4, the majority of attacks are evaluated on the image classification task. Therefore, here we focus on the image-based datasets.
Despite the growing calls for decreasing the usage of simple datasets, in particular MNIST [144], it is still one of the most common datasets in the deep learning community [47]. This is due to several reasons such as its small size and the fact that it can be easily used in deep learning frameworks (e.g., TensorFlow, PyTorch) by means of helper functions [144].
MNIST was introduced by LeCun et al. [70] in 1998 and contains 70,000 gray-scale images of handwritten digits in the size of 28 × 28 pixels. Since then, substantial advances were made on deep learning algorithms and the available computational power. Consequently, MNIST became an inappropriate challenge for our modern toolset [49]. In addition, the complexity of images increased in modern computer vision tasks. That renders MNIST unrepresentative of these tasks [14].
Yet, the wide usage of MNIST is also observed in the examined papers, where more than 53% of the papers (see Fig. 8) use MNIST as the main dataset for evaluating the effectiveness of the proposed attacks. The second most common dataset is CIFAR, which is more complex in terms of data content. However, it is a thumbnail dataset, i.e., it contains images with a size of 32×32 pixels.
It is worth mentioning that in 41 (85%) of the papers the authors evaluated their attacks on more than one dataset, which is considered good practice. However, in a considerable number of papers, 15 (31%), the authors used only datasets that contain either simple or small (thumbnail) images.
Implications. Using oversimplified datasets can lead to the misestimation of the attack capabilities. For instance, the capabilities of privacy attacks to retrieve information about the dataset are tightly related to the nature of this dataset. Consequently, the complexity and size of the images in the dataset impact the attacks' success rate. It is clear that obtaining complex and bigger images requires higher capabilities. This is evident in the literature through several examples. Melis et al. [89] introduced a privacy attack that exploits the updates sent by the clients to infer the membership and properties of data samples. In [161], the authors demonstrated that the proposed attack of [89] only succeeds on simple images with clean background from the MNIST dataset. However, the attack's accuracy degrades notably on the LFW dataset and fails on CIFAR.
In the same context of privacy attacks, Zhu et al. [161] proposed the model inversion attack DLG, which reconstructs the training data and labels from gradients. Their experiments showed that DLG can quickly (within just 50 iterations) reconstruct images from MNIST. However, it requires more computational power (around 500 iterations) to succeed against more complex datasets such as CIFAR and LFW. Recently, Wainakh et al. [135] demonstrated that the accuracy of DLG in retrieving the labels degrades remarkably on CelebA, which has a bigger image size than the thumbnail datasets MNIST and CIFAR.
Recommendations. We acknowledge that it is challenging to find a single dataset that provides an adequate evaluation of the attacks; therefore, it is essential to evaluate the attack on diverse datasets with regard to image complexity and dimensions. We encourage researchers to also consider real-life datasets, which pose realistic challenges for the models and attacks, e.g., ImageNet [30] (image classification and localization), Fer2013 [44] (facial recognition), and HAM10000 [22] (diagnosing skin cancers).
Description. In FL, data is distributed among the clients; each client typically generates their data by using their own device, therefore, this data has individual characteristics [87]. The datasets used for evaluating the attacks should exhibit this property, i.e., they should be generated in a distributed fashion. However, user-partitioned datasets were used in only 4 (8%) of the papers, in particular EMNIST [23], which is collected from 3383 users and is thus appropriate for the FL setting [126]. In the majority of papers, 44 (92%), researchers used pre-existing datasets that are designed for centralized machine learning [81] and are thus unrealistic for FL [16]. These datasets are then artificially partitioned to simulate the distributed data in FL. One additional issue with these datasets is that they are balanced by default, while FL assumes the client data to be unbalanced [87].
Implications. The poisoning attacks proposed in [9] and [14] were evaluated on centralized datasets, such as Fashion-MNIST and CIFAR, for image classification, where the attacks were reported achieving 100% accuracy in the backdoor task. However, by using EMNIST as a standard FL dataset, Sun et al. [126] illustrated the limitations of the previous attacks. More precisely, they showed that the performance of the attacks mainly depends on the ratio of adversaries in the population. Moreover, the attacks can be easily mitigated with norm clipping and "weak" differential privacy.
Although this fallacy was discussed in previous works [16], [81], its implications on the evaluation results need to be investigated further and demonstrated with empirical evidence.
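The two mitigations mentioned above, norm clipping and the addition of a small amount of noise, can be sketched as follows. The clipping bound `max_norm`, the noise level `stddev`, and the toy updates are hypothetical values; the noise is not calibrated to any formal differential privacy guarantee, and the sketch is not the implementation used in [126].

```python
import math
import random


def l2_norm(vector):
    """Euclidean norm of an update vector."""
    return math.sqrt(sum(v * v for v in vector))


def clip_update(update, max_norm):
    """Scale the update down so that its L2 norm is at most max_norm."""
    norm = l2_norm(update)
    if norm <= max_norm:
        return list(update)
    scale = max_norm / norm
    return [v * scale for v in update]


def clipped_noisy_average(updates, max_norm, stddev, seed=0):
    """Average the clipped updates, then add Gaussian noise on the server side."""
    rng = random.Random(seed)
    clipped = [clip_update(update, max_norm) for update in updates]
    mean = [sum(column) / len(column) for column in zip(*clipped)]
    return [v + rng.gauss(0.0, stddev) for v in mean]


# Toy example: two benign updates and one boosted (scaled-up) malicious update.
updates = [[0.3, 0.4], [0.0, 0.5], [30.0, 40.0]]
print("without clipping:", [sum(c) / len(c) for c in zip(*updates)])
print("clipped + noise: ", clipped_noisy_average(updates, max_norm=1.0, stddev=0.01))
```

Clipping bounds the contribution of any single client, which limits attacks that rely on scaling up the malicious update, while the added noise further dilutes small targeted changes. Both parameters trade model accuracy against protection and need to be tuned per application.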
Recommendations. FL-specific datasets should be used for adequate evaluation of the attacks. Researchers have recently been devoting more efforts to curating such datasets. The LEAF framework [16] provides five user-partitioned datasets of images and text, namely FEMNIST, Sent140, Shakespeare, CelebA, and Reddit. Furthermore, Luo et al. [81] created a street dataset of high-quality images, which is also distributed by nature for FL.
Description. We observe a major focus on attacking NN models in federated settings. These models can have a variety of architectures as discussed in Section 2. The complexity of these architectures varies with respect to the number of layers (depth), the number of neurons in each layer (width), and the type of connections between neurons. Our study shows that researchers tend to use simple architectures to evaluate their attacks in 30 (62%) papers, e.g., 1-layer CNN [33] or 1-layer MLP [12]. Only in 18 (38%) papers, the authors considered complex state-of-the-art CNN models, such as VGG [120], ResNet [52], and DenseNet [57], the winners of the famous ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [110].
Implications. It is reasonable to start evaluating novel attacks on simple models to facilitate the analysis of the initial results. However, this is insufficient to draw conclusions on the risk posed by these attacks to real-life FL-based applications for two reasons. First, modern computer vision applications, e.g., biometrics, use advanced models, mostly with sophisticated architectures, to solve increasingly complex learning objectives [62]. Second, in deployed systems, an ML model typically interacts with other components, including other models. This interaction can be of extreme complexity, which might introduce additional challenges for adversaries [34]. For instance, in the Gboard app [48], as a user starts typing a search query, a baseline model determines possible search suggestions. Yang et al. [150] utilized FL to train an additional model that filters these suggestions in a subsequent step to improve their quality.
Several model inversion attacks reconstruct the training data by exploiting the shared gradients [33], [136], [141]. In particular, they exploit mathematical properties of gradients in specific model architectures to infer information about the input data. For example, Enthoven et al. [33] illustrate that neurons in fully connected layers can reconstruct the activation of the previous layer. This observation is employed to disclose the input data in fully connected models with high accuracy. However, the same attack achieves considerably less success when the model contains some convolutional layers.
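The property of fully connected layers mentioned above can be illustrated with a small example. For a dense layer z = W x + b, the gradient with respect to W[i][j] equals (dL/dz_i) · x_j and the gradient with respect to b[i] equals dL/dz_i, so for a single sample the input follows from x_j = dW[i][j] / db[i] for any row with a non-zero bias gradient. The sketch below demonstrates this identity under the assumption of a bias term and averaged batch gradients, with hypothetical toy values; it is an illustration and not the implementation of [33].

```python
def dense_layer_gradients(batch_inputs, batch_output_grads):
    """Batch-averaged gradients of z = W x + b, given dL/dz for each sample."""
    batch = len(batch_inputs)
    rows, cols = len(batch_output_grads[0]), len(batch_inputs[0])
    grad_w = [[0.0] * cols for _ in range(rows)]
    grad_b = [0.0] * rows
    for x, delta in zip(batch_inputs, batch_output_grads):
        for i in range(rows):
            grad_b[i] += delta[i] / batch
            for j in range(cols):
                grad_w[i][j] += delta[i] * x[j] / batch
    return grad_w, grad_b


def reconstruct_input(grad_w, grad_b, eps=1e-12):
    """Divide one weight-gradient row by its (non-zero) bias gradient."""
    for row, bias_grad in zip(grad_w, grad_b):
        if abs(bias_grad) > eps:
            return [g / bias_grad for g in row]
    raise ValueError("all bias gradients are zero")


# Batch size 1: the layer input is recovered (up to floating-point error).
x1 = [0.5, -1.0, 2.0]
grad_w, grad_b = dense_layer_gradients([x1], [[0.2, -0.4]])
print("single sample:", reconstruct_input(grad_w, grad_b))

# Batch size 2: the same computation yields a mixture of both inputs.
x2 = [1.0, 1.0, 0.0]
grad_w, grad_b = dense_layer_gradients([x1, x2], [[0.2, -0.4], [0.2, 0.1]])
print("two samples:  ", reconstruct_input(grad_w, grad_b))
```

The identity relies on each weight being tied to exactly one input coordinate. Convolutional layers share weights across positions, so the gradients sum over many input patches, which is consistent with the reduced success reported for models with convolutional layers.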
The NN capacity (i.e., number of neurons) also influences the performance of some attacks, in particular backdoors. It is conjectured that backdoors exploit the spare capacity in NNs to inject a sub-task [75]. Thus, larger networks might be more prone to these attacks. However, this interesting factor still needs to be well investigated [126]. In this regard, it is worth mentioning that increasing the capacity, e.g., for CNNs, is a common practice to increase the model accuracy. However, recent approaches such as EfficientNet [127] call for scaling up the networks more efficiently, achieving better accuracy with smaller networks. This development in the CNNs should be also considered in the evaluation of the attacks.
Recommendations. We highly encourage researchers to consider the state-of-the-art model architectures that are widely used in the application in which they apply their attack. In addition, it would be insightful for a more realistic security assessment to consider evaluating the proposed attacks on deployed systems that contain multiple components.
Description. FL can be applied in cross-silo or cross-device settings. In the cross-silo setting, clients are organizations or datacenters (typically 2-100 clients), whereas in the cross-device scenario, clients are a very large number of mobile or IoT devices (up to $10^{10}$) [61]. For instance, in applied use cases of FL, Hard et al. [48] reported using 1.5 million clients to train the Coupled Input and Forget Gate language model [45]. Yang et al. [150] trained a logistic regression model (for the Gboard application) for 4000 training rounds, where they employed 100 clients in each round.
Although many of the studied papers do not explicitly use the term "cross-device" to describe their scenario, they refer mainly to clients as individual users who have personal data. However, 27 (56%) of the papers provide an evaluation with a total population of ≤ 100 clients. Moreover, 13 (27%) of the papers did not report at all the client population in their experiments.
Figure 12. Bubble chart showing the distribution of papers across two dimensions: attack purpose and countermeasures. Perturbation and cryptography-based countermeasures are mainly used for privacy attacks, while sanitization is used for poisoning attacks.
Implications. The total number of clients and the number of clients participating per round in FL determine the influence of a single client on the global model. For privacy attacks, a small population means that each client contributes considerably to shaping the model parameters; thus, the parameters more prominently reflect that client's personal data. Shen et al. [116] demonstrated that increasing the client population led to a decrease in the accuracy of their property inference attack. For poisoning attacks, using a small number of clients amplifies the impact of the poison injected by malicious ones. This was shown in the experiments of [14], where the accuracy of the backdoor task degraded with bigger client populations.
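As an illustration of how a single client's influence shrinks with the population size, the following minimal sketch aggregates one round of updates by plain averaging. It assumes unweighted federated averaging, full participation, and one-dimensional updates; the function names are hypothetical and are not taken from any of the surveyed papers.

```python
def fedavg(updates, weights=None):
    """Weighted average of client updates, each given as a list of floats."""
    if weights is None:
        # Assumption: all clients hold equally sized local datasets.
        weights = [1.0] * len(updates)
    total = sum(weights)
    dim = len(updates[0])
    return [
        sum(w * u[i] for w, u in zip(weights, updates)) / total
        for i in range(dim)
    ]


def shift_from_one_client(num_clients, honest_value=0.0, deviating_value=1.0):
    """Shift of the aggregate caused by a single deviating client."""
    updates = [[honest_value] for _ in range(num_clients - 1)]
    updates.append([deviating_value])
    return fedavg(updates)[0] - honest_value


for k in (10, 100, 1000):
    # Under these assumptions the shift is proportional to 1 / k.
    print(k, shift_from_one_client(k))
```

In this toy setting, the deviating client moves the aggregate by a fraction 1/k of its deviation, so an effect measured with 10 clients need not carry over to thousands of clients.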
Recommendations. We recommend that researchers evaluate novel attacks with a large number of clients. For that, it is helpful to use the datasets provided by LEAF [16], which contain more than 1000 clients. If a large-scale evaluation is not feasible, researchers are encouraged to at least discuss the potential implications of different client populations for their attacks.
Description. An attack becomes ineffective if it requires the adversary to make a disproportionately large effort to overcome a small defense mechanism [34]. Proposed attacks need to be evaluated in this respect against state-of-the-art defenses. However, we showed in Section 4.3, Fig. 9, that 21 (48%) of the proposed attacks were not evaluated against any defense mechanism. In most of these papers, the authors only discussed, on a theoretical level, potential countermeasures to mitigate their attacks.
Implications. This fallacy leaves the evaluation of the attacks incomplete, and their applicability under real-world scenarios, where defense mechanisms are typically deployed, becomes questionable. However, it is important here to distinguish between the different categories of defense mechanisms. On the one hand, cryptography-based defenses typically provide formally proven properties; thus, in some cases, their impact on the attacks can be sufficiently discussed without empirical evidence. Yet, in these cases, their efficiency remains an open question. On the other hand, the impact of the other defense categories, namely perturbation and sanitization, requires experimental analysis, as these defenses usually introduce a loss in model accuracy and thus need to be tuned to reach the desired balance between accuracy and privacy. In Fig. 12, we see that most of the defenses implemented in the literature are from these two categories. We also see that perturbation, which reduces the information leakage about individuals, is mainly used against privacy attacks, whereas sanitization, which mitigates the impact of malicious updates from adversaries, is used against poisoning attacks.
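The two empirically evaluated defense categories can be illustrated with a minimal sketch. It assumes a clip-and-add-Gaussian-noise perturbation of client updates and, as a stand-in for sanitization, a coordinate-wise median aggregator. Both are generic examples chosen for brevity; they are neither the API of a particular library nor the specific defenses used in the surveyed papers, and all names and parameter values are hypothetical.

```python
import math
import random
import statistics


def clip_and_noise(update, clip_norm, noise_multiplier, rng):
    """Clip an update to L2 norm `clip_norm`, then add Gaussian noise
    with standard deviation noise_multiplier * clip_norm per coordinate."""
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier * clip_norm
    return [x * scale + rng.gauss(0.0, sigma) for x in update]


def mean(updates):
    """Plain coordinate-wise average (no sanitization)."""
    return [sum(col) / len(col) for col in zip(*updates)]


def coordinate_median(updates):
    """Coordinate-wise median, used here as a simple robust aggregator."""
    return [statistics.median(col) for col in zip(*updates)]


rng = random.Random(0)  # fixed seed so that runs are repeatable
honest = [[1.0, 1.0] for _ in range(9)]
poisoned = honest + [[100.0, 100.0]]  # one malicious, oversized update

# Sanitization: the median ignores the outlier that drags the mean away.
print("mean:  ", mean(poisoned))
print("median:", coordinate_median(poisoned))

# Perturbation: more noise moves the aggregate further from the clipped mean,
# which is the accuracy cost that has to be balanced against privacy.
for noise_multiplier in (0.0, 0.5, 2.0):
    noisy = [clip_and_noise(u, 1.0, noise_multiplier, rng) for u in honest]
    print(noise_multiplier, mean(noisy))
```

The noise multiplier and the clipping norm are exactly the kind of parameters that must be tuned per task, which is why the effect of such defenses on an attack cannot be settled by discussion alone.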
Recommendations. We highly recommend evaluating novel attacks against the appropriate state-of-the-art defenses. For implementing perturbation approaches, emerging libraries such as Opacus¹ and TensorFlow Privacy² can be used.
Description. The majority (97%) of the proposed attacks are validated through empirical experiments. For other researchers to reproduce the results of these experiments accurately, several practices need to be followed. In our analysis, we take into account three main practices: (1) using publicly available datasets, (2) reporting technical details about the implementation, and (3) publishing the source code. Our study shows in Section 4.3 that public datasets were used in all the examined papers, which is a good practice. However, 23 (48%) of the papers did not report any details about the technologies used in the implementation. Furthermore, the authors of 41 (85%) of the papers did not publish their source code.
Implications. Dacrema et al. [25] reported that reproducibility is one of the main factors to assure progress for research, especially with approaches based on deep learning algorithms. To conduct a proper assessment of a novel attack, researchers usually compare it with previous attacks as baselines. Evaluating the different attacks under different settings and assumptions hinders this direct comparison. That is, researchers have to re-implement the respective attacks to reproduce their results under different settings. This becomes even more challenging when the authors do not describe their experiment setups and parameters to the extent of full reproducibility.
Recommendations. We encourage all researchers to share their source code and detailed descriptions of their setups. We also recommend using libraries and benchmark frameworks that support FL, namely TensorFlow Federated [99], PySyft [111], LEAF [16], FATE [149], and FedML [51]. This in turn will help researchers to implement their ideas more easily and improve the consistency of implementations and experiment settings across different papers.
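One lightweight way to make a setup description concrete is to store it in machine-readable form next to the published code. The sketch below is only an illustration using the Python standard library; the helper name, the field names, and all values are hypothetical placeholders rather than settings from any surveyed paper.

```python
import json
import platform
import sys


def describe_setup(**hyperparameters):
    """Collect the software environment and experiment parameters in one record."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "hyperparameters": dict(sorted(hyperparameters.items())),
    }


# Hypothetical values, shown only to indicate which details are worth reporting.
setup = describe_setup(
    total_clients=1000,
    clients_per_round=100,
    rounds=4000,
    local_epochs=1,
    learning_rate=0.01,
    data_distribution="non-IID",
    seed=0,
)

print(json.dumps(setup, indent=2, sort_keys=True))
```

A record of this kind covers the client population, the participation per round, and the data distribution, which are the details that our analysis found to be missing most often.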

Conclusion
In this paper, we carried out a systematic mapping study based on recent publications that address attacks in the federated learning setting. For that, we analyzed 48 relevant papers published between 2016 and the third quarter of 2021. We structured these papers in classification schemes regarding attacks and evaluation settings.
1. https://github.com/pytorch/opacus
2. https://github.com/tensorflow/privacy
Our analysis indicated the prevalence of works focusing on the classification function and on neural network models, CNN models in particular, which hardly reflects the diversity of ML algorithms. We additionally examined the assumptions of the proposed attacks to identify those with restricted applicability in real-world scenarios. These assumptions range from choosing unorthodox values of hyper-parameters to constructing special kinds of data distribution among clients. We further identified six fallacies in the evaluation of the attacks, which affect the generalizability of the results and lead to overestimating the effectiveness of the attacks. For instance, overly simple or centralized datasets were used in the majority of the publications. Moreover, close to half of the attacks were proposed without considering the state-of-the-art defense mechanisms. Notably, there is ambiguity regarding reproducible research. As a constructive step, we presented several actionable recommendations to mitigate these fallacies by using modern models as well as federated learning-specific datasets and frameworks.
Overall, our study revealed that each of the examined papers contains at least one of the special assumptions or is affected by one of the evaluation fallacies. Thus, the effectiveness of the attacks in real-world scenarios needs to be further investigated and supported by empirical evidence. However, we do not downplay the vulnerabilities and threats in federated learning discussed in the literature. Instead, our findings contribute to a more informed assessment of the severity of these vulnerabilities. We hope that our analysis will raise awareness of the common issues in the literature, as well as help researchers in orienting their future research by better understanding the current research progress in the domain of security and privacy of federated learning.