The Partnership of Citizen Science and Machine Learning: Benefits, Risks, and Future Challenges for Engagement, Data Collection, and Data Quality

Abstract: Advances in artificial intelligence (AI), the extension of citizen science to various scientific areas, and the generation of big citizen science data are making AI and citizen science good partners, and their combination benefits both fields. The integration of AI and citizen science has mostly been used in biodiversity projects, with the primary focus on using citizen science data to train machine learning (ML) algorithms for automatic species identification. In this article, we look at how ML techniques can be used in citizen science and how they can influence volunteer engagement, data collection, and data validation. We reviewed several use cases from various domains and categorized them according to the ML technique used and the impact of ML on citizen science in each project. Furthermore, the benefits and risks of integrating ML in citizen science are explored, and some recommendations are provided on how to enhance the benefits while mitigating the risks of this integration. Finally, because this integration is still in its early phases, we propose some potential ideas and challenges for future work that could leverage the power of combining citizen science and AI, with the key emphasis in this article being on citizen science.


Introduction
The simulation of human intelligence in machines, known as artificial intelligence (AI), is widely applied in various domains, and the number of scientific publications in this area is increasing significantly [1]. AI is a term used when machines can perform tasks that simulate the human mind, such as learning, reasoning, and solving problems [2]. Machine learning (ML) is a subfield of AI, defined as the study of computer algorithms that use data to learn patterns, make predictions, and improve their performance over time as more data become available [3]. The majority of ML algorithms require large amounts of labeled data, and this is resulting in a close partnership of ML with citizen science projects [4,5]. Citizen science, that is, public participation in scientific research, has grown significantly in recent years as a result of technological advancements such as new smartphone features and fast Internet access in most parts of the world [6]. This growth in citizen science has resulted in large dataset collections in a variety of scientific domains [7], which can be a valuable input source for ML algorithms.
Although the combination of ML and citizen science is not new [8], until recently, these two fields have mostly been implemented separately [9]. The integration of ML and citizen science can produce a new learning paradigm for citizen scientists through human-computer interactions [10]. Moreover, it can increase interdisciplinary collaborations among researchers as well as members of the public in various fields such as computer science, ecology, astronomy, and medicine, to name a few [9]. This integration has focused primarily on object detection in images and videos, mainly for automatic species identification in biodiversity projects [11,12]. A well-known example is the iNaturalist project [13], which has included automated species identification suggestions since 2017 using images obtained from observers. The automatic identification has improved over the years as more images are used to train the model; at the time of writing this article, the latest model had been released in March 2020 [14]. The automatic species identification in iNaturalist has given citizen scientists the opportunity to learn about species and has helped to minimize the contribution of erroneous observations [15].
The objective of combining citizen science and ML is not limited to providing data for the ML algorithms and automating the identification tasks. The aim is to combine human and machine intelligence to bring new adjustments to citizen science tasks, such as automated data collection, processing, and validation, as well as to increase public engagement. There are potential challenges and opportunities in the integration of ML and citizen science, which are essential to discuss. In this article, we address the following research questions:
• What are some examples of successful citizen science projects where ML is integrated?
• What ML techniques have been used in these projects?
• What citizen science tasks have been affected by ML in such projects?
• What are the benefits and risks of integrating ML in citizen science for practitioners and citizen scientists?
• What are the possible future challenges that might arise as a result of the combination of ML and citizen science?
• What are the gaps and limitations of including ML in citizen science?
To answer these research questions, we explore use cases where ML and citizen science can be combined. We have reviewed successful citizen science projects, highlighting the typologies of techniques used in such projects and categorizing them in light of the effect of ML on citizen science tasks. Although the opportunities and challenges of merging ML and citizen science have been addressed in a few recent articles [8][9][10], the main emphasis has been on the transparency of using ML in citizen science in terms of how the ML algorithms use citizen science data [10], the effects of AI on human behavior and improving insights in citizen science [8], and the effects of this combination in ecological monitoring in terms of having cheaper or more efficient ways for data collection and data processing [9]. While these are key issues to explore, to the best of our knowledge, the integration of ML and citizen science has received less attention in terms of how it can affect the usual processes in a citizen science project, from volunteer involvement to the quality of volunteers' contributions. Our primary objective is to explore how some citizen science tasks can be automated using ML and whether this automation is beneficial or detrimental for the project and its participants. Rather than being overly broad, we break down the forms of ML integration by citizen science step and discuss the benefits and risks of the integration in each step. We outline how ML can be integrated in each step, including what has already been applied, what can be applied in the future, and what the current and potential risks and benefits of this integration are for each step.
The article is organized as follows. In Section 2, we go through the ML paradigm, as well as the most popular ML applications, in greater detail. In Section 3, we explore the potential impacts of ML on citizen science projects, and in Section 4, we review successful use cases where ML and citizen science are combined. In Section 5, we discuss the benefits and risks of integrating citizen science and ML. Finally, in Section 6, we present the conclusions, with an emphasis on possible future transitions in citizen science projects in the age of AI.

Types of Machine Learning and Applications
As stated in the introduction, ML is a subset of AI, which was first introduced in 1955 by Arthur Samuel when he applied learning to his draughts (checkers) algorithm [2]. Samuel defined ML as a "field of study that gives computers the ability to learn without being explicitly programmed" [16]. ML algorithms build models that learn from input data (known as training data) and are able to make predictions based on the learnt experience. There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning [16].

• Supervised learning: In supervised learning, the training data are labeled, and the task is to map the input (independent variables) to the output (dependent variables). The two typical types of supervised learning are classification, where the output variable is categorical, and regression, where the output variable is continuous [16]. The most widely known supervised learning algorithms are k-nearest neighbors (KNN), linear regression, logistic regression, support vector machines (SVMs), decision trees, random forest (RF), and neural networks (NN).
• Unsupervised learning: In unsupervised learning, the training data are not labeled, and the goal is to identify structures and patterns in the data [16]. The typical types of unsupervised learning include clustering (grouping similar input data), dimension reduction (extracting meaningful features from the data), and association (exploring the data to discover relationships between attributes) [16]. Some of the best-known unsupervised learning algorithms are k-means, one-class SVM, hierarchical cluster analysis (HCA), and principal component analysis (PCA).
• Reinforcement learning: In reinforcement learning, the learning algorithm, also called the agent, observes the environment and learns through a system of rewards and punishments. Reinforcement learning is commonly used in robotics, such as walking robots and self-driving vehicles, as well as in real-time decision making and game AI [16].
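To make the distinction between the first two learning types concrete, the following is a minimal sketch in plain Python. It contrasts a supervised k-nearest neighbors classifier, which needs labels, with an unsupervised k-means clustering, which does not. The toy measurements, the label names, and the function names `knn_predict` and `kmeans_assign` are illustrative assumptions and do not come from any of the reviewed projects.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Supervised: return the majority label among the k nearest labeled points."""
    nearest = sorted(range(len(train_points)),
                     key=lambda i: math.dist(train_points[i], query))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

def kmeans_assign(points, k=2, iterations=10):
    """Unsupervised: assign each unlabeled point to one of k clusters."""
    centroids = list(points[:k])  # naive initialization: the first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Step 1: attach every point to its nearest centroid.
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        # Step 2: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

# Hypothetical toy data: (wing length in mm, body mass in g).
points = [(60.0, 10.0), (62.0, 11.0), (61.0, 10.5),
          (90.0, 30.0), (92.0, 31.0), (91.0, 29.0)]
labels = ["small", "small", "small", "large", "large", "large"]

print(knn_predict(points, labels, (63.0, 12.0)))  # uses the labels
print(kmeans_assign(points, k=2))                 # ignores the labels
```

The classifier can only name a class because volunteers or experts supplied labels beforehand, whereas the clustering step merely reports which points belong together and leaves the interpretation of each group to a human.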

Deep learning, a subset of ML (see Figure 1 for the relationship between AI, ML, and deep learning), is concerned with algorithms known as artificial neural networks that attempt to simulate the structure and functions of a biological brain [17]. Since there is a significant body of literature on AI and ML algorithms, we only briefly discuss some of the common AI, ML, and deep learning techniques widely applied in scientific projects:
• Computer vision (CV): CV is an interdisciplinary scientific field that aims at developing techniques so that computers can identify and understand the contents of digital images and videos. In other words, CV aims at enabling computers to identify elements in images as humans would. The advances in artificial neural networks and deep learning have had a great impact on CV, which in some cases outperforms humans at identifying objects [18]. Some popular applications of CV include self-driving cars and face recognition [16]. Moreover, starting in 2020 with the COVID-19 pandemic, CV has been applied in monitoring and detecting social distancing among people [19]. CV has also been commonly used in species identification, with Pl@ntNet [20] and iNaturalist being two well-known citizen science examples. A class of deep learning commonly used in CV is the convolutional neural network (CNN).
• Natural language processing (NLP): NLP is a subfield of linguistics, computer science, and AI that deals with human-computer interactions through the use of natural language, which means that NLP aims to enable computers to read and understand human language [21]. The mechanism involves the machine capturing the human's words (text or audio), processing the words and preparing a response, and returning the produced response (in the form of audio or text) to the human. Language translation applications such as Google Translate or DeepL [22], as well as personal assistant applications (e.g., Siri or Alexa), are common uses of NLP in people's daily lives.
• Acoustic identification: Acoustic identification is a technique based on pattern recognition and signal analysis, where the acoustic data are processed and features are extracted and classified. The main applications of acoustic identification are in species detection [23]. For example, BirdNET [24] is an application that identifies bird species based on bird song.
• Automated reasoning: Automated reasoning is a branch of AI that seeks to train machines to solve problems using logical reasoning [25]. In other words, in automated reasoning, the computer is given knowledge and can generate new knowledge from it, which it then uses to make rational decisions. Automated reasoning is mainly used to assess whether something is true or false or whether an event will occur or not.

The Influence of ML on Citizen Science Steps
When it comes to the combination of ML and citizen science, the role of citizen science as a possible solution to the problem of a lack of training data in ML algorithms [7,26] has been discussed more intensively than the role of ML in addressing challenges in citizen science projects. Ceccaroni et al. [8] explored the AI technologies used in citizen science projects and the opportunities and risks that are expected to be encountered due to the increase in the use of AI in citizen science. The authors define three categories for the use of AI in citizen science including "assisting or replacing humans for completing tasks", "influencing human behavior", and "improving insights". The first category describes the role of AI in fully or partially automating tasks that were previously performed by humans: for example, tasks related to automatically detecting and classifying data, such as classifying species based on images or sounds [27][28][29]. The second category discusses the aim of AI, data science, and citizen science to influence human behavior [30] and to extend the educational and social benefits of citizen science to the general public [31]. The third category discusses the impact of AI on identifying patterns in citizen science data for informing research and policies or on facilitating the understanding of citizen science concepts using ontologies. Another study by McClure et al. [9] discusses the integration of AI and citizen science in ecological monitoring. Rather than delving into the details of how AI and citizen science can be combined, the authors addressed the challenges and opportunities of performing ecological monitoring using only citizen science, only AI, or a combination of the two. The opportunities and challenges are discussed in the context of six categories, including efficiency, accuracy, discovery, engagement, resources, and ethics. 
Efficiency refers to the benefits that citizen science and ML can provide for scientific projects, such as facilitating data collection and automating laborious tasks, as well as the ability to perform extensive data processing when human and machine power are combined. Accuracy refers to the possibility of integrating human and machine intelligence to produce high-quality data or the challenge of providing incorrect and misleading information. Discovery explores the advantages of complex species identification and serendipitous discoveries made through the partnership of citizen scientists and deep learning. Engagement explores the impact of citizen science and AI on multidisciplinary engagement. Resources highlights the role of citizen scientists and machines in saving human and financial resources by, for example, freely contributing data and automating complex tasks, but it also covers the challenges of training citizen scientists, large data requirements, and the need for ML experts. Ethics highlights the challenges of potential information misuse when integrating AI. Another recent study by Franzen et al. [10] also discusses the opportunities and challenges of human-computer interaction in citizen science with a focus on the concept of transparency when integrating ML in citizen science projects, which means that information about data use, ML algorithms, and data processing must be transparent and communicated to participants.
In this article, we look at the impact of ML and citizen science integration on citizen science steps, but first, it is important to understand the different types of citizen science projects, as well as the main steps and tasks in a project. Bonney et al. [32] described three types of citizen science projects: contributory projects, in which scientists design the project and members of the public contribute primarily to data collection; collaborative projects, in which scientists design the project and members of the public contribute not only to data collection but also to data analysis and/or interpretation of the findings; and co-created projects, in which the project is designed in collaboration between scientists and members of the public, and some members of the public are involved in most, if not all, of the project steps. Citizen science projects comprise five key steps, with participants engaging in all or some of the steps depending on the project type. The following are the primary steps for each citizen science project [33,34]:
• Defining the problem: Exploring the problem that needs to be solved by answering questions, such as why this issue is important, who the stakeholders are, and what will be achieved.
• Designing the project: Identifying the objectives, allocating the necessary resources (funding, team members, equipment, etc.), and defining the project planning.
• Building a community: Encouraging the general public to participate in the project and sustaining their engagement by establishing a trusting relationship with the volunteers.
• Data collection, quality assurance, and analysis: Designing data collection tools, training volunteers, determining how to store data, filtering and cleaning collected data, analyzing data to detect trends, and sharing data with participants or other practitioners.
• Sustaining and improving the project: Maintaining project funding by searching for different sources of funding, and sustaining participation by communicating with volunteers and receiving/giving feedback from/to them.
Thus, our goal is to expand the existing literature on the integration of citizen science and ML by focusing not only on the scientific outcomes of citizen science projects, but also on the participants, who are at the heart of the projects. We therefore address the integration of ML into various components of a citizen science project, and focus on the impacts of ML on three categories: engaging people and sustaining their participation, data collection, and data validation (Figure 2).

ML for Engaging the Public and Sustaining Participation
A key aspect of a successful citizen science project is to understand how to motivate the public to participate in a project and how to sustain their participation [35]. Depending on the objectives and designs of the citizen science project, various approaches have been used to engage people [36]. We discuss two potential approaches to using ML for engaging participants and sustaining participation:
• Automatic community search: The traditional approaches such as word-of-mouth, social media posts, direct emails, workshops, etc., while beneficial for building a community, can be time-consuming or require financial resources (for instance, for the organization of workshops or ads in newsletters). Antoniou et al. [37] have proposed a guidance tool to provide information to volunteers so that they can find the VGI (volunteered geographic information) project of their choice based on their motivations and interests. To automate what they have proposed, ML algorithms can be used to find and classify the potential target participants based on their interests and to introduce a project to them accordingly. Several studies have applied ML algorithms to extract relevant information from social media (e.g., Twitter or Instagram) posts, such as where the images were taken, what type of content is contained in the image, or what topic is mostly discussed in the textual posts [38,39]. Similar approaches can be adapted to citizen science projects by employing ML techniques such as CV and NLP to identify people's interests from social media posts and to link them to the relevant citizen science project. Furthermore, to the best of our knowledge, the use of ML in user profiling to create a recommendation system [40,41], where citizen science projects are recommended to people based on their sociodemographic details, has not yet been used as a way to engage people to contribute to citizen science projects. Moreover, the use of chatbots in citizen science projects can be a potential approach to engaging participants and sustaining participation, and it has been applied in a few studies [42,43]. Chatbots may also serve as a real-time guide for participants.
• Automatic feedback to participants: As discussed in some studies, participants may become discouraged if they do not receive feedback on their contributions [44,45]. Moreover, due to massive amounts of data, it is time-consuming to provide feedback to all participants, and feedback from experts is often provided only after a long time has passed [45,46]. In order to inform participants about the quality of their contributions and to update them on the project's progress, automatic, informative, and user-based feedback can be generated using ML algorithms [47]. The participants can be informed about the quality of their contribution and how they can enhance it, and they can learn from the feedback provided (e.g., learning about biodiversity through feedback regarding species habitat characteristics). Thus, human-computer interaction through machine-generated feedback can be a strategy for increasing and sustaining participation in citizen science projects.
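The idea of linking people's stated interests to relevant projects can be sketched as a very simple content-based recommender. The sketch below, written in plain Python, compares a bag-of-words profile of a person's text with project descriptions using cosine similarity. It is only a stand-in for the NLP techniques mentioned above: the project names, descriptions, stop-word list, and function names are all hypothetical, and a real system would need consent, privacy safeguards, and a far richer language model.

```python
import math
from collections import Counter

# Tiny, hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"i", "in", "my", "and", "a", "the", "with", "your", "of", "to"}

def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def recommend(interest_text, projects, top_n=1):
    """Return up to top_n project names whose description overlaps the interests."""
    profile = Counter(tokenize(interest_text))
    scored = [(cosine_similarity(profile, Counter(tokenize(description))), name)
              for name, description in projects.items()]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]

# Hypothetical project catalogue.
projects = {
    "garden bird survey": "Record birds and bird song in your garden",
    "galaxy classification": "Classify galaxies in telescope images",
    "air quality mapping": "Measure air pollution with a low-cost sensor",
}

print(recommend("I love photographing birds in my garden.", projects))
```

Returning an empty list when nothing overlaps is a deliberate choice in this sketch, so that people are not shown projects that are unrelated to what they wrote.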

ML for Data Collection
Data collection in citizen science projects can usually be categorized into two types. The first type is known as crowdsourcing [48], and it involves data collection that requires little or no cognitive engagement, such as collecting biodiversity data (e.g., photographs of species), recording noise [49] or air pollution levels, or taking part in volunteer computing projects, in which volunteers provide their computer's unused resources for scientists to perform heavy computations [50]. The second type is when human cognition is employed to collect information, which primarily consists of labeling and identifying objects in images; in more complicated projects, training prior to data contribution is required to complete tasks, such as identifying protein structures in the Foldit project [51] or georeferencing historical images in the sMapShot project [52]. By incorporating ML techniques into citizen science, the data collection task can be partially or fully automated. Considering the two key types of data collection, we define two possible approaches in which ML can be integrated in this step:

• Machines as sensors (adapted from citizens as sensors): The integration of ML in the first form of data collection, crowdsourcing, can be performed using AI-based tools, such as AI-based cameras. A well-known example in ecological studies is the use of camera traps to automatically capture images of species [53]. Moreover, sensors integrated with ML techniques can automatically record measurements such as noise recording [54] or air pollution [55].
• Machine thinking (adapted from volunteer thinking): For the second form of data collection, where cognition is involved, ML algorithms can learn to automate certain tasks, such as object detection in images/videos, which is the most common technique, or more complex tasks, such as automated prediction of protein structures using deep learning [56].

ML for Data Validation
Due to the large amounts of data being contributed to citizen science projects, manual expert validation can be very time-intensive. Thus, automatic or semiautomatic data validation can be applied by filtering potentially erroneous data, considering both the contributed information and the ability and experience of participants in contributing data. Two types of potential automatic validation approaches are the following:
• Automatic data quality assurance: The static comparison of the contributed data with reference datasets has been used in biodiversity citizen science projects to perform automated filtering of unusual observations [45]. However, rather than comparing the submitted data with historical records, ML algorithms could be used to perform real-time validation and confirmation of the newly contributed data. For example, species distribution models can be used to validate the spatial accuracy of biodiversity observations, or a CNN algorithm can be used to validate images labeled by the participants.
• Classification of participants' levels of expertise: The level of expertise and experience in contribution varies among participants in citizen science projects. For example, in biodiversity monitoring projects such as eBird [57] or iNaturalist, some participants contribute observations casually, while others are very involved and experienced and may even be considered expert volunteers who not only contribute data but also verify others' observations [58]. Thus, the contributors' previous records can be used in ML algorithms to classify the participants (e.g., by assigning them scores based on their level of expertise), and the newly contributed data can be validated based on the classification of the participants' levels of expertise.
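The second approach, scoring participants from their previous records and using the score to decide how new data are validated, can be sketched as follows. This is a deliberately simple stand-in for a learned classifier, assuming that each past record is stored as a boolean saying whether experts confirmed it. The smoothing constants, the 0.8 threshold, the observer names, and the function names are assumptions made for illustration and are not taken from eBird, iNaturalist, or any other project.

```python
def expertise_score(past_outcomes, prior_confirmed=1, prior_total=2):
    """Smoothed share of a participant's past records that experts confirmed.

    The prior (1 confirmed out of 2) is an assumption: it gives newcomers a
    neutral score of 0.5 instead of an undefined or extreme value.
    """
    confirmed = sum(1 for outcome in past_outcomes if outcome)
    return (confirmed + prior_confirmed) / (len(past_outcomes) + prior_total)

def triage(observer, histories, accept_above=0.8):
    """Route a new record based on the observer's score (hypothetical rule)."""
    score = expertise_score(histories.get(observer, []))
    return "auto_accept" if score >= accept_above else "expert_review"

# Hypothetical validation histories: True means the record was confirmed.
histories = {
    "experienced_observer": [True] * 18 + [False] * 2,
    "casual_observer": [True, False, True],
}

for name in ("experienced_observer", "casual_observer", "newcomer"):
    print(name, round(expertise_score(histories.get(name, [])), 3),
          triage(name, histories))
```

In practice a score like this would more plausibly be one input among several, combined with checks on the observation itself, since experienced participants also make mistakes and newcomers can report genuine rarities.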
Figure 2 illustrates a taxonomy of possible combinations of ML and citizen science, which is classified according to the citizen science steps, including the three discussed categories of engagement, data collection, and data validation. Some of these ML integrations have already been applied in current citizen science projects, such as the automatic species identification or the classification of observers' levels of expertise in eBird, which will be explored in greater detail in the section on use cases. Nevertheless, there are some other potential impacts that, to the best of our knowledge, are not being applied in present projects, notably in terms of the role of ML in engaging participants through user profiles and recommendation systems. The following section presents and categorizes the use cases, taking into account the potential impacts of ML on citizen science stated in this section.

Use Cases
In this section, we present some of the use cases in which ML and citizen science are combined, with the goal of developing a typology of such projects based on the AI and ML applications outlined in Section 2 and the impacts of ML on citizen science tasks outlined in Section 3. We begin by categorizing the use cases based on the field of science and then present the most commonly used approaches in each category. The categorization of the use cases is shown in Table 1.
Environmental science: The most common approach in environmental studies is training ML algorithms using the images/videos labeled by citizen scientists to automate species identification and/or classification. Some of the commonly applied methods are as follows:
• Camera trap projects: when it comes to the combination of ML and citizen science in biodiversity research, one of the most common approaches is the use of camera traps, where cameras are installed in nature to take photos of species, and the photos are then labeled by citizen scientists to feed and train ML algorithms [11,59]. Citizen scientists may, depending on the project, be involved in only one or in all of the activities of camera placement, submission of images, and labeling and classification of images/videos from camera traps [59]. MammalWeb [60], eMammal [61], and WildBook [62] are three examples of projects focused on camera trap data, and depending on the projects' goals, they invite volunteers to either collect or classify images (Table 1). The use of contributed images to train CNN algorithms for automatic wildlife identification can result in the implementation of software packages such as the R package MLWIC (Machine Learning for Wildlife Image Classification) [63], which can be useful for environmental studies, particularly for ecologists. Another approach to integrating human and machine intelligence in camera trap projects is to invite volunteers to observe species images and confirm machine-predicted labels in each image [11]. This approach helps to reduce the time required for labeling images while maintaining high-quality classification, and human intelligence is used for verification and for identifying more challenging species that are difficult for machines to classify.
• Species identification based on images and metadata: the majority of species identification projects use only images to train ML algorithms [64]. However, identifying some species from images alone, in the absence of other metadata, is very complex both for humans and machines, and only human experts are able to distinguish among such images. Including metadata such as the spatial and temporal distribution or the ability of observers to identify species can increase ML predictive performance and provide more confidence in species identification. One example is a study performed by Terry et al. [5] to identify ladybirds using both images and metadata such as location, date, and observer's expertise (Table 1). Another example is the eBird project [65], where a probabilistic model has been developed to classify observers as experts or novices, taking into account their experience in making contributions (Table 1). Another project, BeeWatch, invites citizen scientists to identify bumblebee species in images [66], and it employs natural language generation (NLG) to provide volunteers with real-time feedback (Table 1). Experiments conducted by the BeeWatch researchers with project participants revealed that the automatically generated feedback improved the participants' learning and increased their engagement [66].
• Marine life identification: unlike for other taxa, the combination of ML and citizen science for marine life identification has rarely been discussed [67]. In an article by Langenkämper et al. [67], the authors focused on combining ML and citizen science in the annotation of marine life images. Citizen scientists are requested to annotate the images (digitize a bounding box around the species in the image); however, volunteers may miss a species (false negative), annotate a species that is not present in the image (false positive), or place the bounding box incorrectly. Despite these possible annotation errors, the authors conclude that merging citizen science with ML in marine life studies has considerable promise, provided that citizen scientists receive sufficient training prior to image annotation (Table 1).
• Automatic wildlife counts from aerial images: estimating wildlife abundance is an important aspect of biodiversity conservation studies. One approach is to count the animals in aerial images. However, if done entirely manually, this is an extremely time-consuming and labor-intensive process. A study focused on counting wildebeest in aerial images [68] has shown promising results in obtaining accurate counts by combining citizen science and deep learning (Table 1). In this study, the counting is done by both citizen scientists and machines (a trained CNN algorithm), and while the results indicate that the machine is faster and more accurate than humans, the authors state that the citizen scientists' contributions are essential in providing training data to feed the algorithm.
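Two of the ideas above, using metadata alongside images and asking volunteers to confirm uncertain machine predictions, can be illustrated together with a small sketch. It multiplies image-based class probabilities by a metadata-based prior (for example, how likely each species is at that place and date) and renormalizes, then sends low-confidence cases to volunteers. This is a naive combination that assumes the image evidence and the metadata are independent; it is not the method of Terry et al. [5] or of any project in Table 1, and the species names, probabilities, threshold, and function names are hypothetical.

```python
def combine(image_probs, metadata_prior):
    """Multiply image-based probabilities by a metadata prior and renormalize."""
    unnormalized = {species: p * metadata_prior.get(species, 0.0)
                    for species, p in image_probs.items()}
    total = sum(unnormalized.values())
    if total == 0:
        # The prior excludes every candidate; fall back to the image scores.
        return dict(image_probs)
    return {species: value / total for species, value in unnormalized.items()}

def route(probabilities, threshold=0.9):
    """Accept confident predictions; send the rest to volunteers for checking."""
    species, prob = max(probabilities.items(), key=lambda item: item[1])
    decision = "auto_accept" if prob >= threshold else "ask_volunteer"
    return decision, species

# Hypothetical classifier output for one photo, ambiguous on its own.
image_probs = {"seven-spot ladybird": 0.55, "harlequin ladybird": 0.45}
# Hypothetical prior derived from the location and date of the record.
metadata_prior = {"seven-spot ladybird": 0.9, "harlequin ladybird": 0.1}

posterior = combine(image_probs, metadata_prior)
print(route(image_probs))  # image alone: below the threshold
print(route(posterior))    # image plus metadata: above the threshold
```

The threshold controls the trade-off described in the text: a higher value sends more images to volunteers and keeps quality high, while a lower value saves volunteer time at the cost of more unchecked machine labels.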
Neuroscience: similar to environmental studies and species identification tasks, citizen scientists' input can be very valuable in amplifying the gold standard data generated by neuroscience experts. In [26], an approach is proposed to amplify expert-labeled MRI (magnetic resonance imaging) images using citizen science and deep learning. This approach involves three main steps. First, the experts label a collection of MRI images. Second, to amplify the labels, a web application called Braindr is implemented that presents a 2D brain slice to citizen scientists, who are asked to pass or fail the image based on its quality (see Figure 3). Finally, in the third step, a deep learning algorithm is used to verify the quality of the citizen science labels compared to the expert-labeled MRI images. Once the high-quality data are available, they are used to train a CNN algorithm to automate the labeling of MRI images.
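The general idea of checking citizen labels against an expert gold standard before using them as training data can be sketched with a weighted vote. In the sketch below, each rater is weighted by how often they agree with the experts on gold-standard images, and the weighted share of "pass" votes gives a quality score for images the experts never saw. This is an illustrative simplification rather than the procedure used in [26]; the rater names, image identifiers, smoothing, and function names are assumptions.

```python
def rater_weights(ratings, gold):
    """Weight each rater by smoothed agreement with expert gold-standard labels."""
    weights = {}
    for rater, votes in ratings.items():
        scored = [(image, vote) for image, vote in votes.items() if image in gold]
        correct = sum(1 for image, vote in scored if vote == gold[image])
        # Add-one smoothing (an assumption) keeps weights strictly positive.
        weights[rater] = (correct + 1) / (len(scored) + 2)
    return weights

def weighted_pass_score(ratings, weights, image):
    """Weighted share of 'pass' votes for one image, or None if nobody rated it."""
    numerator = denominator = 0.0
    for rater, votes in ratings.items():
        if image in votes:
            numerator += weights[rater] * (1.0 if votes[image] else 0.0)
            denominator += weights[rater]
    return numerator / denominator if denominator else None

# Hypothetical data: True means "pass", False means "fail".
gold = {"img1": True, "img2": False}  # expert labels
ratings = {
    "ana": {"img1": True, "img2": False, "img3": True},
    "ben": {"img1": False, "img2": True, "img3": False},
    "cy": {"img1": True, "img2": False, "img3": True},
}

weights = rater_weights(ratings, gold)
print(weights)
print(weighted_pass_score(ratings, weights, "img3"))
```

Scores produced this way could then serve as training targets for an image classifier, which corresponds to the final step described above.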
Astronomy: the involvement of the general public in online astronomy projects started in 2008 with the first release of the Galaxy Zoo project [69]. Traditionally, the classification of galaxy images in Galaxy Zoo was done by citizen scientists, but with advances in ML, the classification task was automated using amateur and expert labels as training data [70]. The Milky Way project is another well-known project in this field, with the goal of involving volunteers in identifying bubbles in images collected from space telescopes [71]; to automate the identification, the volunteers' labels were then used to train a random forest algorithm called Brut [72]. The authors note that the combination of ML and citizen science in astrophysical image classification has opened a new path towards obtaining large-scale classified datasets, which would have been harder to achieve if each of these fields (citizen science and ML) had been applied separately.
Table 1 illustrates that the majority of projects that combine citizen science and ML are in environmental science, which is also true for citizen science projects in general, where biodiversity projects far outnumber projects in other domains [75]. Furthermore, the table shows that, regardless of the area of science, the integration of citizen science and ML primarily consists of using labeled data from citizen scientists to feed ML algorithms. Typically, trained models are used to automate data collection (mostly labeling and object detection tasks in online citizen science projects) and data validation (automatic filtering and flagging of erroneous contributions). In contrast, the use of ML in citizen science to increase and sustain participation has received far less attention, with the BeeWatch project being the only one (among the studied use cases) that has directly evaluated the effects of automatic feedback on engagement.
Furthermore, while in most projects the identification/labeling tasks can be completely automated once the model is trained, the majority of authors argue that the role of citizen scientists does not fade away and that human cognition can be used to perform more challenging tasks, such as verifying machine predictions or identifying rare species. Given these current projects and the prospect of further ML and citizen science integrations, the next section discusses the benefits and risks that may arise as a result of this combination.
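The division of labor described above, where machines handle routine cases and volunteers verify uncertain predictions or rare species, can be sketched as a simple routing rule. This is an illustrative assumption rather than the workflow of any reviewed project; the function `route_predictions`, the 0.9 threshold, and the record keys are hypothetical.

```python
def route_predictions(predictions, threshold=0.9, rare_labels=frozenset()):
    """Split model outputs into an auto-accepted list and a volunteer-review list.

    predictions: list of dicts with keys "id", "label", "confidence"
    threshold:   minimum confidence for accepting a prediction without review
    rare_labels: labels that always go to human review, whatever the confidence
    """
    auto, review = [], []
    for p in predictions:
        needs_human = p["confidence"] < threshold or p["label"] in rare_labels
        (review if needs_human else auto).append(p["id"])
    return auto, review
```

In such a design the threshold controls how much work remains with volunteers, so it can be tuned to keep meaningful tasks available to participants rather than only to minimize manual effort.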

Benefits and Risks
Although it has been argued that the combination of ML and citizen science offers more benefits than either implemented in isolation [9], several points need to be considered prior to integrating ML into citizen science. In this section, we discuss the benefits of combining citizen science and ML, as well as the potential risks that can arise if ML is not used cautiously in citizen science projects. The benefits and risks are discussed in the scope of engagement, data quality assurance, and ethics (see Figure 4). Data collection is not listed as a separate category because the impacts of ML on this step are covered by the categories of engagement and data quality.

Engagement
• Benefits: As mentioned earlier, one benefit of AI for community building in citizen science projects is encouraging engagement by targeting potential volunteers through social media. Another important factor in citizen science is interaction with participants and feedback on their contributions [76,77]. Thus, using ML to provide automated feedback to participants might promote engagement through human-computer interaction and help sustain participation. Furthermore, intelligently generated feedback can provide participants with useful knowledge about the research subject, allowing them to learn while contributing, which can be another factor in increasing participation (e.g., the BeeWatch project). Another potential benefit of combining ML and citizen science is that it encourages interdisciplinary engagement among volunteers and researchers, which can lead to collaborations across several scientific fields [9]. Finally, automating certain simple tasks allows volunteers to concentrate on more complicated ones; for example, a CNN can identify common species in camera trap images, leaving the identification of unusual species to volunteers. However, there is another side to task automation, which is discussed under risks.
• Risks: The use of ML in citizen science could result in the automation of most tasks, which may demotivate participants because they are fully or partially replaced by machines. As mentioned in the use cases, in most projects, citizen science data are used to train ML algorithms, after which the tasks can be performed entirely by machines, effectively replacing humans.
While it has been mentioned that, in the case of task automation, citizen scientists could concentrate on more challenging tasks, some participants contribute to citizen science projects to fill their spare time with activities that make them feel good, such as helping science or spending time in nature (see [36]), which are not inherently challenging. For example, in the sMapShot project [52] (a citizen science project for georeferencing historical images), there is strong competition among participants in higher age groups, and the incentive system plays an important role in motivating them; therefore, if the computer performs the task more efficiently, motivation is expected to drop, and participation will decline. One solution is to allow participants, whatever their activity level, to contribute to their task of interest even if the task can be fully automated; their contributions can then be used to retrain the algorithms and improve performance. Another recommendation is to incorporate new forms of contribution to fill the gap left by automated tasks. A further potential risk is the overestimation of the power of AI in citizen science projects, such as trusting model predictions over expert volunteers, which could disengage participants [8].

Data Quality
• Benefits: The use of ML in citizen science can speed up the validation of big data, reducing the workload of manual data quality assurance for experts [46,47]. Prescreening and filtering data (for example, removing empty or low-quality images in camera trap projects), flagging erroneous observations, and submitting only flagged observations for expert verification saves a lot of time and allows the experts to concentrate on the scientific aspects of the project rather than the manual filtering of all data. Furthermore, generating real-time, informative, and user-centered feedback for participants about their contributions can improve their knowledge of the subject, their proficiency, and, as a result, the quality of the data they contribute over time. Another finding from the BeeWatch project concerning the impact of feedback on volunteers was that NLG feedback resulted in increased learning: identification accuracy was higher for those who received informative feedback than for those who only received confirmation of correct identification [66].
• Risks: Although the benefits of automatic filtering and validation have been discussed, the efficiency and reliability of automated validation and feedback are highly dependent on the data used to train the ML algorithms. For example, if the training data are biased in some way, such as spatially or temporally, the automated data validation based on the trained model is also biased and could provide participants and experts with misleading information [9].
In addition to bias in the data, it is critical that the data used to train the model are of gold standard quality and validated by experts, since the trained model will be used to verify new data, and if the input data are uncertain, the model will produce false detections [9], such as failing to identify a species (a false negative) or incorrectly detecting an abnormal shape in an MRI image (a false positive). It is important to keep in mind that machine intelligence should not be overestimated in comparison to human intelligence. In other words, when participants receive machine-generated feedback on their contribution, the decision to either modify or retain the contribution should be made by the participants, and human experts should make the final confirmation in such cases. It is also necessary to note that a model trained on data from a specific region is not necessarily applicable in other areas, and applying it there can result in misevaluation and the generation of misleading information. Furthermore, training algorithms on small datasets (such as rare species, see [12]) or multitype datasets (such as a mix of images and metadata, see [5]) and learning how to tune the parameters of the algorithms to achieve the desired performance are hard challenges that must be considered prior to performing automated data validation in citizen science projects.
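The risks of spatially biased training data and of applying a model outside its training region can be checked before automated validation is trusted. The sketch below is a minimal illustration under stated assumptions: it bins coordinates into a regular latitude/longitude grid, and the cell size, the `min_count` threshold, and all function names are hypothetical choices rather than values taken from the reviewed studies.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=1.0):
    """Map a coordinate to the index of its latitude/longitude grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def coverage(training_points, cell_deg=1.0):
    """Count training observations per grid cell."""
    return Counter(grid_cell(lat, lon, cell_deg) for lat, lon in training_points)


def largest_cell_share(cells):
    """Fraction of the training data in the most sampled cell (1.0 = one cell only)."""
    total = sum(cells.values())
    return max(cells.values()) / total if total else 0.0


def is_covered(cells, lat, lon, cell_deg=1.0, min_count=5):
    """True if the training data contain at least min_count points in this cell."""
    return cells[grid_cell(lat, lon, cell_deg)] >= min_count
```

A new observation for which `is_covered` returns False could be sent to an expert instead of being validated automatically, since the model has little or no evidence for that area. The same binning idea applies to temporal bias, with months or seasons in place of grid cells.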

Ethics
• Benefits: ML can be advantageous in filtering sensitive information from citizen science data, such as human faces or license plates in images. Furthermore, ML can be used to detect illegal actions, such as illegal animal trade, through sentiment analysis of information posted on social media platforms such as Twitter [78].
• Risks: One major concern of integrating ML in citizen science is the use of data collected from participants for commercial purposes, which may go against the participants' wishes and result in their disengagement from the project. Thus, it is critical to be transparent and communicate effectively with participants about how their inputs are being used in the algorithms, rather than simply creating a black box project in which the participants' function is limited to producing data and feeding the algorithms [8-10]. As discussed in [8], technology giants like Google and Facebook offer target-oriented advertisement services by selling personal information, which can be a danger for the future of AI-based services used in citizen science projects, as it may reduce participants' confidence in freely sharing their contributions and personal information. Another ethical issue that may emerge from ML-based citizen science projects is the sharing of sensitive data that may be deceptive or result in geoprivacy violations, such as predicting the position of endangered species or predicting participant activity based on the history of their contributions.

Future Challenges and Conclusions
Despite the existing projects and articles on the integration of ML and citizen science, this topic is still in its initial steps and requires further research that discusses other benefits and risks and proposes use cases different from those applied so far. In addition, there are potential challenges and ideas that can be seen as future extensions of this integration, some of which can be realized in the near future of citizen science, while others require more time and investigation before being implemented in practice. The following are some potential challenges and future ideas:
(1) One potential challenge is to explore the integration of ML in biodiversity citizen science projects for rare species identification, for instance, by using approaches such as few-shot learning [79]. In contrast to common ML algorithms, few-shot learning requires a very small amount of data to train the model, and it is primarily utilized in computer vision [80], of which a particular case is one-shot learning for face recognition [81].
(2) The focus of the use of ML in citizen science is currently more on automatic identification and less on user engagement; thus, exploring the use of ML for increasing engagement and sustaining participation remains an area for future investigation. For instance, one potential approach is the use of gamified AI in citizen science to attract more volunteers as well as sustain participation [10,82].
(3) While the impact of machine-generated feedback on sustaining participation has been discussed, one possible future challenge is to determine whether feedback that more closely simulates human responses, rather than repetitive generated feedback, has an impact on increasing engagement.
(4) Training participants has been shown in studies to improve data quality; however, providing training is not always simple and requires both human and financial resources. One suggestion is to use AI to provide training prior to data collection; although this has been achieved in the case of feedback (for example, in the BeeWatch project [66]), AI can be used to provide training in a variety of ways, such as through interactive courses entirely managed by AI.
(5) Participants are more motivated to contribute to a project if there have been prior contributions or if there are other participants to compete with; however, large numbers of contributions can make participants feel less motivated and assume they have little to contribute to the project. One theory is that people in older age groups can become demotivated if there are too many contributions. One role of AI may be to consider user demographics and, as a result, balance how much data each user can see.
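The few-shot learning idea in challenge (1) can be illustrated with nearest-prototype classification, one common few-shot technique in the style of prototypical networks. This is a minimal sketch, not a method taken from [79-81]: it assumes that each image has already been converted into a feature vector by some pretrained embedding model (not shown), and the function names and toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def build_prototypes(support):
    """Average the few labeled examples of each class into one prototype.

    support: dict mapping label -> list of feature vectors (a few per class)
    """
    return {label: mean_vector(vectors) for label, vectors in support.items()}


def classify(query, prototypes):
    """Assign the label of the nearest prototype (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Toy example: two labeled examples per class are enough to form prototypes.
support = {
    "common_species": [[0.0, 0.0], [0.2, 0.0]],
    "rare_species": [[1.0, 1.0], [1.0, 0.8]],
}
prototypes = build_prototypes(support)
predicted = classify([0.9, 1.0], prototypes)
```

Because a class is represented by the mean of only a handful of examples, a rare species could be added from a few verified volunteer observations without retraining the embedding model, although the quality of the result depends entirely on that embedding.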
Furthermore, citizen science data are primarily based on the collection of images/videos or textual data, as seen in the use cases, but with emerging technology, the types of data collected can be extended. For example, some of the most recent smartphones support sensors that acquire LiDAR (light detection and ranging) data, and while this is currently a device-specific feature, given the rapid pace of technological development, we would expect it to be included in many future smartphones. Thus, LiDAR data can be a potential data type obtained in citizen science projects, and although some studies have been performed to identify objects from point clouds using deep learning [83,84], applying such techniques to LiDAR data collected by citizen scientists is a very interesting challenge towards the combination of ML and citizen science.
This review and other recent articles on the integration of AI and citizen science indicate that this combination has considerable potential for both fields. However, it also has consequences: advancements in AI and the power of computers, which in some cases outperform humans, raise the possibility of completely replacing humans in citizen science projects. Nevertheless, certain tasks cannot be performed without human input, such as activities that involve imagination, critical thinking, and communication skills. When combining ML and citizen science, it is critical that the primary goal of citizen science, engaging the general public in scientific projects and sharing knowledge with the public, does not fade away as a result of giving machines too much control. It is equally critical to be transparent about the project and to communicate effectively with volunteers about how ML is being integrated and how the ML algorithms use participants' input. Finally, prior to integrating ML in citizen science, the possible risks and benefits must be thoroughly investigated to determine which carry more weight, as well as to understand how to mitigate risks and maximize benefits from ML integration at all levels of the project, from user engagement to data quality assurance. Aside from the aforementioned concerns, a general aspect to consider is that, while incorporating AI into scientific research can be highly beneficial, it is essential to consider the context in which it is employed. For example, if AI is integrated into education, it is important to ensure that it does not prevent students from thinking by providing automatic answers to questions; the automatic identification of a vegetation type for an environmental science student, for instance, may prevent the student from learning the various land cover characteristics.
A potential extension of this review article would be to look for future AI-based citizen science projects and investigate their effect on each step of citizen science, as well as to elaborate on how the challenges listed above can be successfully addressed. Another potential extension would be to conduct analyses to quantify the risks and benefits discussed here. For example, one approach could be to evaluate the impact of real-time validation and feedback on participants by using indices to measure their engagement with the project, as well as by evaluating the quality of their contributions as a result of learning from the real-time feedback. We have developed a biodiversity citizen science project with the goal of collecting bird observations and using ML techniques to perform automatic data validation based on the location and time of observations. In this project, we provide real-time feedback to volunteers, for example, on bird species habitat characteristics [47]. As a follow-up to this review, we intend to analyze volunteers' behavior and explain the findings in the context of the risks and benefits addressed in this article.
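As an illustration of what engagement indices might look like, the following sketch computes a few simple per-volunteer measures from contribution dates. The indices chosen here (number of contributions, active days, activity span, and the share of days with activity) are hypothetical examples and not the measures that will be used in the planned follow-up study.

```python
from datetime import date


def engagement_indices(contribution_dates):
    """Summarize one volunteer's activity from the dates of their contributions."""
    if not contribution_dates:
        return {"contributions": 0, "active_days": 0,
                "span_days": 0, "activity_ratio": 0.0}
    days = sorted(set(contribution_dates))
    span = (days[-1] - days[0]).days + 1  # inclusive span from first to last day
    return {
        "contributions": len(contribution_dates),
        "active_days": len(days),
        "span_days": span,
        "activity_ratio": len(days) / span,  # share of days in the span with activity
    }
```

Comparing such indices between volunteers who receive real-time feedback and those who do not would be one way to approach the evaluation described above.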

Conflicts of Interest:
The authors declare no conflict of interest.