Buzzing through Data: Advancing Bee Species Identification with Machine Learning

Abstract: Given the vast diversity of bee species and the limited availability of taxonomy experts, bee species identification has become increasingly important, especially with the rise of apiculture practice. This review systematically explores the application of machine learning (ML) techniques in bee species determination, shedding light on the transformative potential of ML in entomology. Conducting a keyword-based search in the Scopus and Web of Science databases with manual screening resulted in 26 relevant publications. Focusing on shallow and deep learning studies, our analysis reveals a significant inclination towards deep learning, particularly post-2020, underscoring its ability to handle complex, high-dimensional data for accurate species identification. Most studies have utilized images of stationary bees for the determination task, despite the high computational demands from image processing, with fewer studies utilizing the sound and movement of the bees. This emerging field faces challenges in terms of dataset scarcity with limited geographical coverage. Additionally, research predominantly focuses on honeybees, with stingless bees receiving less attention, despite their economic potential. This review encapsulates the state of ML applications in bee species determination. It also emphasizes the growing research interest and technological advancements, aiming to inspire future explorations that bridge the gap between computational science and biodiversity conservation.


Introduction
Bee products, such as honey, propolis, bee pollen, royal jelly, beeswax, and bee venom, are valued for their extensive applications in the health, cosmetics, and food industries due to their rich contents of vitamins, minerals, antioxidants, and bioactive compounds. These products, which are essential in alternative medicine and profitable for beekeepers, contribute to the popularity of apiculture [1,2]. Bees are crucial for ecosystem stability and significantly enhance the yield and quality of various crops through their unique pollination process. For instance, bee pollination has been shown to significantly improve both the yield and quality of tomatoes (Solanum lycopersicum L.) and blueberries [3,4]. Moreover, farmers rear bees to provide pollination services to plants grown in controlled environments, such as greenhouses [5]. The estimated economic value of pollination services was USD 577 billion in 2009, highlighting their importance in agriculture [6,7].
However, the global bee population is under threat on multiple fronts, including the use of agrochemical pesticides, diseases, destruction of habitats, changes in weather, genetically altered crops, and competition for limited resources [8]. The decline in bee populations necessitates vigilant monitoring and management to ensure their survival and continued contribution to ecosystems and agriculture [9,10]. This process involves understanding not just the spatio-temporal patterns, biodiversity, and habitats of bee species but also their health and behavior in relation to environmental factors. For instance, beekeepers commonly monitor the foraging behavior of bees to assess the availability of food, colony age, and pesticide impact [11]. In greenhouse environments, where bees are specifically bred for pollination, farmers closely monitor bee activity to determine the optimal timing for replacing beehives. This practice ensures that pollination occurs efficiently, contributing to the health and productivity of the crops being cultivated [12].
Identifying bee species and subspecies is an essential activity that is intimately linked to monitoring efforts. This identification is pivotal for effective breeding, comprehensive conservation strategies, and optimized agricultural practices, which collectively influence the sustainability of bee populations and the overall efficiency of pollination services. Accurately determining the most suitable and economically viable species for various environmental conditions and specific purposes, such as greenhouse cultivation, is critical. In the realm of apiculture, the significant variation in honey production across different bee species and subspecies further highlights the necessity of precise bee identification. This precision is vital not only for supporting effective breeding programs and ensuring product certification but also for enhancing conservation efforts. Ultimately, the precise identification of bee species underpins efforts to optimize agricultural and apicultural productivity, underscoring its importance in maintaining ecological balance and promoting sustainable practices [13].
The conventional techniques of species identification rely on the morphological characteristics of tiny body parts of bees, including the size of various body parts, wing venation, pilosity, and pigmentation. This process necessitates specialized tools and a skilled individual to accurately measure various morphological characteristics. Crucially, keen observation is required to discern subtle features such as pigmentation, facial details, genitalia, and pilosity [14,15]. Fine scaling is normally required to determine species with high precision due to the high similarities among species. Given the subtle differences among species, many cannot be accurately identified in natural settings or from photographs without a direct, sometimes destructive, examination. This often requires the specimens to be collected, cleaned, pinned, and even dissected to allow a detailed inspection of their morphological features under a high-powered microscope. However, this process is time-consuming and increasingly viewed with disfavor, especially for sensitive species [16,17].
Another proven powerful approach is the use of a molecular tool kit for bee identification, which incorporates various markers of the mitochondrial (e.g., tRNAleu-cox 2 intergenic region) and nuclear DNA (e.g., microsatellites and single nucleotide polymorphisms [SNPs]). Despite their proven accuracy, these techniques are rarely used by those who could benefit most, namely beekeepers and breeders, particularly those working with diverse species or engaged in conservation efforts. Precise species identification is essential for effective management and conservation, helping these professionals make well-informed decisions regarding species selection and management practices. The main obstacles to widespread adoption are the high costs, the necessity for expert knowledge, and the need for specialized equipment and reagents [18]. These financial and technical challenges severely restrict the routine use of molecular identification in beekeeping, underscoring a significant gap between the availability of advanced scientific tools and their practical applicability in the field.
The complexity of bee species identification is exacerbated by the sheer diversity of species, estimated at around 20,000 globally [19]. This diversity presents a significant challenge for the limited number of experts in the field, with only about 50 taxonomists worldwide skilled in accurately identifying these species. This scarcity of experts underscores the need for more accessible and efficient identification methods to support the conservation and study of bee populations [20].
Advancements in computer-based technologies and artificial intelligence (AI) are stepping up to meet this need [21]. These technologies offer the promise of high-accuracy species identification with reduced human intervention, significantly enhancing the efficiency of bee monitoring efforts. The evolution of AI, in particular, has proven advantageous in fields where expert knowledge is scarce, providing valuable decision-making support. Consequently, the burgeoning field of machine learning (ML) applications in bee species identification has attracted considerable research interest. This shift towards leveraging ML for bee identification not only reflects the adaptability and potential of technology in biodiversity conservation but also underscores the importance of reviewing and understanding the diverse methodologies and datasets employed across the spectrum of existing studies.
This paper presents a comprehensive literature review focused on the use of machine learning (ML) techniques for identifying bee species, with a particular emphasis on honeybees and stingless bees. It begins with a bibliometric analysis to explore authorship and citation patterns, which helps in understanding the research dynamics within this scientific area [22]. The review methodically examines relevant publications to provide an in-depth analysis of the data, methodologies, technologies, and performance metrics employed. By highlighting the current state of the field, identifying gaps, and pointing out emerging trends, this literature review aims to enrich our understanding and facilitate further advancements in the domain of bee species identification. The ultimate goal is to equip researchers with the insights needed to navigate and contribute to this evolving field, thereby enhancing the application of ML in entomology.

Bee Species Determination Using Machine Learning
Bees, crucial pollinators within the Apoidea superfamily, are distinguished by their vast diversity, with over 20,000 identified species. The taxonomic hierarchy within the Apidae family organizes its vast diversity across multiple levels, encompassing sub-family, tribe, genus, species, and subspecies, providing a structured framework for understanding the intricate relationships and characteristics of each member within this family (Figure S1). They are characterized by their members' roles in pollination and the various social structures observed in the different species. Specifically, the study primarily concentrates on the Apidae family, which encompasses a diverse range of honeybees and stingless bees, although it is not exclusively restricted to these species.
The research considers the determination of any one species within the Apidae family by leveraging ML to mimic human learning through data and algorithms. Although the term "species classification" frequently appears in the machine learning literature, "species determination" is used throughout this manuscript. This terminology was chosen to clearly distinguish this work from taxonomic classification, which organizes species into hierarchical groups based on their evolutionary relationships and morphological characteristics. Using "species determination", the manuscript clarifies that the objective is to identify distinct bee species using machine learning models, avoiding any potential confusion for readers with backgrounds in biological sciences. The ML field encompasses both shallow learning (SL) and deep learning (DL) approaches. Shallow learning, characterized by simpler models with fewer processing layers, includes techniques such as Linear Regression (LR), Decision Trees (DT), and Support Vector Machines (SVM), which, whilst less structurally complex, are only capable of handling simpler data patterns [23]. Deep learning, in contrast, involves sophisticated neural networks with multiple layers, enabling the handling of more complex, high-dimensional data. Prominent examples of deep learning include convolutional neural networks (CNNs) and recurrent neural networks (RNNs), which are widely used in advanced applications such as image and speech recognition [24-26]. A consideration of both SL and DL allows for an encompassing search on the topic, catering to a broad spectrum of complexities found in real-world scenarios.

Search Strategy
A systematic search was conducted in the Web of Science (https://mjl.clarivate.com/, accessed on 13 November 2023) and Elsevier Scopus (https://www.scopus.com/, accessed on 13 November 2023) databases, in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) Statement [27]. These databases are recognized as two of the most comprehensive academic databases, offering extensive publication search capabilities and providing utility in the ranking of journals based on productivity and citation impact; thus, they are capable of providing a robust foundation for a bibliometric analysis [28].
The search protocol targeted original research articles published in the English language that employed ML techniques for the detection or classification of honeybee and stingless bee species. Due to their significant ecological, economic, and scientific importance, these species were the specific focus. To maintain a clear focus, the review excludes non-research materials such as book chapters, conference abstracts, white papers, case reports, editorials, review papers, and technical reports.
Keyword searches were conducted using the advanced search functionalities of the two databases, focusing specifically on the title, abstract, and keywords of publications. The keywords, detailed in Table 1, were strategically selected to balance precision and breadth, ensuring relevance without excluding pertinent studies. This targeted approach was critical to avoid the pitfalls of both overly narrow and overly broad keywords, which could lead to missing relevant papers or retrieving a large volume of irrelevant material, respectively. The selected keywords emphasize studies on the identification and classification of honeybees and stingless bees using machine learning techniques, reflecting the crucial roles these species play in agriculture and biodiversity, and their cultivation for valuable bee products. The search was particularly aimed at capturing publications that address the challenges of species determination and the application of machine learning methodologies in this context. Extracted data included the publication titles, abstracts, keywords, authors and their affiliations, publication years, and citation counts up to the date of the search. The searches on the two databases were conducted on 13 November 2023, and hence, publications indexed in the databases up to the date of the search were included. In total, 117 and 164 publications were captured from the Web of Science and Scopus databases, respectively. Fifty-five publications were identified as duplicates, appearing in search results from both the WoS and Scopus databases, and were subsequently removed. Another publication, which was not available on the internet, was also removed. A total of 225 publications were considered for the screening stage. A PRISMA flow chart of the search is presented in Figure 1.
The titles and abstracts of the publications were perused and thoroughly analyzed to ensure relevance to the topic, with 193 publications deemed irrelevant and another three publications excluded due to accessibility issues with the full texts. After screening the full texts, an additional three articles were identified as unrelated because they either did not utilize ML or did not specifically target bee species determination. Finally, only 26 publications were retained for further analysis.
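The record counts reported in the search and screening steps can be checked with a short calculation. The sketch below only restates the numbers given in the text; the variable names are illustrative and are not taken from the PRISMA statement.

```python
# Record counts as reported in the search and screening description.
wos_records = 117
scopus_records = 164
duplicates = 55
unavailable_online = 1
irrelevant_by_title_abstract = 193
full_text_inaccessible = 3
unrelated_after_full_text = 3

# Identification: all records retrieved from both databases.
identified = wos_records + scopus_records

# Screening pool: duplicates and the unavailable record removed.
screened = identified - duplicates - unavailable_online

# Records whose full texts were assessed.
full_text_assessed = screened - irrelevant_by_title_abstract - full_text_inaccessible

# Publications retained for the review.
included = full_text_assessed - unrelated_after_full_text

print(identified, screened, full_text_assessed, included)
```

Under these counts, the 225 screened records and the 26 retained publications stated in the text are mutually consistent.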
It is noted that the review is limited to publications indexed in either the Web of Science or Scopus databases up to the search date. Additionally, publications that may be relevant to the search topic of bee species determination using machine learning methods but have not been captured by the keyword search strategy were not included in the analysis.

Bibliographic Analysis
The bibliographic analysis was conducted using a quantitative analysis approach and knowledge mapping techniques on the 26 publications retained after the screening process, using data obtained from Scopus and WoS. To facilitate the analysis, the publications were grouped into three categories, depending on the method utilized for the bee identification or species determination task. The three categories are shallow learning (SL), deep learning (DL), and combinations of both SL and DL.
Knowledge mapping, including an analysis of the co-occurrence of author-specified keywords and collaborations, utilized the VOSviewer software (version 1.6.20). Results from the Scopus database were utilized for this analysis since it contains the majority of the articles (88.46%). Perianes-Rodriguez et al. [29] advocate fractional counting over full counting for its ability to provide proper field-normalized results, allocating co-authorship contributions proportionately among the authors [30]. This analytical approach was applied, generating insights into the co-occurrence network of the most frequently used keywords and mapping co-authorship patterns across countries and among various organizations, thereby enhancing the understanding of collaborative trends and thematic focuses within the research community.
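The difference between full and fractional counting can be illustrated with a small example. The sketch below assumes one common formulation of fractional counting, in which a paper with n contributors adds 1/(n - 1) to each pairwise link so that each contributor receives a total link strength of one per paper; the exact weighting applied by the software is not stated in the text. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def link_strengths(papers, fractional=True):
    """Return pairwise link strengths between contributors.

    papers: list of lists, each holding the contributors
    (e.g., countries or authors) of one publication.
    """
    strengths = defaultdict(float)
    for contributors in papers:
        unique = sorted(set(contributors))
        n = len(unique)
        if n < 2:
            continue  # a single-contributor paper creates no links
        # Full counting: every pair gets 1 per paper.
        # Fractional counting (assumed form): every pair gets 1 / (n - 1).
        weight = 1.0 / (n - 1) if fractional else 1.0
        for a, b in combinations(unique, 2):
            strengths[(a, b)] += weight
    return dict(strengths)


# Hypothetical toy data, not taken from the reviewed publications.
papers = [["A", "B"], ["A", "B", "C"]]
full = link_strengths(papers, fractional=False)
frac = link_strengths(papers, fractional=True)
print(full)
print(frac)
```

In this toy case, the three-contributor paper adds a full unit to every pair under full counting but only half a unit under fractional counting, so publications with many contributors do not dominate the network.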

Detailed Review of Bee Identification/Species Determination Techniques
After the bibliographic analysis, the systematic review process proceeded with a detailed analysis of selected publications. Datasets that have been utilized in the different publications were studied before the selected publications were categorized by ML approach (shallow learning, deep learning, or a combination of both) for a deeper analysis. Studies within each category focusing on different bee species determination techniques were examined to encapsulate the innovative concepts and methodologies documented in the publications. This process also aimed to enhance the understanding of advancements in machine learning applications for bee species determination. Finally, the different performance measures that have been utilized were discussed.

Bibliographic Analysis
A total of 281 publications were initially extracted from both databases: 117 and 164 publications from the Web of Science and Scopus databases, respectively. This was reduced to 26 relevant publications after accounting for duplications, irrelevant publications, and other exclusions. Only publications indexed in the databases up to 13 November 2023 were included in this study.
The categorization of methodologies across the 26 studies reveals a prevailing preference for deep learning techniques. Deep learning approaches, including CNNs and RNNs, are utilized in 12 publications, constituting 46.15% of the total publications. Five publications, or 19.2% of the total, utilized both deep and shallow learning for their classification. On the other hand, only nine studies rely solely on shallow learning, with the earliest shallow learning application tracing back to 2001. This trend underscores a significant shift towards the adoption of complex ML techniques in the research field. A detailed summary of each publication, categorized by the methods employed, is listed in the Supplementary Materials, specifically Tables S1-S3.
Figure 2 illustrates a temporal analysis of the 26 publications on machine learning for bee species determination, showing a significant trend in research focus and methodologies from 2014. An increasing number of annual publications is observed, particularly from 2020. This pattern suggests growing interest within the research community. It is noted that since the search was performed at the end of the year 2023, the publication data for 2023 may be incomplete as it normally takes a few months for publications to be indexed in the databases. This rising trend is further reinforced by the annual citation data shown in Figure 3, indicating a substantial growth in interest post-2017 and emphasizing the field's dynamic and emerging nature. Additionally, the trend towards deep learning (DL) has become particularly pronounced post-2020, as evidenced by the year-on-year increase in publications utilizing DL.
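The reported shares follow directly from the category counts. A minimal sketch, using only the counts stated in the text (the dictionary keys are illustrative labels):

```python
# Number of publications per method category, as reported in the text.
counts = {"deep learning": 12, "shallow learning": 9, "both": 5}

total = sum(counts.values())

# Percentage share of each category, rounded to two decimals.
shares = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total, shares)
```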
An analysis of the most frequently used keywords by authors offers valuable insight into the prevailing research interests and priorities within the scientific community [31]. Consequently, scrutinizing the author-defined and index keywords in the selected publications is crucial for understanding the current trends and focal points in the field. A total of 272 keywords were identified, and setting a minimum co-occurrence of three for these keywords to be included in the analysis resulted in the 21 most used keywords, as shown in Table 2. The accompanying keyword co-occurrence network in Figure 4 illustrates each keyword as a node, with the node size scaled to reflect the frequency of keyword occurrences. Links between nodes represent the co-presence of the keywords in publications. Moreover, the color coding in Table 2 and Figure 4 distinguishes different clusters, with each color representing a group of keywords that frequently appear together. This method highlights thematic connections between keywords, facilitating an intuitive understanding of the main research areas and their interrelationships within the field.
Five clusters are formed by the keywords, with significant correlations observed between keywords belonging to the same cluster. Signifying the intersection of biological and computational sciences, the red cluster highlights the fundamental aspects of bee study (taxonomy and pollination) alongside advanced analysis techniques (machine learning and computer vision). Prominence is given to "bee" and "machine learning", with occurrences of nine and eight, respectively, within this cluster, underlining their critical roles in the review. This emphasis reflects a focused exploration of machine learning applications in the nuanced task of bee species determination and demonstrates a cohesive blend of entomological research with computational advancements.
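The construction of a thresholded keyword co-occurrence network can be sketched as follows. This is a minimal illustration, not the procedure implemented by the software used in the review: it interprets the threshold as a minimum number of publications in which a keyword occurs, and the function name and toy keyword lists are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(keyword_lists, min_occurrences=3):
    """Return (nodes, links) for a thresholded keyword network.

    nodes: keyword -> number of publications containing it
    links: (keyword_a, keyword_b) -> number of shared publications
    """
    # Count each keyword once per publication, ignoring case.
    occurrences = Counter()
    for keywords in keyword_lists:
        occurrences.update({k.lower() for k in keywords})

    # Keep only keywords that meet the threshold.
    kept = {k for k, c in occurrences.items() if c >= min_occurrences}

    # Count co-presence of the kept keywords within each publication.
    links = Counter()
    for keywords in keyword_lists:
        present = sorted(kept & {k.lower() for k in keywords})
        links.update(combinations(present, 2))

    nodes = {k: occurrences[k] for k in kept}
    return nodes, dict(links)


# Hypothetical toy data, not taken from the reviewed publications.
docs = [
    ["bee", "machine learning", "taxonomy"],
    ["bee", "machine learning", "deep learning"],
    ["Bee", "Machine Learning"],
    ["deep learning", "bee"],
]
nodes, links = cooccurrence(docs, min_occurrences=3)
print(nodes, links)
```

Node sizes in a network plot would then scale with the values in `nodes`, and link widths with the values in `links`; clustering of the nodes is a separate step that is not covered here.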
The green cluster is the second largest cluster and directly relates to the exploration of bee species determination using machine learning. It emphasizes the integration of artificial intelligence with techniques such as support vector machines to classify bee species, particularly Apis mellifera. The prevalent focus on A. mellifera is due to the widespread availability of specimens, which facilitates extensive research. This abundance enables the application of sophisticated learning systems that enhance species differentiation, advancing the precision of entomological research through computational methods.
The yellow cluster includes keywords such as "deep learning", "convolutional neural network", "ecology", and "transfer learning". These reflect the use of advanced computational strategies in bee species determination. The cluster highlights how deep learning technologies, which are essential for interpreting complex ecological data, play a pivotal role in accurately identifying bee species. Meanwhile, the blue cluster features keywords such as "Apoidea", "biodiversity", "Hymenoptera", and "pollinator", underlining the ecological and biological context of bee species determination. It emphasizes the importance of understanding bee diversity and the role of pollinators within ecosystems, bridging the gap between computational methods and ecological insights in the study of bees.
The smallest cluster, purple, comprises only the keywords "image classification" and "object detection". These terms underscore the technological methodologies that enable the precise identification of bee species through visual data. Image classification and object detection are pivotal in analyzing bee images, enabling species differentiation by recognizing patterns and features within visual datasets. This cluster accentuates the significance of advanced image processing techniques in enhancing the accuracy and efficiency of machine learning models for bee species determination tasks.
The keywords "classification", "bee", "machine learning", and "deep learning" are the four highest-occurring keywords, with eleven, nine, eight, and eight occurrences, respectively. Central to bee species determination using machine learning, these terms underscore the synergy between biological importance and technological progress. They highlight the use of machine learning's cutting-edge capabilities, particularly deep learning, for precise bee species identification, illustrating the blend of biological research with computational advances.
Figure 5 presents the global collaboration network in bee species determination research, focusing exclusively on the largest connected network for visual clarity. This network prominently features the intensive collaborative links among European countries such as Portugal, France, Ireland, and Belgium, with Portugal serving as a central hub of activity. Similarly, Brazil plays a significant role, demonstrating substantial links with Chile, Finland, and Australia. Additional collaborative efforts involving Germany and Austria, as well as the United States and Canada, while significant, are not visually represented due to their smaller network size but are acknowledged in the text to ensure comprehensive coverage of global efforts. Figure 6 further details the scope of these collaborations by quantifying the number of publications and the strength of the collaborative links, adopting a fractional counting method for calculation. Brazil is highlighted as leading in these metrics, followed by Australia and Chile. European countries, despite fewer publications, display robust collaborative ties. Bangladesh is noted as the sole significant Asian contributor, with the US and Canada also showing moderate collaboration strength based on fewer publications.
Our analysis has identified 77 organizations from various countries that contributed to the selected publications, with notable inter-institutional collaborations. The largest network of collaborators, consisting of 14 organizations, was considered and is presented in Figure 7. The Research Centre in Digitalization and Intelligent Robotics (CEDRI) in Portugal emerges as a leader in collaborative research, producing two significant publications on bee species determination through partnerships both within Portugal and internationally. Its collaborators include organizations from France, Brazil, Poland, the United Kingdom, Russia, Switzerland, Belgium, and Ireland.

Dataset Characteristics
The reviewed publications consider various types of data for bee species discrimination that serve as input to the models, with the type of data influencing the model and its effectiveness for the bee species determination tasks. Generally, three data types have been employed: (1) images of the bees, either as static images or a time-series of images, i.e., videos, of the full body of the bees or specific parts of them; (2) acoustic features of the bees, including buzzing and flying sounds; and (3) movement of the bees, such as wing beat movement or traffic patterns of the bees. Figure 8 depicts a doughnut chart, with the inner layer representing the methods utilized and the outer layer representing the proportion of the different data types used.
Images serve as the primary data type for bee species determination, being used in 20 publications (76.9% of the total), including one publication using both image and movement data. These studies utilize either full images of the bees (9 publications or 34.6%) [16,32-39] or wing images only (10 publications or 38.46%) [13,20,40-47]. While the authors of [36] were unique in capturing video footage of bee species for their study, they opted for an unconventional approach by focusing the classification analysis on a single frame rather than exploiting temporal variations across multiple frames for enhanced accuracy. The utilization of wing images over other body parts is attributed to their unique venation patterns distinct to each species, their consistency compared to other bodily features, and the ease of imaging and analysis they offer. Collectively, these make the wing image an excellent and reliable morphological feature for species differentiation and suitable for automated analysis. Only one study relies on images of bees in flight for species determination [35], while the others rely on images taken when bees are relatively stationary. This highlights the challenges of capturing images of airborne bees due to their rapid movement, small size, and the presence of potential occlusions.
Only six publications, or 23% of the reviewed publications, utilized the acoustic sounds or movement of the bees for the species determination task. Acoustic data comprise the buzzing and flying sounds produced by bees. Though these sounds are interrelated, they exhibit unique features. Buzzing is primarily produced by wing vibrations and may vary in frequency; it serves as a means of communication and temperature regulation within the hive. Flying sounds, on the other hand, are generated specifically during flight and are characterized by changes in pitch and intensity that depend on the bees' flight speed and wing size. Analyzing these auditory signals offers valuable insights into species differentiation.
Some studies have also focused on the analysis of wing beats and bee traffic, categorized broadly here as movement of the bees. Wing beats, the rapid flapping of bees' wings, have distinct patterns that can vary between species, and these variations can be crucial for differentiating between bee species. Bee traffic refers to the frequency and patterns of bees moving in and out of beehives, and may similarly provide insights for bee species determination. Both wing beats and bee traffic patterns are valuable data sources, offering unique perspectives on bee species, their behaviors, and their ecological roles.
Nearly all studies employing deep learning for bee species determination use image data, leveraging the capacity of deep learning algorithms such as CNNs to handle complex, high-dimensional visual information. Conversely, only about half of the shallow learning studies rely on image-based classification; the remainder tap into movement and acoustic data, reflecting the suitability of shallow learning for simpler, less complex data types. The preference for deep learning in image-based species determination is due to its capability to detect and learn complex patterns and features from visual data, a critical factor for achieving precise classification. On the other hand, species determination tasks based on bee movement exclusively utilized shallow learning algorithms, which underscores their efficiency and adaptability in interpreting bee motion behaviors.
The volume and diversity of datasets play crucial roles in ML. Large datasets improve the robustness of ML models, allowing them to learn and generalize better, whilst diversity in data, especially in class representation, is vital for ensuring that the model can accurately identify and classify a wide range of bee species. Ideally, the dataset should also be well-balanced, with each class equally represented. This prevents biases during training of the model and consequently enhances the overall performance of the species determination task. Figure 9 illustrates the number of species considered in each publication, with the size of each circle proportional to the size of the dataset. Data types are distinguished by color: purple represents acoustic data, blue represents image data, brown represents movement data, and green indicates image and movement data combined. The image data type constituted the most extensive and diverse collection of datasets. Buschbacher et al. [40] classified the highest number of species, with 127 species of bees represented using wing images, while the study in reference [16] relied on more than 89,000 full-body images of bees; these represent the most diverse and the most comprehensive datasets, respectively. This is expected, as it is much easier to obtain a good dataset of images than of other data types, especially with the fast progression of mobile technology. Parmezan et al. [48] utilized more than 56,000 samples of the movement data type to classify a total of seven species. It is interesting to note that the diversity of bees considered does not necessarily correlate with the volume of the dataset, with some researchers relying on smaller datasets to classify a larger number of species, and vice versa.
Figure 10 shows the distribution of publications focusing on the different species of honeybees, stingless bees, and others, including bumblebees, hornets, and other insects. Seventeen publications focused on honeybees, either to identify different species and subspecies of honeybees, or in conjunction with other insects. On the other hand, only five publications considered a dataset of different species of stingless bees. This is despite the diversity of stingless bees, which amounts to more than 600 species worldwide, and their economic potential in producing high-value bee products.
The geographical origins of the datasets hold significance, as some bee species are endemic to specific regions. Figure 11 illustrates the origins of the datasets for the 13 studies that reported them; the remaining 13 studies did not disclose the geographical source of their datasets. Furthermore, some studies employed datasets originating from multiple geographical sources. Four studies considered datasets originating from Brazil, one of them in conjunction with species from Germany, the United States, and China. Datasets originating from the United States, China, and Spain were featured in two publications each, while datasets from twenty-one other countries were each considered in only one publication. The geographical distribution of datasets illustrates that data collection is not diverse enough, with some regions unrepresented despite their well-known activity in apiculture. This omission underscores the importance of regional diversity in datasets to accurately reflect the global distribution of bee species and ensure the robustness of species determination models.
Fifteen publications reported undertaking original data collection, a notable figure that highlights the significant efforts to amass new data in the field. This number includes two studies that augmented their data with external sources. A novel method of data collection, crowdsourcing, was reported in one publication [32], showcasing an innovative strategy for data acquisition in a field with limited dataset volume and diversity. Additionally, several studies leveraged previously published datasets, as well as data from public databases, including iNaturalist, BugGuide, BeeSpotter, the CREA Research Centre for Agriculture and Environment (CREA-AA), Kaggle, MS COCO, and the Morphometric Data Bank in Oberursel, Germany, reflecting a reliance on existing resources. Together, these points underscore the significant challenge of dataset scarcity, particularly in certain geographical regions, and the diverse methods researchers are employing to address this issue.
Across the 16 publications that embarked on original data collection, including the one crowdsourcing approach, a diverse range of methodologies was utilized, from direct photography to acoustic recordings. This multifaceted approach highlights the creative strategies adopted to navigate dataset limitations and emphasizes the commitment to expanding research into underrepresented geographical areas, enriching the field of bee species determination with varied data collection methods.
Various methods were employed for wing image collection. In some studies, bee wings were dissected [46], while in others the bees were merely immobilized using an icebox for a less invasive approach [20]. To enhance the detail of these images, researchers used optical magnification tools, including stereomicroscopes [45], microscopes [20], and magnifying glasses [46], although some images exhibited salt-and-pepper noise [46], making the task more challenging. These instruments enhance the visibility of small details and are particularly useful in the study of bee morphology, where a precise examination of features such as wing venation is crucial for species determination. Some researchers have also taken direct camera shots of the wings without magnification and still achieved an accuracy of 96% [44]. For full-body bee images, researchers employed various techniques, including trap cameras positioned at hive entrances [33,42] and sticky yellow plates to capture bees [39]. Additionally, infrared (IR) imaging was utilized [35], taking advantage of its low light impact, which is crucial for capturing images without altering the natural behavior or appearance of bees, some of which may vary in color with age. These methods enable detailed observations with minimal disturbance, facilitating the collection of valuable data for classification and study.
Acoustic data, including buzzing and flight sounds, were recorded using both general-purpose microphones [12] and specialized wildlife acoustic recorders designed for capturing natural sounds, such as the SongMeter SM2 [49], though no significant difference in performance was noted across these methods. Additionally, wing beats were detected using optical sensors in two studies [48,50], which emphasized the need to monitor environmental factors such as temperature and humidity, as these influence wing beat frequency. This underlines the importance of considering environmental conditions when capturing bee movements and sounds.

Methods for Bee Species Determination
This section delves into the methodologies for bee species determination via machine learning. It is divided into three sub-sections focused on studies using shallow learning, deep learning, and a combination of both. This approach aims to connect the characteristics of the datasets with the specific machine learning methods applied, providing a comprehensive perspective on the current methodologies in the field and their impacts on advancing bee species determination research.

Publications Utilizing Shallow Learning (SL) Only for Species Determination
Of the nine publications utilizing shallow learning methods, four utilized images (specifically, wing images), while two and three publications utilized acoustic and movement data, respectively, for the species determination task.
The four publications employing wing images for bee species determination showcase a variety of approaches, focusing on different features such as basal cells, landmarks, and vein skeletons, alongside various machine learning algorithms, including SVM, Multilayer Perceptron (MLP), and K-Nearest Neighbor (KNN). Early work in 2001 proposed a novel approach to bee species determination by focusing on the stable features of basal cells in wing images, detected through lines and intersections [20]. Utilizing Linear Discriminant Analysis (LDA), the method extracted key features to apply a deformable template specific to bee families, aiding in the identification of other wing cells. This process culminated in the use of SVM or Kernel Discriminant Analysis (KDA) to recognize species based on these extracted features, demonstrating a sophisticated blend of image analysis and machine learning techniques for species determination.
While the earlier work focused on basal cell features in bee wings, reference [46] expanded the approach by utilizing both landmark-based features and features of the associated images. The landmarks, located on wing veins, were automatically detected using TpsDig software (version 2.16). Figure 12 illustrates a sample honeybee wing showing the cells, veins, and landmarks. An Orthogonal Procrustes Analysis enabled the extraction of coordinates invariant to translation, scale, and rotation. The images were segmented into 256 quadrants, with the mean and standard deviation of each quadrant's blue channel serving as features. Among several shallow learning algorithms tested, the MLP showed superior performance. This study underscores the enhanced accuracy achieved by integrating image features with landmark information, outperforming methods relying solely on landmark features.
Silva et al. [47] adopted a landmark-based approach for bee species determination, examining seven combinations of feature selectors and classifiers. By analyzing 19 landmark positions and generating 84 additional landmark-based features, including centroid size, weight matrix, and relative warp scores, the study employed various feature selection methods (information gain, chi-square, correlation, and Fisher's separation criterion) to refine the dataset. The research assessed classification efficacy using a variety of classifiers: LDA, Naïve Bayes (NB), Logistic, KNN, C4.5, MLP, and SVM. The NB classifier combined with the correlation feature selector yielded the best performance, underscoring the crucial role of feature selection in improving performance. Despite its advances, the study did not surpass the performance previously reported in reference [46], highlighting the significant impact of integrating diverse wing feature data for more precise bee species determination.
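The Procrustes step described for reference [46] can be illustrated with a minimal sketch. It assumes 2D landmarks and a single reference shape, and it restricts the fit to rotations (no reflections). The function names and the five toy landmarks are hypothetical and are not taken from the reviewed studies, which relied on dedicated morphometric software.

```python
import math

def center_and_scale(points):
    """Translate landmarks to the origin and scale them to unit centroid size."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    # Centroid size: root of the summed squared distances to the centroid.
    size = math.sqrt(sum(x * x + y * y for x, y in centered))
    return [(x / size, y / size) for x, y in centered]

def procrustes_align(shape, reference):
    """Align `shape` to `reference`, removing translation, scale, and rotation."""
    a = center_and_scale(shape)
    b = center_and_scale(reference)
    # Closed-form least-squares rotation angle for paired 2D point sets.
    num = sum(ax * by - ay * bx for (ax, ay), (bx, by) in zip(a, b))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    theta = math.atan2(num, den)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in a]

# Hypothetical landmarks: the same wing photographed at a different
# position, magnification, and orientation.
reference = [(0.0, 0.0), (4.0, 0.0), (5.0, 2.0), (2.0, 3.0), (0.0, 2.0)]
angle = math.radians(30)
moved = [(2.5 * (math.cos(angle) * x - math.sin(angle) * y) + 10.0,
          2.5 * (math.sin(angle) * x + math.cos(angle) * y) - 3.0)
         for x, y in reference]
aligned = procrustes_align(moved, reference)
```

After alignment, the coordinates of `moved` coincide with the normalized reference, which is why such coordinates can be used as features that do not depend on how the wing was positioned under the camera.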
A fully automatic approach for bee species determination through wing image analysis was proposed in reference [44], achieving an accuracy of up to 96% for species identification. It enhanced the classification accuracy by integrating color features with patterns derived from the wing's vein skeleton, employing the KNN algorithm for classification. The method significantly minimized manual intervention by utilizing automated image processing techniques for feature extraction, showcasing the potential of combining different feature types for robust classification.
Two publications utilized SL techniques to determine bee species from the acoustic sounds produced by the bees. Kawakita and Ichikawa [12] analyzed the flight sounds of bees and hornets with ML for species determination. A 12 kHz low-pass filter was applied to remove noise before Mel-frequency cepstral coefficient (MFCC) features were extracted from the sounds. The extracted features were then fed into an SVM classifier for bee identification. The study highlights that the fundamental frequency of flight sounds varies by species, and that acoustic properties such as harmonic components can differ significantly between species and background noise. Similarly, Ribeiro et al. [49] used ML to recognize tomato-pollinating bees from their buzzing sounds, collecting flight and sonication sounds with a microphone. After preprocessing, MFCC features were extracted and various classifiers were tested, with SVM performing better than the others, such as LR and RF. However, the overall performance was relatively low. The study emphasized a strong relationship between sonication sounds and species, suggesting that further research may be beneficial, especially with a larger sample size and different learning systems for improved species determination performance.
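The observation that the fundamental frequency of flight sounds varies by species can be made concrete with a small sketch. The code below estimates the fundamental frequency of a signal from its autocorrelation peak. It is not the MFCC and SVM pipeline used in the reviewed studies; the search band (150 to 500 Hz), the sample rate, and the synthetic tone are assumptions chosen only for illustration.

```python
import math

def fundamental_frequency(signal, sample_rate, f_min=150.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    mean = sum(signal) / len(signal)
    x = [s - mean for s in signal]
    # Only lags that correspond to the assumed frequency band are searched.
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(x[t] * x[t + lag] for t in range(len(x) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

# Synthetic stand-in for a flight sound: a 250 Hz tone plus one harmonic.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 250 * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 500 * t / sample_rate)
        for t in range(2048)]
estimate = fundamental_frequency(tone, sample_rate)
```

A single number such as this is a far coarser descriptor than MFCCs, which summarize the whole spectral envelope, but it shows the kind of species-dependent cue that the acoustic studies exploit.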
Three publications specifically applied SL algorithms for bee species determination based on movement analysis, with two of these publications utilizing wing beat patterns captured by relatively affordable optical sensors [48,50]. Parmezan et al. [48] focused on the hierarchical classification of flying insects, notably bees and wasps, under changing environmental conditions. The research, conducted in Brazilian fields, used wing beat data to extract features related to the energy sum of frequency peaks, the positions of the harmonics, the spectrum complexity, and the fundamental frequency of wing beats, together with environmental variables such as temperature, humidity, and time of day, noting the significant impacts of temperature and humidity on wing beat sounds. A variety of classification approaches were explored, including flat, local (local classifier per node and local classifier per parent node), and global approaches, with particular emphasis on the two local approaches alongside hybrid models. These approaches leverage the SL algorithms KNN, MLP, NB, RF, and SVM, which were chosen for their abilities to manage the complexity and variability of environmental impacts. The local classifier per node approach utilizing MLP gave the highest performance.
Wing beat data, which are influenced by environmental factors such as temperature and humidity, were also leveraged by Herrera et al. [50], who successfully classified seven hymenopteran species, including Vespa velutina Lepeletier, 1836, bees, and wasps, by employing Power Spectral Density (PSD) features with an RF algorithm. The RF model, using 14 PSD peak and valley values as features, demonstrated notable classification accuracy. Although the data were collected inside an entomological tent, which may have affected the flight characteristics of the insects, the recorded fundamental frequency of the wing beats was in agreement with other reported studies. Additionally, the publication aligned with other findings that wing beat frequencies inversely correlate with body size in Hymenoptera species. Together, these publications showcase sophisticated methods for leveraging SL in ecological studies, underscoring the significant potential for accurate species identification and ecological monitoring through wing beat analysis amidst environmental fluctuations.
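The "local classifier per parent node" strategy explored by Parmezan et al. [48] can be sketched as follows. This is a toy illustration under stated assumptions: the two-level taxonomy, the labels, and the feature values (wing beat frequency in Hz and temperature in °C) are invented, and a nearest-centroid rule stands in for the SL classifiers (KNN, MLP, NB, RF, SVM) used in the study.

```python
import math
from collections import defaultdict

# Hypothetical taxonomy: root -> group -> species (labels are illustrative).
HIERARCHY = {"root": ["bee", "wasp"],
             "bee": ["bee_A", "bee_B"],
             "wasp": ["wasp_A", "wasp_B"]}
PARENT = {child: parent for parent, kids in HIERARCHY.items() for child in kids}

def path_from_root(leaf):
    """Return the label path from the root down to a leaf, e.g. root/bee/bee_A."""
    path = [leaf]
    while path[0] != "root":
        path.insert(0, PARENT[path[0]])
    return path

class NearestCentroid:
    """Tiny stand-in classifier: predicts the label with the closest class mean."""
    def fit(self, X, y):
        sums, counts = {}, defaultdict(int)
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for i, value in enumerate(features):
                acc[i] += value
            counts[label] += 1
        self.centroids = {lab: [v / counts[lab] for v in acc]
                          for lab, acc in sums.items()}
        return self

    def predict_one(self, features):
        return min(self.centroids,
                   key=lambda lab: math.dist(features, self.centroids[lab]))

def train_per_parent(X, leaves):
    """Fit one classifier per parent node, using only samples below that node."""
    models = {}
    for parent in HIERARCHY:
        Xp, yp = [], []
        for features, leaf in zip(X, leaves):
            path = path_from_root(leaf)
            if parent in path[:-1]:
                Xp.append(features)
                yp.append(path[path.index(parent) + 1])  # child on the path
        models[parent] = NearestCentroid().fit(Xp, yp)
    return models

def predict(models, features):
    """Walk top-down from the root until a leaf label is reached."""
    node = "root"
    while node in models:
        node = models[node].predict_one(features)
    return node

X = [(230, 25), (235, 27), (180, 24), (185, 26),
     (110, 25), (115, 27), (150, 24), (145, 26)]
y = ["bee_A", "bee_A", "bee_B", "bee_B", "wasp_A", "wasp_A", "wasp_B", "wasp_B"]
models = train_per_parent(X, y)
label = predict(models, (148, 25))
```

Each parent node only has to separate its own children, so the root decides between groups and the group-level classifiers decide between species. A flat approach would instead train one classifier over all leaf labels at once.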
The foraging pattern of bees has been used to discriminate between species of bees, specifically through the analysis of the bees' arrival and departure times as they forage [51]. Individual bees were uniquely tagged with Radio Frequency Identification (RFID) transponders, and an RFID reader recorded their movements to and from the hive. Various features, including the foraging frequency at every hour of the day, as well as the sum, median, and standard deviation over different groups of bees, were generated and fed to three different SL models: RF, a bagging ensemble of MLPs, and an SVM. The RF algorithm, particularly with features generated by grouping the bees into groups of 12, gave the best accuracy of 87.41% in differentiating two species of bees. Despite the potential disturbance caused by attaching the RFID transponders, the approach opens new paths for species differentiation, showcasing the potential of machine learning in utilizing foraging patterns for ecological research.
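The feature construction described above can be sketched in a few lines. The event format (bee identifier, hour of day), the function names, and the use of the population standard deviation are assumptions made for illustration. The toy example groups bees in pairs, whereas the best result in [51] was obtained with groups of 12.

```python
from collections import defaultdict
from statistics import median, pstdev

def hourly_counts(events):
    """Count RFID detections per bee for each hour of the day (0-23)."""
    counts = defaultdict(lambda: [0] * 24)
    for bee_id, hour in events:
        counts[bee_id][hour] += 1
    return dict(counts)

def group_features(counts, group_size):
    """Describe each group of bees by the per-hour sum, median, and std."""
    bee_ids = sorted(counts)
    rows = []
    for start in range(0, len(bee_ids) - group_size + 1, group_size):
        group = [counts[b] for b in bee_ids[start:start + group_size]]
        per_hour = list(zip(*group))  # 24 tuples, one value per bee
        rows.append([sum(h) for h in per_hour]
                    + [median(h) for h in per_hour]
                    + [pstdev(h) for h in per_hour])
    return rows

# Hypothetical detections: (bee identifier, hour of the day).
events = [("bee1", 9), ("bee1", 9), ("bee1", 14), ("bee2", 9),
          ("bee3", 10), ("bee4", 10), ("bee4", 10), ("bee4", 15)]
rows = group_features(hourly_counts(events), group_size=2)
```

Each row holds 72 values (24 hourly sums, 24 medians, and 24 standard deviations) and could be passed, together with a species label, to a classifier such as RF.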

Studies Utilizing Deep Learning (DL) Only for Species Determination
Deep learning (DL) techniques dominate bee classification research, with a total of twelve publications exclusively utilizing DL techniques. Eleven of the twelve publications leveraged image-based data for species determination, whilst only a single study explored the use of acoustic sound for this task, indicating a strong preference for visual over auditory data in current DL applications within the bee species identification field.
Ferreira et al. [3] aimed to enhance the automatic recognition of pollinating bee species using DL models, focusing on their buzzing sounds. They recorded sonication and flight sounds as bees approached flowers, and then transformed these sounds into log-mel spectrograms for analysis by CNNs such as EfficientNet V2 and Pre-trained Audio Neural Networks (PANNs). Data augmentation techniques such as mixup, SpecAugment, and random truncation were used to address data imbalance, and cross-validation was employed to guard against overfitting. These methods significantly improved the classification performance, with EfficientNet V2 achieving the highest accuracy of 58.04%, highlighting the potential of DL in bioacoustic species identification and the importance of data augmentation and pre-training.
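Of the augmentation techniques listed above, mixup is the simplest to illustrate: two training examples and their one-hot labels are blended with a weight drawn from a Beta distribution. The sketch below is a generic version of the idea, with an assumed `alpha` value and short lists standing in for flattened spectrograms; it does not reproduce the settings used in [3].

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.4, rng=random):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Hypothetical flattened spectrogram patches for two species.
rng = random.Random(0)
mixed_x, mixed_y, lam = mixup([0.0, 1.0, 2.0], [1.0, 0.0],
                              [4.0, 3.0, 2.0], [0.0, 1.0], rng=rng)
```

The mixed label remains a valid probability vector, so the network is trained on soft targets that lie between the two original classes.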
Three publications utilized wing images with DL for the bee species determination task [13,40,43]. De Nart et al. [13] considered four different DL models, InceptionResNet V2, InceptionNet V3, MobileNet V2, and ResNet 50, for the determination of seven subspecies and one hybrid species of honeybees based on their wings. All models achieved over 92% accuracy, with InceptionNet V3 giving the highest accuracy of 99.12%. Different DL models, including LeNet-5, AlexNet, ResNet50, Inception v3, InceptionResNetV2, VGG-16, and VGG-19, were evaluated for the determination of 19 species of bees and 10 species of butterflies based on images of their wings [43]; the discussion here focuses on the bee results because of their relevance to this review. The bee wing dataset, being notably small and unbalanced, posed a challenge, as DL models are generally designed for large datasets. Data augmentation techniques, including perspective skewing, elastic distortion, rotation, and shearing, together with transfer learning, were therefore used to address this challenge. Among the models, the pre-trained InceptionResNetV2 with data augmentation stood out, achieving an accuracy of 94.40% for bee determination, highlighting the benefits of data augmentation and transfer learning in improving performance on relatively small, ecologically derived datasets.
Buschbacher et al. [40] focused on monitoring insect populations for ecological health, putting forth species identification as one of the problems in this domain. DeepABIS, an advancement of the Automated Bee Identification System (ABIS), was introduced for mobile field investigations and species identification of live bees. The system uses CNNs, in particular B-CNN, MobileNet V2, and Inception-ResNet. To artificially enlarge the training dataset, data augmentation techniques, including rotation, zoom, shear, vertical flip (flipUD), and pepper noise, were adopted in conjunction with transfer learning. The loss function was adapted by integrating class-level weights, calculated from the sample prevalence, to address class imbalance in the bee wing datasets. Notably, while Inception-ResNet initially showed better performance with an accuracy of 88%, confirming the results of the previous study [43], MobileNet V2 surpassed the other models with an accuracy of 93.95% when enhanced with transfer learning, albeit only slightly better than the 93.16% reported for Inception-ResNet with transfer learning. The use of class weighting was also shown to increase accuracy, with a 1.06% increase in the case of MobileNet V2. These approaches underscore the potential of fine-tuned DL models, as well as class weighting, in ecological species identification tasks.
In bee species determination using wing images, DL models have shown superior performance over shallow learning approaches, largely attributed to the capacity of DL to manage complex patterns. Among the various CNNs tested, Inception-ResNetV2 consistently demonstrated robust performance across the three studies. Nonetheless, a range of other CNN architectures also yielded commendable results, underscoring the effectiveness of DL techniques in accurately identifying bee species from wing imagery.
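The class weighting mentioned for DeepABIS [40] can be sketched as an inverse-prevalence scheme. The exact formula used in that study is not given here, so the weight N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the size of class c, is an assumption, as are the function names and toy labels.

```python
import math
from collections import Counter

def class_weights(labels):
    """Inverse-prevalence weights N / (K * n_c): rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_cross_entropy(probabilities, labels, weights):
    """Weighted mean of the negative log-probability of each true class."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probabilities, labels))
    return total / sum(weights[y] for y in labels)

# Hypothetical imbalanced wing dataset: 8 images of species A, 2 of species B.
labels = ["A"] * 8 + ["B"] * 2
weights = class_weights(labels)

# A model that always favors the majority class.
predictions = [{"A": 0.9, "B": 0.1}] * len(labels)
loss = weighted_cross_entropy(predictions, labels, weights)
```

With these weights both species contribute equally to the loss, so a model that simply favors the majority class is penalized more heavily than under an unweighted loss.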
Eight publications employed full-body bee images alongside DL techniques for species determination, expanding beyond wing-only imagery. Karthiga et al. [34] employed a 2-layer convolutional neural network (CNN) to classify honeybee species using over 5000 images, aiming to protect and preserve various species by enabling early disease detection. To combat dataset imbalance and enhance the model's training, the study incorporated data augmentation techniques such as image zooming, flipping, rotating, and shrinking, alongside the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic samples based on KNN. The approach achieved 86% accuracy in species determination, underscoring the potential of automated species determination in conserving bee populations and health.
Zhang et al. [38] presented a concatenated approach for classifying bees, insects, wasps, and others using images. It employs Principal Component Analysis (PCA) for the dimensionality reduction of features fed into four pre-trained CNNs: VGG, ResNet, XceptionNet, and EfficientNet, thereby leveraging transfer learning and the selective reduction of dimension. The outputs were then combined and fed into fully connected layers for the final classification. This technique, which highlights the effectiveness of PCA in reducing feature dimensions, achieved an overall accuracy of 96.39%, outperforming each of the individual CNNs except EfficientNet. Additionally, the concatenated approach surpassed even the EfficientNet model in classifying the others category, showcasing the potential of combining PCA with deep learning for image-based species identification.
Four CNN models, ResNet, Wide ResNet, InceptionV3, and MnasNet, were examined for the species determination of 36 North American bumblebee species using over 89,000 images [16]. InceptionV3 was selected for its optimal balance of accuracy (91.6%) and speed, despite Wide ResNet's slightly higher accuracy, for the subsequent development of BeeMachine, a simple web application that allows users to identify species using their own images. The study found that classification accuracy was influenced by each species' variability in appearance, the volume of training images, and image quality. Incorporating geographic data was also suggested as a way to enhance model accuracy, acknowledging the variation in the appearance of bumblebees across different regions. This insight emphasizes the complex factors affecting automated identification and suggests integrating spatial data as a potential strategy to improve the precision of models such as InceptionV3 and Wide ResNet in ecological applications like BeeMachine.
BeeNet, a deep learning model designed for bee surveillance, aims to enhance the feature representation of economically significant bee species and identify objects crucial for bee health monitoring, such as parasites or pollen [37]. The model utilizes a variant of the transformer encoder-decoder architecture, incorporating image features extracted from ResNet50. It extracts features from RGB images using CNN layers, projects them into linear embeddings to convert them into a sequence, adds position embeddings, and feeds the result to the transformer encoder. Finally, the output of the transformer encoder is fed into a fully connected layer to yield the results. The model is trained both to identify bees and to detect varroa mites and pollen on them, which is efficient because the same model makes two different decisions. It achieved 92.45% accuracy in bee species determination and up to 99.18% accuracy in detecting fine-grained objects such as varroa pests and pollen on bees. BeeNet surpasses other models, including ResNest, EfficientNet, and Vision Transformer, in both bee surveillance and health monitoring tasks.
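The SMOTE step mentioned for the study of Karthiga et al. [34] can be sketched as follows. Each synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its k nearest minority-class neighbours. The feature vectors, the value of `k`, and the function name are assumptions; in practice the technique is applied to feature representations rather than to the toy 2D points used here.

```python
import math
import random

def smote(minority, n_new, k=2, rng=random):
    """Create n_new synthetic samples by interpolating towards near neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority samples to the chosen base sample.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

# Hypothetical feature vectors of an under-represented bee species.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]]
new_samples = smote(minority, n_new=5, k=2, rng=random.Random(42))
```

Because every synthetic point lies between two real minority samples, the minority class is enlarged without simply duplicating existing examples.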
Within the eight publications using images for bee species determination, four publications specifically employed object detection methods. These methods prioritize identifying the precise location of bees within images by delineating their bounding boxes, subsequently classifying their species. This approach, which integrates object detection as a preliminary step before species determination, enhances the accuracy and efficiency of species identification by focusing on relevant features within designated areas.
Nizam et al. [36] developed a visual-based expert system to aid beekeeping tasks, focusing on the image segmentation of Meliponine bees, a tribe of stingless bees popular in Malaysian beekeeping. The challenge is the accurate detection and identification of these small bees in natural surroundings. The paper employs Faster R-CNN, a deep learning framework composed of the Region Proposal Network (RPN) and Fast Region-based CNN (Fast R-CNN), to segment Meliponine bees from background imagery. With a dataset of 400 image frames from a local Meliponine farm, the model achieves promising results with a segmentation accuracy of around 74%. This work is a step towards creating a visual-based expert system for bee species identification.
Another study also leveraged Faster R-CNN, but with ResNet101+FPN, for bee species determination, particularly focusing on small object detection [32]. It tackles the challenge of identifying bees within crowdsourced images from the BeeSpotter database, characterized by varied image qualities and the presence of multiple bees, specifically targeting A. mellifera, Bombus griseocollis DeGeer, 1773, and Bombus impatiens Cresson, 1863. The Feature Pyramid Network (FPN) was utilized within the Faster R-CNN framework to enhance the model's ability to detect bees at various scales by creating a multi-scale feature pyramid from a single input image. The approach enhances identification accuracy by isolating bees using object detection, significantly reducing the need for manual image annotation. Initially trained with hand-annotated images and later augmented with machine-labeled data, this methodology successfully classifies two genera and related bee species, achieving a classification accuracy of 91%.
Hu et al. [42] aimed to address the challenge of accurately detecting and classifying common species, like Chinese bees, wasps, and cockroaches, at beehive gates in natural environments. To solve multi-target and multi-scale problems, the authors proposed an improved RetinaNet target detection network, DY-RetinaNet. This model integrates a symmetric-structure Bidirectional Feature Pyramid Network (BiFPN) layer in place of the Feature Pyramid Network (FPN) for multi-scale feature fusion and employs Complete Intersection Over Union (CIOU) loss for precise small target localization. Additionally, a dynamic head framework was added to enhance multi-scale recognition in multi-target scenarios. Experimental results demonstrated the superior performance of DY-RetinaNet, with a mean average precision (mAP) value of 97.38% when using ResNet-101-BiFPN as the backbone network, compared to other algorithms, including SSD, YOLOV3, Faster R-CNN, Mask R-CNN, FCOS and ExtremeNet, indicating a significant accuracy improvement over the initial RetinaNet model.
Some of the authors of [42] also utilized the Mini-EfficientDet neural network, optimized through transfer learning, to identify common species at Chinese beehive gates in Fujian Province, including bees, wasps, and cockroaches [33]. The Mini-EfficientDet model compresses the original EfficientDet by reducing its multiple stacked BiFPN layers to one layer and introduces a category imbalance feature to enhance small target recognition without sacrificing accuracy. These approaches also reduce memory usage. Additionally, the dataset was enhanced by adjusting brightness, applying motion blur based on point scatter principles, and incorporating white Gaussian noise. Evaluations conducted on the MSCOCO2017 dataset and a custom dataset tailored for beehive species demonstrated the effectiveness of the model, with detection accuracies reaching 98.66% for Chinese bees, 83.71% for cockroaches, and 82.06% for wasps. These results establish a solid foundation for the future development of species invasion alert systems.
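The Complete Intersection Over Union (CIOU) loss mentioned for DY-RetinaNet [42] can be illustrated with a short sketch. It follows the commonly published CIoU formulation (one minus IoU, plus a normalized centre-distance term, plus an aspect-ratio term); the box format `(x1, y1, x2, y2)` and the function name `ciou_loss` are assumptions made here for illustration, and the reviewed study may differ in implementation details.

```python
import math


def ciou_loss(box_a, box_b):
    """CIoU loss between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Intersection over union
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = wa * ha + wb * hb - inter
    iou = inter / union

    # Squared centre distance, normalized by the squared diagonal
    # of the smallest box enclosing both boxes
    rho2 = ((ax1 + ax2 - bx1 - bx2) / 2) ** 2 + ((ay1 + ay2 - by1 - by2) / 2) ** 2
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2

    # Aspect-ratio consistency term
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v


perfect = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))    # identical boxes
disjoint = ciou_loss((0, 0, 1, 1), (2, 2, 3, 3))   # no overlap at all
```

Unlike plain IoU, the loss still changes with the distance between box centres when the boxes do not overlap, which is one reason it is considered helpful for localizing small targets.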

Publications Utilizing Both Shallow Learning (SL) and Deep Learning (DL) for Species Determination
Some studies have combined shallow learning (SL) and deep learning (DL) algorithms to harness their respective strengths. Such models pair the representational power of DL with the straightforwardness of SL, aiming to achieve superior predictive performance. Some implementations fed the output from one algorithm type into the other, while others ran both algorithms in parallel on varied data types, later aggregating their results for the final decision. This strategic integration of SL and DL seeks to optimize accuracy and efficiency in ML-based bee identification.
Zhong et al. [39] aimed to detect and quantify flying insects on yellow sticky traps using a Sony IMX219 8-megapixel sensor. Images were downsized from 3280 × 2464 to 448 × 448 pixels to reduce the computational load. The YOLO model, pre-trained on ImageNet and further trained on a dataset of 12,000 labeled images, was used to identify insects. An SVM model, trained with 7000 augmented images, classified insects into seven categories. Local features were extracted using the Histogram of Oriented Gradients (HOG), while global features (shape, texture, and color) were also analyzed. The model's reliability was ensured with 5-fold cross-validation, and the system was deployed on a Raspberry Pi. Results showed that global features alone provided higher accuracy than combined features, and the YOLO-SVM combination achieved 90.18% accuracy, outperforming other methods. This approach effectively monitors pest density in agriculture.
Nasir et al. [35] proposed a novel approach using 3D trajectories to classify bees and non-bees. The study aimed to accurately identify two Vespa species and honeybees by analyzing 456,287 infrared images and 14,565 3D trajectories of A. mellifera, V. velutina, and Vespa orientalis Fabricius, 1793. Three models were trained: a shallow learning model using 3D trajectories from depth maps and RGB images, and two deep learning CNN architectures (Xception and GoogLeNet) trained on low-resolution IR images. Key features such as time of flight, roaming factor, velocity, and acceleration were extracted and processed using the ensemble bagged trees algorithm. The final decision was based on aggregating predictions from all three models, with class-specific weights assigned based on F-scores. The highest accuracy of 97.1% was achieved by selecting 11 IR images per trajectory for model training, providing a robust method for identifying Vespa species near beehives.
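The aggregation step described for Nasir et al. [35] can be sketched as a weighted vote in which each model's class probability is scaled by that model's F-score for the class. The exact weighting rule used in the study is not given here, so the product-and-sum rule below, the model names, and all numbers are hypothetical and serve only to illustrate the idea.

```python
def aggregate(predictions, f_scores):
    """Combine per-model class probabilities using class-specific weights.

    predictions[model][cls] is the probability a model assigns to a class;
    f_scores[model][cls] is that model's F-score for the class (its weight).
    """
    classes = list(next(iter(predictions.values())))
    combined = {
        cls: sum(f_scores[m][cls] * predictions[m][cls] for m in predictions)
        for cls in classes
    }
    return max(combined, key=combined.get), combined


# Hypothetical outputs of one SL model and two DL models for a single trajectory
predictions = {
    "bagged_trees": {"A. mellifera": 0.5, "V. velutina": 0.3, "V. orientalis": 0.2},
    "xception": {"A. mellifera": 0.2, "V. velutina": 0.7, "V. orientalis": 0.1},
    "googlenet": {"A. mellifera": 0.3, "V. velutina": 0.6, "V. orientalis": 0.1},
}
# Hypothetical validation F-scores per model and class
f_scores = {
    "bagged_trees": {"A. mellifera": 0.90, "V. velutina": 0.80, "V. orientalis": 0.85},
    "xception": {"A. mellifera": 0.95, "V. velutina": 0.97, "V. orientalis": 0.90},
    "googlenet": {"A. mellifera": 0.93, "V. velutina": 0.94, "V. orientalis": 0.88},
}
label, scores = aggregate(predictions, f_scores)
```

Weighting by class-specific F-scores lets a model that is reliable for one species but weak for another contribute accordingly, instead of giving every model the same say for every class.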
Rodrigues et al. [45] aimed to classify honeybee subspecies using geometric morphometrics of their right forewings. They used images of right forewings annotated with 19 key landmarks captured via a stereomicroscope and digital camera. Techniques included generating masks from landmarks to isolate the wing, enhancing features with Contrast Limited Adaptive Histogram Equalization (CLAHE) and Gaussian filters, and using the SSD MobileNet v1 FPN coco model for wing segmentation. A U-Net model was trained on grayscale images for accurate landmark detection, followed by PCA and Procrustes normalization to reduce geometric variations. An SVM model classified the subspecies, achieving an 86.6% accuracy. The SSD MobileNet v1 FPN coco model had a mean average precision of 0.975, and U-Net showed high landmark detection precision at 0.943. A Flask-based web service was developed to provide access to the model, offering an effective framework for honeybee subspecies determination based on wing morphology.
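The Procrustes normalization mentioned above removes differences in position, size, and orientation so that only shape variation remains. The following is a minimal sketch of ordinary Procrustes alignment of one 2D landmark configuration to a single reference; the function name `procrustes_align` and the four-point toy configuration are assumptions for illustration, and the study itself works with 19 wing landmarks and may use a generalized (multi-specimen) variant.

```python
import math


def procrustes_align(shape, reference):
    """Align 2D landmarks to a reference by removing translation,
    scale, and rotation. Returns (aligned_shape, normalized_reference)."""

    def normalise(points):
        n = len(points)
        cx = sum(x for x, _ in points) / n
        cy = sum(y for _, y in points) / n
        centred = [(x - cx, y - cy) for x, y in points]
        size = math.sqrt(sum(x * x + y * y for x, y in centred))  # centroid size
        return [(x / size, y / size) for x, y in centred]

    s, r = normalise(shape), normalise(reference)
    # Closed-form optimal rotation angle for the 2D least-squares fit
    num = sum(sx * ry - sy * rx for (sx, sy), (rx, ry) in zip(s, r))
    den = sum(sx * rx + sy * ry for (sx, sy), (rx, ry) in zip(s, r))
    theta = math.atan2(num, den)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    aligned = [(cos_t * x - sin_t * y, sin_t * x + cos_t * y) for x, y in s]
    return aligned, r


reference = [(0, 0), (2, 0), (2, 1), (0, 1)]
# The same shape rotated by 90 degrees, scaled by 3, and shifted by (5, 7)
shape = [(5, 7), (5, 13), (2, 13), (2, 7)]
aligned, ref_norm = procrustes_align(shape, reference)
```

After alignment the two configurations coincide, so any residual difference between real specimens reflects shape rather than how the wing was placed or magnified under the microscope.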
García et al. [41] evaluated the DeepWings model (the result of earlier research by Rodrigues et al. [45]) using a dataset of 14,816 right forewing images from M-lineage (Apis mellifera iberiensis Engel, 1999 and Apis mellifera mellifera Linnaeus, 1758) and C-lineage (Apis mellifera carnica Pollmann, 1879) subspecies, collected from 2601 colonies across 15 countries. The study aimed to measure the model's performance in identifying colonies, compare these identifications with the distribution of endemic subspecies, and assess the match with molecular markers. Images from 26 subspecies were collected, but the focus was on five: Iberian, Dark, Carniolan, Italian, and Caucasian honeybees. DeepWings classified the images, rejecting some due to low resolution or noise. The model's classifications closely matched the endemic M and C lineages, with 71.4% and 97.6% accuracy, respectively. At the subspecies level, accuracy was 89.7% for Iberian, 41.1% for Dark, and 88.3% for Carniolan honeybees. However, A. carnica and Apis caucasica Gorbachev, 1916 could not be identified at the colony level. The agreement between DeepWings and molecular analyses was weaker in cases of genetic pollution.
Klasen et al. [52] developed an automated species identification system using data augmentation techniques to tackle limited training data. Their dataset included images of butterflies (Parides), bee wing venation (Osmia), and scarab beetle genitalia. The model training involved rotating images, applying a Style-Generative Adversarial Network (Style-GAN) for data augmentation, using a pre-trained VGG-16 network with max pooling for feature generation, applying PCA to reduce overfitting, and using SMOTE for class imbalance. These processed features were then fed into an SVM model. This approach improved accuracy compared to using the original VGG-16 model alone, achieving 80.07% for Pleophylla, 85.16% for Schizonycha, 85.90% for Osmia, and 98.06% for Parides, demonstrating its effectiveness in enhancing species identification with limited data.

Evaluation Metrics Utilized to Measure Performance
An evaluation of machine learning models is essential for ensuring their functionality, robustness, and comparative effectiveness against other solutions. It highlights a model's performance clearly, addressing its strengths and weaknesses. However, this task is challenging due to the complex decision-making processes of models, variability in training data, and differing objectives across projects. Consequently, multiple evaluation metrics are employed to overcome these hurdles and provide a nuanced assessment of a model's performance, catering to the specific demands of each application [53].
Accuracy is the most utilized performance metric, adopted in twenty-three publications, as shown in Table 3. The metric signifies the proportion of correct predictions relative to the total sample, serving as a fundamental gauge of model performance. A notable adaptation for multi-class scenarios is Top-N accuracy. This variation considers the N most probable classes predicted by the model for each instance; if the actual class is among these top N predictions, the prediction is deemed correct. Top-N accuracy offers a more flexible yet insightful assessment of performance, broadening the evaluation scope to accommodate the complexities of multi-class classification [40]. On a practical level, this measure proves valuable when the candidate bee species can be narrowed down through other means, such as expert knowledge, specific geographic contexts, or additional technological methods.
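As a small illustration of the Top-N accuracy definition above, the sketch below counts a prediction as correct when the true species is among the N highest-scoring classes. The function name `top_n_accuracy`, the three species, and the scores are hypothetical values chosen only to show how the metric behaves.

```python
def top_n_accuracy(scores, labels, n=1):
    """Fraction of samples whose true label is among the n top-scored classes."""
    hits = 0
    for class_scores, true_label in zip(scores, labels):
        ranked = sorted(class_scores, key=class_scores.get, reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical classifier scores for four images
scores = [
    {"B. impatiens": 0.6, "B. griseocollis": 0.3, "A. mellifera": 0.1},
    {"B. impatiens": 0.5, "B. griseocollis": 0.4, "A. mellifera": 0.1},
    {"B. impatiens": 0.2, "B. griseocollis": 0.5, "A. mellifera": 0.3},
    {"B. impatiens": 0.1, "B. griseocollis": 0.2, "A. mellifera": 0.7},
]
labels = ["B. impatiens", "B. griseocollis", "B. impatiens", "A. mellifera"]

top1 = top_n_accuracy(scores, labels, n=1)
top2 = top_n_accuracy(scores, labels, n=2)
```

With N = 1 the measure reduces to ordinary accuracy, and it can only stay the same or increase as N grows.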
Precision, recall (sensitivity or true positive rate), and specificity were used in nine, eight, and one publications, respectively, and offer detailed views of a model's performance beyond what accuracy alone can provide. Zhang et al. [38] evaluated models with these metrics and found that although ResNet50 demonstrated a slightly higher accuracy (0.9468) compared to VGG's accuracy (0.9449), VGG exhibited a superior precision of 0.9448 as compared to ResNet50's precision of 0.9426. This makes VGG a better choice when the accurate identification of the positive class is critical, despite its marginally lower overall accuracy compared to ResNet50. Indeed, precision is commonly used when the consequences of falsely identifying negatives as positives (false positives) are severe. For instance, identifying invasive bee species accurately is crucial to prevent them from being mistakenly classified as harmless native species, which could lead to inappropriate management decisions. In such cases, high precision ensures that measures to control or eradicate invasive species are only applied to the correct targets, thereby protecting the native bee populations and maintaining ecological balance. On the other hand, recall is crucial when it is important to identify every possible bee of a specific bee species, particularly in situations where missing the classification of a single bee carries significant consequences. For instance, recall becomes important when identifying species known to produce toxic honey, where no such species should be missed in the identification process.
The receiver operating characteristic (ROC) curve plots true positive rates on the y-axis and false positive rates on the x-axis across different thresholds, enabling researchers to visualize performance characteristics, and the area under this curve (AUC) summarizes them in a single value. However, the AUC may not be the ideal metric for imbalanced datasets due to its inherent limitations. To address this, average precision (AP) emerges as a valuable alternative, calculated from the area under the precision-recall curve. The precision-recall curve, depicting precision on the y-axis and recall on the x-axis for varying thresholds, offers a nuanced evaluation particularly suited for highly imbalanced data scenarios. AP's prominence lies in its utility for measuring the performance of object detection models, where imbalances in dataset distribution are prevalent and accurate identification of positive instances is paramount [33,42]. In addition, a confusion matrix is commonly used to show the class-wise performance in more detail.
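The metrics discussed above can be computed directly from confusion counts and ranked scores. The sketch below is a minimal illustration for a binary "target species versus rest" setting; the labels, predictions, and scores are hypothetical, and the average precision shown is the simple non-interpolated form (object detection benchmarks often use interpolated variants instead).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and specificity from confusion counts."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp),    # how many flagged bees are truly the target
        "recall": tp / (tp + fn),       # how many target bees were found
        "specificity": tn / (tn + fp),  # how many non-targets were left alone
    }


def average_precision(y_true, scores):
    """Non-interpolated AP: mean of the precision values obtained at the
    rank of each positive sample when samples are sorted by score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    tp = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            total += tp / rank
    return total / sum(y_true)


# Hypothetical ground truth (1 = target species), hard predictions, and scores
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.5, 0.2]

metrics = binary_metrics(y_true, y_pred)
ap = average_precision(y_true, y_score)
```

In this toy case accuracy (0.625) hides the fact that precision (0.6) and specificity (0.5) are noticeably lower than recall (0.75), which is the kind of detail these complementary metrics are meant to expose.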
All of the aforementioned metrics have been used to measure the performance of models in species identification tasks. In an adaptation of the precision measurement, Rodrigues et al. [45] utilized positional precision to gauge the ability of the model to identify landmarks, with the accuracy of the landmark positions evaluated by comparing the landmark configurations extracted from the U-Net output with those marked manually. Subsequently, the Euclidean distances between pairs of landmarks were summed and normalized. A maximum precision score of one was attained when the total distance equaled zero.
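One possible reading of the positional precision described above is sketched below. The text only states that the summed Euclidean distances are normalized and that a total distance of zero gives a score of one; the specific normalization used here (dividing by the number of landmarks times a maximum tolerated distance, then subtracting from one) is an assumption made for illustration and is not taken from Rodrigues et al. [45]. The name `positional_precision` and the parameter `max_distance` are likewise hypothetical.

```python
import math


def positional_precision(predicted, manual, max_distance):
    """Score in [0, 1]: 1 when predicted landmarks coincide with manual ones.

    max_distance is an assumed per-landmark normalizer (for example, the
    image diagonal in pixels); it is not specified in the reviewed study.
    """
    total = sum(math.dist(p, m) for p, m in zip(predicted, manual))
    normalised = total / (len(manual) * max_distance)
    return max(0.0, 1.0 - normalised)


manual = [(10.0, 20.0), (40.0, 60.0)]
exact = positional_precision(manual, manual, max_distance=50.0)
# Each predicted landmark is displaced by (3, 4), i.e. a distance of 5 pixels
shifted = positional_precision([(13.0, 24.0), (43.0, 64.0)], manual, max_distance=50.0)
```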
While performance metrics gauge the efficacy of models, they do not illuminate the specific contributions of the different components of the model towards its determination task. An ablation study has been utilized to assess the impact of specific modifications on machine learning (ML) architectures. Drawing inspiration from experimental neuropsychology, ablation studies in ML involve systematically removing individual components or modules from a model to evaluate the performance impact of each component. By measuring how the performance changes without these components, researchers can discern their importance and contribution to the model's overall effectiveness [33,42]. Moreover, to demystify the 'black box' nature of deep learning, explainable AI techniques like Grad-CAM have been employed [38]. These techniques clarify the influence of specific image pixels on decision-making processes, enhancing our understanding of how models classify data.

Discussion
Out of 281 publications reviewed from WoS and Scopus, 26 publications were selected, with the first one, dating to 2001, employing shallow learning techniques. Over time, both shallow and deep learning algorithms have been increasingly utilized for bee identification and species determination tasks, with a notable rise in the use of DL. This trend is reflected in the growing number of publications and citations annually in this field. Interestingly, keywords related to SL, DL, and computer vision, which were not part of the search terms, have emerged, highlighting the significant role of computer vision in bee identification research. Brazil stands out for its high number of collaborative studies, while Portugal is recognized for collaborating with the largest number of countries. Notably, CEDRI has conducted significant collaborative research in bee identification. These findings underscore the increasing scholarly interest in this area.
Recent advancements in data collection methods have minimized their impacts on bees, leading to an increase in image-based datasets compared to other types. Many of these image datasets have now been made publicly available, likely contributing to the surge in image-classification research related to bees. Most research efforts have focused on honeybees, with relatively few studies addressing stingless bees, despite the increasing popularity of stingless bee products. While the diversity of bee species is globally recognized, the data supporting this diversity are often limited to a few countries. To address this data scarcity, innovative crowdsourcing methods have been introduced, enabling the gathering of data worldwide. This approach not only expands the dataset but also increases its diversity, both of which are essential for developing robust machine learning models. Moreover, with access to publicly available datasets, researchers can focus more on refining machine learning models rather than on the arduous task of data collection. Admittedly, however, datasets are still limited, both in terms of volume and the diversity of species.
The image-based data type has emerged as the predominant method in bee research studies, with a lesser focus on acoustic and movement data. SL techniques, particularly using images, have been widely utilized for analyzing bee wings. This method capitalizes on the straightforwardness of extracting features and distinguishing the venation patterns on wings. Most of these studies leverage landmarks that are easily identifiable through image processing techniques and accessible even to non-experts. Various research efforts have employed landmarks, alongside color and SIFT features, utilizing SL models like SVM, MLP, and KNN for species determination tasks. The development of CNN models has advanced bee identification further, shifting from wing-only images to full-body imaging. This progression not only lessens the impact on bees during data collection but also enables these deep learning-based models to classify an impressive range of up to 129 species with remarkable accuracy. Combinations of SL and DL models have also been employed in both sequential and parallel manners. In the sequential approach, a specific part of the identification process is managed by one model, with its findings utilized by another, whereas, in the parallel approach, models are fed the same or different datasets, with the outcomes eventually combined. This strategic application of SL and DL models enhances the precision and efficiency of bee species identification processes.
Three studies focus on utilizing flying and sonication sounds for bee species identification. Among them, two employ shallow learning techniques, utilizing MFCC features and SVM classification. The third study utilizes Log Mel-spectrogram features with a pre-trained EfficientNet V2 model. Despite this more advanced approach, its performance is notably inferior to that of the other two studies.
Movement patterns of bees serve as a distinguishing feature for their identification and species determination, leading to the utilization of this criterion in three publications. These studies exclusively focus on analyzing the movement of bee wings or the bees themselves. Employing shallow learning (SL) algorithms such as MLP and Random Forest (RF), these models achieve higher performance compared to sound-based models. However, despite their effectiveness, they are limited in their scope, classifying fewer bee species. This highlights the potential for further research to enhance accuracy and expand the coverage of movement-based bee identification models.
While computer vision techniques have been extensively employed in stationary bee image classifications, there is a notable scarcity of studies proposing computer vision-based approaches to flying bees. Among these, a single study stands out for its innovative methodology, utilizing 3D trajectories classified by SL algorithms and IR images classified by DL models, with the integration of SL and DL techniques allowing for a comprehensive analysis of bee behavior and morphology. The final results were obtained through the aggregation of outcomes from both models.
The high performance of the image-based bee identification model underscores its potential for development into mobile or web-based applications that allow stakeholders, including novice researchers, breeders, and conservationists, to identify bee species from uploaded images. This technology not only democratizes access to expert-level species identification but also enhances the accuracy and efficiency of monitoring bee populations and their health.
The advancement of the multi-evidence-based model, which now includes capabilities for identifying bees in flight, offers significant benefits for beekeepers. It enables real-time monitoring of bee activities and the identification of potential threats such as invasive species or pathogens. This is particularly advantageous for proactive bee conservation efforts and detailed ecological research, as it allows for immediate intervention and informed decision-making. Furthermore, enhancements in the acoustic-based system contribute substantially to our understanding of bee pollination patterns. By accurately capturing and analyzing the sounds associated with bees pollinating specific flowers, this system can provide invaluable data for botanists and agricultural professionals. The insights gained can lead to improved crop management strategies and contribute to broader ecological health by ensuring effective pollination, which is crucial for the survival of many plant species.
The ongoing development of species identification methods using diverse approaches (visual, acoustic, and behavioral) promises to expand the toolkit available for various stakeholders. These tools are not only pivotal for research and conservation but also have practical applications in agriculture and bio-surveillance. The integration of these technologies into user-friendly platforms can significantly enhance engagement and participation in bee conservation initiatives, contributing to sustainable practices and biodiversity preservation. In light of these developments, future research should focus on refining these models, expanding their applicability under diverse environmental conditions, and enhancing their integration with existing digital agriculture and conservation tools. This will further solidify the role of advanced machine learning techniques in environmental science and ecological monitoring.
Recent advancements in bee identification methods have significantly reduced the need for human intervention, marking a pivotal shift toward automation. The evolution of computational technologies has empowered the integration of DL techniques for bee species determination, while the adoption of innovative approaches such as the Vision Transformer method signifies ongoing updates in ML technologies for bee identification. Image-based approaches have shown higher bee species determination accuracies, benefiting from pretrained models, ample data availability, and the progress of computer vision. Nonetheless, challenges persist in achieving real-time identification and addressing the limited research focusing on flying bees, indicating areas for future exploration and improvement in bee identification and species determination methodologies.

Conclusions
This review underscores the growing scholarly interest in bee species identification, as evidenced by the increasing number of publications and citations. Researchers have employed a variety of machine learning techniques, including SL, DL, and hybrid approaches, to delve into different facets of bee species determination. These range from analyzing images of bees and their wings to studying various acoustic signals and movement patterns. Notably, research on stingless bees remains limited compared to that on honeybees, primarily due to the scarcity of labeled data, with image-based approaches predominating. Additionally, innovative methods such as the use of 3D trajectories for species determination highlight continual advancements in the field.
While the existing literature indicates promising accuracy in identifying non-flying bee species, classifying flying bees continues to pose significant challenges. Further advancements in models that integrate multiple types of evidence are necessary to overcome these challenges.
These findings provide valuable guidance for future research on bee identification, offering insights into essential keywords, author contributions, institutional affiliations, countries of origin, datasets used, and methodologies employed. Systems developed for bee species identification have the potential to significantly enhance bee counters, surveillance systems, and monitoring tools, thereby enriching our understanding of bee behavior and contributing to the field of entomology.

Supplementary Materials:
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/asi7040062/s1: Table S1. Summary of publications related to shallow learning; Table S2. Summary of publications related to deep learning; Table S3. Summary of studies related to combinations of shallow and deep learning; Figure S1. Four tribes among 19 tribes of the Apinae subfamily with an example species for each tribe. Refs. [3,12,13,16,20, are cited in Supplementary Materials.
Appl.Syst.Innov.2024, 7, x FOR PEER REVIEW 5 of 28 publications were considered for the screening stage.A flow chart of the PRISMA declaration figure showing the flow of the search is presented in Figure 1.

Figure 1. PRISMA flow diagram for the systematic identification, screening, eligibility, and inclusion of publications.

Figure 2. Yearly publications on bee species determination tasks using machine learning.

Figure 3. Annual citations of the publications.

Figure 6. The contributions of various countries.

Figure 8. Number of publications utilizing different methods for bee species determination.

Figure 9. Summary of datasets, with the size of the bubble representing the number of records in the dataset. The publication number corresponds to the publication as referenced in Tables S1-S3.

Figure 10. Venn diagram of the species categories utilized in different publications.

Figure 11. Number of publications utilizing data from different countries.

Figure 12. Sample wing of a honeybee used for the species determination task.

Table 1. Keywords used in the search.
((Bee or Apidae or Apis or Meloponine *) and (Species or Genus * or Genera or Type * or Subspecies) not (Artificial Bee Colony)) (Title, Abstract, Keywords)
Species Determination Search: (Classification or Discrimination or Identification or Taxonomy) (Title, Abstract, Keywords)

Table 2. Co-occurrence of keywords.