DeepWings©: Automatic Wing Geometric Morphometrics Classification of Honey Bee (Apis mellifera) Subspecies Using Deep Learning for Detecting Landmarks

Abstract: Honey bee classification by wing geometric morphometrics entails the first step of manual annotation of 19 landmarks in the forewing vein junctions. This is a time-consuming and error-prone endeavor, with implications for classification accuracy. Herein, we developed a software called DeepWings© that overcomes this constraint in wing geometric morphometrics classification by automatically detecting the 19 landmarks on digital images of the right forewing. We used a database containing 7634 forewing images, including 1864 analyzed by F. Ruttner in the original delineation of 26 honey bee subspecies, to tune a convolutional neural network as a wing detector, a deep learning U-Net as a landmarks segmenter, and a support vector machine as a subspecies classifier. The implemented MobileNet wing detector achieved a mAP of 0.975, and the landmarks segmenter detected the 19 landmarks with 91.8% accuracy and an average positional precision of 0.943 relative to manually annotated landmarks. The subspecies classifier, in turn, presented an average accuracy of 86.6% for 26 subspecies and 95.8% for a subset of five important subspecies. The final implementation of the system showed good speed performance, requiring only 14 s to process 10 images. DeepWings© is very user-friendly and is the first fully automated software, offered as a free Web service, for honey bee classification from wing geometric morphometrics. DeepWings© can be used for honey bee breeding, conservation, and even scientific purposes, as it provides the coordinates of the landmarks in Excel format, facilitating the work of research teams using classical identification approaches and alternative analytical tools.


Introduction
The western honey bee (Apis mellifera L.) differentiated into 31 subspecies in its native range in Eurasia and Africa [1][2][3][4]. The great majority of this variation was recognized early by the father of honey bee taxonomy, Friedrich Ruttner, who identified 26 subspecies using 36 morphological traits derived from pilosity, pigmentation, length of different body parts, and wing venation. Analysis of the 36 traits still represents the gold standard of honey bee classification and is required for scientific applications, such as the description of new subspecies [5]. However, measuring and analyzing all 36 traits is labor-intensive and involves expert knowledge, making classical morphometry unsuitable for colony identification for commercial, conservation, or even scientific purposes. To circumvent this limitation, efforts have recently been made to simplify and automate the identification of honey bee subspecies [6][7][8][9]. Yet, a tool for simple, fast, and inexpensive honey bee classification remains unavailable to the beekeeping community.

Background
Classical morphometry has been replaced by labor-effective alternatives based on the forewing shape patterns, which are typically assessed on honey bee workers (infertile females). The forewings, specifically the vein junctions, carry high-information content and are therefore of great value for the identification of honey bee subspecies [5,[10][11][12]. This feature, together with the quasi-two-dimensional structure of forewings, makes this body part well-suited for computerizing and automating honey bee classification using image analysis.
A suite of forewing traits is used, singly or in conjunction, in the identification of honey bee subspecies [5]. Among these are the cubital index, the hantel index, and the discoidal shift angle [13], which can be measured on wing images by the semi-automatic proprietary software CBeeWing (https://www.cybis.se/index.htm; accessed on 26 April 2022). This software is commonly employed by beekeepers engaged in the conservation of the endangered Apis mellifera mellifera subspecies in northern Europe [14]. However, by using a limited number of traits, CBeeWing does not take full advantage of the information content carried by the wing shape, making identification less accurate. On the other side of the spectrum is the very intensive DAWINO (Discriminant Analysis with Numerical Output) method, which requires measurements of 30 forewing characters encompassing vein angles, vein lengths, and indexes [5].
Wing shape analysis based on wing geometry, dubbed wing geometric morphometrics, offers an interesting alternative for honey bee identification [5,10]. Wing geometric morphometrics is widely used in insect taxonomy in general and has proven particularly useful for identifying bee species [15][16][17][18][19] and honey bee lineages and subspecies for a wide range of purposes [6,8,11,12,[20][21][22][23][24][25]. Using this method, wing shape variations are captured by 19 data points, known as landmarks, acquired from the vein junctions annotated in images (Figure 1a,b). Given that the locations of the 19 landmarks are subspecies-specific, deviations in positional coordinates can be used in honey bee identification (Figure 1c,d). Most of the 19 landmarks are also employed for calculating all vein lengths and angles by both the classical morphometry and the DAWINO methods [5]. To the best of our knowledge, the only system publicly available for honey bee identification by wing geometric morphometrics is implemented by the software IdentiFly [26].
The problem is that IdentiFly is a semi-automatic software requiring several steps for wing classification, often including manual correction of landmark annotations, making it very difficult for routine use by queen breeders or beekeepers. The most recent advance in honey bee classification comes from the application of artificial intelligence through machine learning techniques [9]. De Nart and colleagues [9] based their classification method on the entire wing and used convolutional neural networks (CNNs) to develop an end-to-end solution for processing the images. Unfortunately, this new tool is limited to the classification of only seven honey bee subspecies, and it was not made available for use by the scientific or beekeeping community.
In a global world, maintaining the genetic integrity of native honey bee subspecies is becoming increasingly demanding. In this context, it is important that queen breeders engaged in honey bee conservation programs have tools at their disposal for colony identification. While molecular tools offer the most accurate solution for subspecies identification [27,28], they are not affordable for most queen breeders, in which case morphometry becomes the only option. Here, we filled a gap in the geometric morphometrics identification of honey bees by developing a fully automated software that is easy to use and is freely available as a Web service. To that end, we implemented machine learning techniques for (i) detecting wings using a CNN, (ii) segmenting landmarks using a U-Net, and (iii) classifying subspecies using a support vector machine (SVM). Using this multi-step approach, we addressed three main questions: (i) Can the CNN-based wing detector handle images with multiple wings of varying orientations, ensuring the uniformity of the wing shape pattern at the entrance of the landmarks segmenter? (ii) Is the U-Net capable of extracting the taxonomically informative traits from the wing venation with a high level of precision, as required for accurate subspecies classification? (iii) How efficient is the SVM at generalizing the subtle differences between closely related honey bee subspecies and hence ensuring accurate classification?

Modelling of the Solution
Figure 2 depicts in four frames the modelling employed in this work to achieve the final solution. The top-left frame shows the organization of the datasets used to train the three machine learning models: the wing detector, the landmarks segmenter, and the wing classifier. These models were trained using the configurations displayed in the remaining frames and were fed with the data originating from the top-left frame. The bottom-left frame shows the organization of the wing detector, in which the wing images were the inputs and the positions of the bounding boxes (surrounding the manually annotated landmarks) were the targets for the CNN learning models. The bottom-right frame shows the organization of the landmarks segmenter, in which the U-Net inputs were loaded with the wing images and the U-Net outputs were loaded with the corresponding landmark images (masks). Finally, the top-right frame shows the intervening elements of the subspecies classifier.
At this stage of the training, the previously annotated landmarks (bottom-right frame) were processed (PCA-sorting-Procrustes) to ensure their geometric stabilization before entering the SVM classifier. One-hot encoding was used for the SVM outputs in order to represent the wings of the different subspecies.
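The Procrustes stabilization applied before the SVM can be illustrated with a minimal ordinary Procrustes alignment of one landmark configuration to a reference shape. This is a sketch under stated assumptions: `procrustes_align` and its arguments are illustrative names, not the authors' implementation, which combines PCA rotation and landmark sorting with the Procrustes step.

```python
import numpy as np

def procrustes_align(landmarks, reference):
    """Ordinary Procrustes alignment of a 19x2 landmark set to a reference.

    Removes translation, scale, and rotation so that only shape
    differences remain before classification.
    """
    # Center both configurations on their centroids
    X = landmarks - landmarks.mean(axis=0)
    Y = reference - reference.mean(axis=0)
    # Normalize both to unit centroid size
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    # Optimal rotation from the SVD of the cross-covariance matrix
    U, _, Vt = np.linalg.svd(X.T @ Y)
    R = U @ Vt
    return X @ R  # aligned shape, comparable across wings
```

After alignment, the 19 (x, y) pairs can be flattened into a 38-dimensional feature vector for the SVM.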

Image Datasets
Two sets of images were used to train the system developed in this study. Dataset 1 comprised 5770 images of the right forewing of honey bee workers. The wings were photographed from mounts on microscope slides (~10 wings per slide) using a stereomicroscope attached to a digital camera with a resolution of 1000 pixels per centimeter. These images were manually annotated for the 19 landmarks by a single operator, following the positional order portrayed in Figure 1a. Of the 5770 images, 3518 were collected from Iberian colonies of Apis mellifera iberiensis identified by genetic data [25,29], whereas 2252 were collected from Azorean colonies of mixed ancestry [25]. Dataset 1 was annotated for the population analysis carried out by Ferreira and colleagues [25].
Dataset 2 comprised 1864 images of the right forewing of honey bee workers obtained from the Morphometric Bee Data Bank in Oberursel, Germany, where the original specimens used by F. Ruttner [1] in his seminal taxonomic work are deposited. These images were taken from mounts on microscope slides using a stereomicroscope attached to a digital camera with a resolution of 650 pixels per centimeter. Contrary to the high quality of most images in dataset 1 (example in Figure 1a), dataset 2 contained numerous images with visual artifacts, noise, pose variations, illumination variations, and specular light reflections (examples in Figure 3), showing the difficulty of generalizing a solution for a reliable classification system. Furthermore, most images in both datasets contained more than one wing.
The two datasets were aggregated and then randomly split into three subsets: 80% for training, 10% for validation, and 10% for testing. The training subset was used for the training process, the validation subset was used to tune the parameters of the machine learning models, and the testing subset was used to assess the final functional performance of the machine learning models.
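The 80/10/10 random split described above can be sketched as follows. The function name `split_dataset` and the seed are illustrative, not the authors' partitioning code.

```python
import random

def split_dataset(items, seed=42):
    """Randomly split image records into 80/10/10 train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n = len(items)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```

For the 7634 aggregated images, this yields subsets of roughly 6107, 763, and 764 images.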

Masks
Masks of the size of the input images were created by an algorithm based on the manually marked landmarks from dataset 1, with the pixels forming the landmarks denoted in white and the background in black. These masks were the targets of the U-Net neural network [30]. Figure 4 provides examples of the output masks created for the U-Net training.
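The mask-generation step might look like the following NumPy sketch, which draws a filled white disc for each landmark on a black background. The disc radius is an assumed value; the paper enlarges each landmark to a small circle, as described later for the synthetic landmarks.

```python
import numpy as np

def make_landmark_mask(shape, landmarks, radius=4):
    """Build a U-Net target mask: white discs at landmark (x, y) positions
    on a black background of the given (height, width) shape."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w), dtype=np.uint8)
    for x, y in landmarks:
        # Set all pixels within `radius` of the landmark to white (255)
        mask[(xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2] = 255
    return mask
```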



Data Augmentation
Dataset 2 was considerably smaller than dataset 1 and comprised numerous images representing a large spectrum of visual variations (Figure 3), contrasting with the high image quality of dataset 1 (Figure 1a). To address this imbalance, dataset 2 was artificially expanded. Employing a data augmentation approach [31] enabled the construction of a large database, increasing the precision of the automatic landmarks segmenter. To accommodate the problem of image variations, specific visual features were simulated, including dust, noise, and drastic angle changes (Figure 5).
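Two of the simulated variations, rotation and additive noise, can be sketched as below. The function names and the noise level are illustrative; the actual augmentation also simulates dust, and the image itself would be warped with the same rotation (e.g., with an affine warp, not shown here) so the masks stay aligned.

```python
import numpy as np

def rotate_landmarks(landmarks, angle_deg, center):
    """Rotate the 19 landmark coordinates around the image center."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])  # 2D rotation matrix
    return (landmarks - center) @ R.T + center

def add_gaussian_noise(image, rng, sigma=5.0):
    """Additive Gaussian pixel noise, clipped back to the valid 8-bit range."""
    noisy = image.astype(np.float32) + rng.normal(0.0, sigma, image.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)
```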

Processing and Analyzing Wing Images
The processing and analysis of wing images encompassed three major stages: preprocessing, landmark extraction, and classification (Figure 6). The system must be capable of handling image variables such as wing pose, image size, and multiple wings in a single image.


Preprocessing
Preprocessing comprises several steps needed to curate the images for the subsequent detection of the 19 landmarks. These steps included the application of filters, wing detection, wing cropping, and wing size normalization (Figure 6). After opening the images, two filters were applied: a CLAHE (Contrast Limited Adaptive Histogram Equalization) filter [32], to highlight important image features, and a Gaussian filter, to remove noise. The images were resized to a static value, and a CNN-based [33] object detector, capable of perceiving the existence of each wing within an image, was used. This approach enabled image cropping in a normalized manner (Figure 7). Several object detector models were tested, including SSD MobileNet v1 FPN COCO [34], Faster R-CNN NAS [35], Faster R-CNN Inception ResNet v2 Atrous COCO [36], and YOLOv3 [37].
Big Data Cogn. Comput. 2022, 6, x FOR PEER REVIEW
Figure 6. Pipeline developed for the processing of forewing images for honey bee classification.

None of the datasets had annotated wing bounding boxes. However, it was possible to infer bounding boxes from the landmark coordinates of the training dataset (Figure 8). Prior to training the wing detector, the target landmarks were used to delineate the wing region, and the bounding boxes were then inferred from the peripheral landmarks. A spatial margin around the landmarks ensured that they would fall inside the bounding box.
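This inference can be sketched in a few lines of NumPy; the margin value is illustrative.

```python
import numpy as np

def bbox_from_landmarks(landmarks, margin=20):
    """Infer a wing bounding box from the 19 landmark coordinates.

    The box spans the peripheral landmarks plus a spatial margin (in pixels)
    so that every landmark falls safely inside it.
    """
    x_min, y_min = landmarks.min(axis=0) - margin
    x_max, y_max = landmarks.max(axis=0) + margin
    return x_min, y_min, x_max, y_max
```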

Landmark Detection
Landmark detection comprises several steps, including implementation of the landmark detector, blob detection, adjustment of the mask angle, and extraction and sorting of landmarks (Figure 6). The U-Net detector was revealed to be more precise in segmenting the 19 landmarks than a classification CNN and the classical approaches (e.g., adaptive binarization) available in the OpenCV library [38]. The U-Net architecture consists of a contracting path to capture context and a symmetric expanding path, which enables precise landmark positioning by extracting their mass centers (Figure 9). The U-Net was trained using a GTX 1080 Ti GPU and the Keras deep learning framework [39].
The images for the U-Net training were converted to greyscale, as the searched landmark patterns do not depend on color. This helped save computational memory and reduce the "curse of dimensionality" of the neural model. Because the wing images had variations in the positional angles, the cropped images tended to exhibit different sizes (Figure 8). Moreover, the U-Net input size needed to be static (400 × 400 inputs), and the width-height ratio of the wing image could not be altered, in order to avoid pattern deformations. Therefore, the wing image was stacked on a black background, with the horizontal axis of the wing image scaled to the limits of that black layer (Figure 10).
During the U-Net training, it became evident that using only one pixel to represent each landmark was not a suitable solution. Therefore, to increase the signal-to-noise ratio, the U-Net received a small circle (synthetic landmark) for each landmark, reinforcing the landmark as the target region for the U-Net. After training, the synthetic landmarks taken by the U-Net generated the regions illustrated in Figure 11. As expected, these regions are not perfect circles, contrary to the U-Net target masks shown in Figure 4.
When the U-Net segments the 19 landmarks, it does not identify each one in a differentiated manner. This is a critical issue because the classifier can only make the right decision if the order of the landmarks displayed in Figure 1b is kept during the processing runs. The first step to ensure a standardized landmark extraction was based on Principal Component Analysis (PCA) [40], which identifies the largest eigenvector, giving the angle of the landmarks' distribution. Rotating the entire mask by that angle aligns the landmarks to the horizontal axis. This procedure simplified the extraction of the landmarks in a standard order because their patterns could be scanned in a consistent pose.
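The PCA-based rotational alignment can be sketched as follows; `align_to_horizontal` is an illustrative name, not the authors' code.

```python
import numpy as np

def align_to_horizontal(points):
    """Rotate a set of landmark points so their principal axis is horizontal.

    PCA on the centered coordinates gives the dominant eigenvector; rotating
    by minus its angle aligns the landmark distribution with the x-axis.
    """
    centered = points - points.mean(axis=0)
    cov = np.cov(centered.T)                      # 2x2 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, np.argmax(eigvals)]    # largest-eigenvalue eigenvector
    angle = np.arctan2(principal[1], principal[0])
    c, s = np.cos(-angle), np.sin(-angle)
    R = np.array([[c, -s], [s, c]])               # rotation by -angle
    return centered @ R.T
```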
From the masks generated by the U-Net, it was then required to extract the mass center of the detected landmark regions. The most straightforward way to do this was to compute the center of the segmented regions using the Blob Detector implemented in the OpenCV library. The total number of expected landmarks (19) was used to validate segmentation; if that number was not confirmed, the process would fail. Following detection, all 19 landmarks were sorted according to their positional relationship (Figure 1b) to enable accurate classification. This step was accomplished by placing the 19 landmarks in a list sorted in ascending order along the x-axis. After sorting, if two x-axis values were very close to each other (within 5 pixels), the y-axis was used to resolve the order. This approach ensured the standard positional order of the 19 landmarks portrayed in Figure 1b, which does not coincide with the order entered manually (Figure 1a).
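The ordering rule above can be sketched as follows; the direction of the y-axis tie-break and the swap strategy are our assumptions, as the text only specifies the 5-pixel x-distance threshold:

```python
def sort_landmarks(centroids, x_tol=5.0):
    """Sort landmark centroids (x, y) in ascending x; when two consecutive
    x values fall within x_tol pixels of each other, break the tie on y
    (ascending here, an assumption)."""
    result = sorted(centroids, key=lambda p: p[0])
    i = 0
    while i < len(result) - 1:
        a, b = result[i], result[i + 1]
        if abs(a[0] - b[0]) <= x_tol and a[1] > b[1]:
            result[i], result[i + 1] = b, a
            i = max(i - 1, 0)  # re-check the previous pair after a swap
        else:
            i += 1
    return result
```

In the pipeline, the centroids would come from an OpenCV blob detector (e.g., `cv2.SimpleBlobDetector_create()`), and a `len(result) == 19` check would implement the validation step described above.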
The positional precision of the 19 landmarks was assessed by comparing the landmark patterns obtained from the U-Net output with the manually annotated landmarks (Figure 1). The precision of a given computed landmark was calculated inside the Procrustes space by comparing it with its corresponding ground-truth landmark. The Euclidean distances between the landmark pairs were then summed and normalized. The maximum precision of 1 was achieved when the sum of the distances was equal to zero.
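The exact normalization constant is not specified in the text; the sketch below therefore treats it as a parameter (dividing by the number of landmarks by default, an assumption) and returns 1 when the prediction coincides with the ground truth:

```python
import numpy as np

def landmark_precision(pred, truth, scale=None):
    """Precision of predicted vs. ground-truth landmarks (both (N, 2),
    assumed already Procrustes-aligned): 1 minus the normalized sum of
    pairwise Euclidean distances."""
    pred = np.asarray(pred, float)
    truth = np.asarray(truth, float)
    d = np.linalg.norm(pred - truth, axis=1).sum()
    if scale is None:
        scale = len(truth)  # hypothetical normalization constant
    return 1.0 - d / scale
```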
The information content carried by each landmark was calculated using the well-known information gain ratio criterion [41]. This criterion measures the reduction in uncertainty achieved when the data are split on a specific feature. The information gain ratio was calculated for each feature, allowing assessment of its individual impact on subspecies classification.
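The gain ratio is the information gain divided by the split information of the feature itself (Quinlan's criterion). A sketch for a discretized feature (our implementation, not the paper's):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a label sequence."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gain_ratio(feature_bins, labels):
    """Information gain of the (discretized) feature over the class labels,
    normalized by the feature's own split information."""
    n = len(labels)
    cond, split_info = 0.0, 0.0
    for v in set(feature_bins):
        idx = [i for i, f in enumerate(feature_bins) if f == v]
        w = len(idx) / n
        cond += w * entropy([labels[i] for i in idx])
        split_info -= w * np.log2(w)
    gain = entropy(labels) - cond
    return gain / split_info if split_info > 0 else 0.0
```

For the pipeline, `feature_bins` would be a discretized landmark coordinate (e.g., 13-x) and `labels` the subspecies assignments.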

Classification
The classification stage involved a Procrustes normalization (Figure 6), which renders the different landmark patterns invariant to translation, mirroring, rotation, and scale [42]. The Generalized Procrustes Analysis (GPA) computed an invariant mean of the landmark patterns generated from the training dataset. Linearly projecting a new landmark pattern into this invariant space then yields a representation in which those invariances hold.
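A simplified GPA can be sketched as follows: each shape is centered, scaled to unit norm, and rotated onto the running mean via the orthogonal Procrustes solution. This is an illustrative reimplementation under those assumptions, not the paper's code:

```python
import numpy as np

def procrustes_align(X, ref):
    """Align shape X to ref (both (N, 2)): remove translation and scale,
    then find the optimal rotation via SVD (orthogonal Procrustes)."""
    Xc = X - X.mean(axis=0)
    Rc = ref - ref.mean(axis=0)
    Xc = Xc / np.linalg.norm(Xc)
    Rc = Rc / np.linalg.norm(Rc)
    U, _, Vt = np.linalg.svd(Xc.T @ Rc)
    return Xc @ (U @ Vt)  # rotation minimizing ||Xc R - Rc||

def generalized_procrustes(shapes, iters=10):
    """Iteratively align all shapes to their running mean (simplified GPA)."""
    mean = np.asarray(shapes[0], float)
    for _ in range(iters):
        aligned = [procrustes_align(np.asarray(s, float), mean) for s in shapes]
        mean = np.mean(aligned, axis=0)
    return aligned, mean
```

After convergence, `mean` plays the role of the invariant mean pattern, and new wings are aligned to it before classification.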
Forewing classification was performed using a support vector machine (SVM) [43] implemented with the Scikit-Learn library [44]. The SVM presents a high generalization capacity and is well-suited for small datasets, as is the case of dataset 2, which was used to train the classifier. Loading the SVM with geometric features invariant to translation, rotation, and scale was critical to achieving good classification results. After classification, the results were visualized and saved.

Wing Detector
The wing detector is essential to extract several wings from one image and normalize the image aspect, allowing for the stable loading of the CNN inputs. Several models of object detectors were trained on dataset 1 using transfer learning [45], which stabilized the training from the initial iterations. The criteria for model selection were accuracy and speed, which are both shown for each tested model in Table 1. While all trained models achieved mAP levels > 0.9, SSD MobileNet v1 FPN coco [34] was revealed to be the most accurate (0.975) and at the same time the fastest model (24 images per second). This was an expected result because MobileNet presents a simpler architecture that requires fewer images to be trained than other models. Moreover, its simpler architecture can be associated with its generalization capability, reducing the "curse of dimensionality". This generalization capability ensures correct wing detection even when the quality of wing images is poor. This characteristic of MobileNet will be critical for future users of the pipeline developed here as the images under analysis will likely exhibit a wide range of visual diversity (e.g., specular reflections, varying contrast) and artifacts (Figure 3).
Given the demonstrated superior performance, MobileNet was chosen for the final implementation of the system. This CNN-based wing detector proved to be essential for assuring the uniformity of the wing shape pattern at the entrance of the U-Net, allowing the extraction of several wings from the same image even when they exhibited different orientations.

Size of Synthetic Landmarks for Training
The size of the synthetic landmarks, rendered as circles, directly influenced the effectiveness of the U-Net at the training stage. Circles with radii varying from one to five pixels were examined. The most appropriate radii were three and four pixels, with the latter performing better than the former (Table 2). Circles with a radius smaller than three pixels were too small and were virtually ignored by the neural network. On the other hand, circles with a radius larger than four pixels were too large, compromising the positional accuracy of the detected points. Furthermore, in several cases, the large radius led to the merging of different landmark regions into a single one, preventing the detection of all 19 landmarks.
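Building the target masks can be sketched as below (pure numpy; `cv2.circle(mask, (x, y), radius, 1, -1)` would be an equivalent OpenCV call). The 400 × 400 size and four-pixel radius follow the best configuration reported in the text:

```python
import numpy as np

def synthetic_mask(landmarks, size=400, radius=4):
    """Build a U-Net target mask: a filled disk of the given radius around
    each landmark coordinate (the 'synthetic landmarks' described above).
    A discrete radius-4 disk covers ~49 pixels, close to the pi*r^2 ~ 50
    analytic estimate."""
    mask = np.zeros((size, size), dtype=np.uint8)
    yy, xx = np.mgrid[0:size, 0:size]
    for x, y in landmarks:
        mask[(xx - x) ** 2 + (yy - y) ** 2 <= radius ** 2] = 1
    return mask
```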

U-Net Optimization
The U-Net was revealed to be an essential solution for detecting the 19 landmarks. While we tried classical object detectors (e.g., Faster R-CNN, YOLOv3), these were incapable of locating the landmarks with good precision. Furthermore, as a segmenter, the U-Net was designed to make pixel-oriented annotations, and that precision was well-suited to our system's needs.
Several parameter changes were made to the U-Net to improve its functional performance. The original U-Net implementation uses convolution layer kernels of 3 × 3 elements, but that version had a greater focus on detecting edges [30]. Kernels of sizes 3 × 3, 5 × 5, and 7 × 7 were tested. Larger kernels could deliver better results due to a broader view of the landmark context. The best segmentation quality was obtained for size 5 × 5 ( Table 2). Although no new layers were added to the U-Net, changing the size of the kernels had a substantial impact on memory and speed performance. For instance, increasing the kernels from 3 × 3 to 5 × 5 made the neural network run three times slower.
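The roughly threefold slowdown is consistent with how a convolution layer's cost scales with kernel size: the multiply-accumulate count grows with k², so moving from 3 × 3 to 5 × 5 multiplies the per-layer cost by 25/9 ≈ 2.8. A back-of-the-envelope sketch (the channel counts are illustrative, not the U-Net's actual ones):

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulate count of one stride-1 'same' convolution layer."""
    return h * w * c_in * c_out * k * k

# Cost ratio of a 5x5 vs. a 3x3 kernel on a 400x400 feature map
ratio = conv_macs(400, 400, 64, 64, 5) / conv_macs(400, 400, 64, 64, 3)
```

The ratio is independent of resolution and channel counts, which is why the ~3× slowdown appears regardless of layer size.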
The training results of the U-Net were further optimized by implementing different approaches, including (i) early stopping [46], (ii) reduction on plateau [39], and (iii) a weight loss function. Early stopping ends the model's training when the validation loss starts growing relative to the previous training iterations; by interrupting the process soon after the model converged (generalization state), it was possible to avoid overfitting. Reduction on plateau decreases the learning rate when the loss stops improving. Implementing these two approaches allowed a better level of convergence and landmark segmentation. The weight loss function, which allows differentiating the importance of different training classes, was implemented to emphasize the landmarks relative to the background. To illustrate the problem, a 400 × 400 image has 160,000 pixels and contains 19 circles with a radius of four pixels. Therefore, only 955 of those pixels constitute landmarks to be segmented, representing <1% of the total pixels. Accordingly, weighting the two classes by using the weight loss function was crucial for achieving segmentation success. Several weight configurations were tested; the one that produced the best results kept the weight of the background class at one and increased the weight of the landmark class by a factor of 50.
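Keras provides `EarlyStopping` and `ReduceLROnPlateau` callbacks for (i) and (ii). The class weighting in (iii) can be sketched as a numpy cross-entropy, assuming the best configuration reported above (background weight 1, landmark weight 50); the averaging choice is ours:

```python
import numpy as np

def weighted_bce(y_true, y_pred, w_landmark=50.0, w_background=1.0, eps=1e-7):
    """Pixel-wise binary cross-entropy where landmark pixels (class 1)
    weigh 50x more than background pixels, counteracting the <1%
    foreground fraction described above."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    w = np.where(y_true == 1, w_landmark, w_background)
    loss = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float((w * loss).mean())
```

With this weighting, misclassifying a landmark pixel costs 50 times as much as misclassifying a background pixel, which pushes the network away from the trivial all-background prediction.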

Evaluation of Landmarks Segmentation
In this section, the parametrization and results that produce a suitable generalization of the landmark segmentation are analyzed with regard to (i) the segmentation capability of the U-Net (ensuring that all 19 landmarks are detected), (ii) the positional precision of the 19 landmarks, and (iii) the robustness of the processing in dealing with image dust and noise.
Table 2 shows the capability of the U-Net to annotate exactly 19 landmarks under several configurations in the validation dataset. Radius is the radius, in pixels, of the landmarks used during training. Altered Images indicates whether the images were rotated or displaced during data augmentation of the training dataset. Dust corresponds to artifacts and noise artificially added to the images. Kernel is the U-Net kernel size. Weights indicates the use of the weight loss function during training. Accuracy indicates the success of the U-Net model in detecting all 19 landmarks.
A summary of a larger grid of parameter combinations is shown in Table 2. Lines 1 and 2 show an improvement in U-Net performance when the circle radius is increased from 2 to 3. Lines 2 and 3 show the improvement resulting from data augmentation with affine geometric transformations (Altered Images). Lines 3 and 4 reveal the benefit of a 5 × 5 kernel over a 3 × 3 kernel. The improvement achieved by adding noise is evidenced when comparing lines 4 and 5. A comparison between lines 5 and 6 underscores the benefit of using the weight loss function to train the U-Net. Lines 6, 7, and 8 show that a radius of four outperforms a radius of three and that a 5 × 5 kernel outperforms 7 × 7. When the best parameters are combined, the accuracy of landmark segmentation reaches 91.8%.
The U-Net model learned how to handle problems of dust and angle modifications, as illustrated for several examples in Figure 12a,b. However, when the images were excessively corrupted by visual artifacts, some landmarks went undetected (Figure 12c). In addition to distorting the detection of landmarks, artifacts in the image may also create false positives.
The positional precision of the 19 computed landmarks was assessed on the testing dataset by comparing the U-Net-generated landmarks with the manually annotated landmarks (Figure 1). Obtaining a high-precision landmarks segmenter is critical for honey bee classification when using wing geometric morphometrics. This is because the locations of the 19 landmarks are subspecies specific, as illustrated in Figure 1c,d for two genetically close subspecies [5], and any positioning error will impact classification accuracy. As shown in Table 3, precision varied among landmarks, with the lowest value obtained for landmark 9 (0.900) and the highest for landmark 19 (0.975). Curiously, landmark 19 is the single one located outside of a vein junction. The average precision obtained for the 19 computed landmarks was 0.943 ± 0.020.
While all 19 landmarks are used by the subspecies classifier, the information content carried by each one is variable. Table 4 shows the information gain ratio obtained for each x and y input feature associated with the 19 landmark coordinates. The most important features (information gain ratio > 0.2) for subspecies classification were 13-x, 17-y, 15-y, 13-y, and 8-y, which exhibited a precision ranging from 0.933 (landmark 15) to 0.958 (landmark 17; Table 3). Interestingly, these four landmarks are implicated in the calculations of the cubital index, hantel index, and discoidal shift angle, which are used by many queen breeders engaged in A. m. mellifera conservation [14]. This finding suggests that the pipeline is using, with good precision, the features that are well-known as correlated to the subspecies. Moreover, the use of wing landmarks in honey bee classification enables a very affirmative criterion for automatically excluding images, as classification is aborted when the number of extracted landmarks is different from 19.
In summary, the tests carried out here, using complex images from dataset 2 and images from external datasets (data not shown), revealed a high capacity of the U-Net for generalization in detecting the landmarks. The U-Net model was not only capable of individually detecting each landmark but also capable of relating the landmarks in a pattern, enabling inference of missing landmarks. Furthermore, the U-Net showed precision and robustness in dealing with new and visually corrupted images. This study provides new insights into machine learning research by offering an alternative solution for problems demanding annotation of landmarks with a high level of precision (e.g., reference points in aerial images, facial points used for biometric recognition, markers at body joints employed for motion acquisition, anatomical landmarks in medical images, fingerprint minutiae for person recognition). To the best of our knowledge, our approach is unique in the usage of the U-Net for the segmentation of landmarks. Remarkably, the U-Net allowed a precise and robust (regarding the noise) segmentation process when using a kernel of 5 × 5 and landmark masks with a radius of 4 pixels. Furthermore, employing a weight cost function was revealed to be essential to emphasize the landmarks relative to the background, further contributing to the success of the final solution.

Classification
The classifier receives image features already treated and simplified in the previous stages of the pipeline (Figure 6). For instance, the U-Net translates the wing image into 19 stable data points, and the Procrustes method ensures positional independence (translation, scale, rotation), so that the training dataset does not need to cover those variations and future images can be classified regardless of them.
The images were classified using SVM [43], a model that does not require extensive hyperparameter tuning, simplifying the training phase. The validation dataset was employed to find the SVM configuration that returned the most accurate classification: a C factor of 30, a gamma of 0.17, and an RBF (Radial Basis Function) kernel. The model input consisted of the 38 values (x, y landmark coordinates) normalized by the Procrustes method; the model output codified the subspecies under classification. The final SVM configuration was able to classify the forewings with 86.6% ± 6.9 average (±SD) accuracy across the 26 subspecies represented in the testing dataset (Table 5), despite the relatively small number of images used in the classification training and the low quality of many of them (Figure 3). Nawrocka and colleagues [8] found a similar average accuracy (88.4%) for a similar manually annotated dataset (25 instead of 26 subspecies of the Oberursel collection), further validating the performance of the pipeline developed herein. These authors [8] also employed the geometric morphometrics method, although they classified the 25 subspecies using linear canonical variate analysis as opposed to the non-linear SVM approach. In another study, Da Silva and colleagues [7] used manually annotated landmarks on the Oberursel wing images to compare seven different classifiers, including SVM. Unexpectedly, the authors reported poor performance of the SVM classifier (60.04%) compared to this study, with their best classifier, Naïve Bayes, achieving only 65.15% accuracy in cross-validation. At the subspecies level, the lowest accuracy was observed for the African A. m. litorea (60.9%) and the highest for the eastern European A. m. cecropia (96.4%; Table 5). At the lineage level, the lowest average accuracy was observed for the African (75.0%) and the highest for the western European (92.2%; Table 5) lineages.
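Using scikit-learn, the reported configuration can be sketched as follows; the synthetic data stand in for the real Procrustes-normalized features and are not from the paper:

```python
import numpy as np
from sklearn.svm import SVC

# Hyperparameters reported above: RBF kernel, C = 30, gamma = 0.17.
# Input: 38 values (19 landmarks x 2 Procrustes-normalized coordinates).
clf = SVC(kernel="rbf", C=30, gamma=0.17, probability=True)

# Illustrative fit on synthetic data standing in for real wing features
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 38))
y = np.repeat([0, 1, 2], 20)        # three hypothetical subspecies labels
X[y == 1] += 2.0
X[y == 2] -= 2.0                    # make the toy classes separable
clf.fit(X, y)

# Top-3 class probabilities, as displayed by the Web service
top3 = np.argsort(clf.predict_proba(X[:1])[0])[::-1][:3]
```

With `probability=True`, `predict_proba` supplies the per-subspecies probabilities that the Web interface reports for each wing.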
The poorer performance of the classifier for African subspecies is consistent with the findings of [8] and is explained by the closer morphological proximity among subspecies from central and southern Africa. Accordingly, when the subspecies of African ancestry were excluded from the analysis, classification accuracy increased up to 90.5% ± 1.7.
The performance of the pipeline was further improved (95.8% accuracy; Table 6) by training another model with only five subspecies chosen for their commercial value (A. m. ligustica, A. m. carnica and A. m. caucasia) or conservation status (A. m. mellifera and its sister A. m. iberiensis) [47]. Except for A. m. iberiensis, the remaining four subspecies were amongst the best represented in dataset 2, used for classification training, with A. m. carnica having the largest number of wing images (n = 150). While excluding subspecies of African ancestry led to greater classification accuracy (this study and [8]), it is possible that the improved performance allowed by the five subspecies model is also related to the larger sample size used during training. Although not directly comparable with our system, the end-to-end solution developed from the entire wing (as opposed to the landmarks) by De Nart and colleagues [9] achieved accuracy values of 99% in cross-validation from a much larger (n = 9887) proprietary wing image dataset representing one hybrid and seven subspecies (A. m. ligustica, A. m. carnica, A. m. caucasia, A. m. anatoliaca, A. m. siciliana, A. m. iberiensis, A. m. mellifera) using CNN.
Training the system with a larger number of wing images per subspecies may lead to a more accurate classification. Yet, this effort can only be achieved with wing images obtained from newly collected specimens outside of the Oberursel collection, as was carried out by De Nart and colleagues [9]. The problem is that this effort involves the identification of the specimens using the full set of morphological traits, which is a time-consuming endeavor requiring expert knowledge that is not always available [5]. Moreover, the classification pipeline developed herein was based on the explicit choice of training the system with the original wing collection used by Ruttner [1] in the delineation of the A. mellifera subspecies, in spite of the low number of wing images and their often-low quality. Nonetheless, if needed, the system can be trained with new examples to improve classification rates on all types of wing images.

Computational Cost Analysis
The final implementation of the system, using a CNN MobileNet, a U-Net, and an SVM, presented good speed performance, requiring 14 s to process 10 images, which is the minimum number recommended for colony-level identification [5]. The computational machine was based on a six-core i5-9400 CPU @ 2.9 GHz and 16 GB of main memory. As the implemented solution is based on threads, overall speed could be increased by using a CPU with a larger number of cores. The CPU clock also directly influences speed because the main processing demands mathematical vector operations whose throughput is correlated with the vectorial instruction speed. The deep learning framework also allows running the code on a GPU, achieving an overall speed-up of about 15× compared to the CPU.

DeepWings© as a Web Service
The pipeline developed herein for honey bee subspecies identification was registered as software named DeepWings© (Registo de obra n.º 3214/2019, Inspeção-Geral das Atividades Culturais). DeepWings© is implemented as a free Web service available at the URL https://deepwings.ddns.net, accessed on 26 April 2022. The Flask framework was used to program the Web service, which was built with thread-based parallel programming.
DeepWings© is a user-friendly software that only requires dragging wing images into a file drop zone (Figure 13a). After image processing, a table containing the probabilities of the top three subspecies is built on the Web page (Figure 13b). In addition to classification, DeepWings© computes several wing geometric parameters and the coordinates of the landmarks (Figure 13c). Classification probabilities, geometric parameters (cubital index, hantel index, discoidal shift angle), and landmark coordinates can be downloaded as Excel files, facilitating data storage for further analysis or alternative applications. For instance, the coordinates of the landmarks can be used directly by other identification software, such as MorphoJ [48] or IdentiFly [26], or to calculate angles and lengths required by other methods such as DAWINO or classical wing morphometry [5]. DeepWings© can therefore be used in conjunction with measurements of other body traits (e.g., pilosity, pigmentation, proboscis length) for purposes requiring more intensive methods, such as the identification of new subspecies [2][3][4][5]. More importantly, however, beekeepers and queen breeders now have a friendly tool for identifying their colonies for conservation or commercial purposes.
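As an illustration of reusing the exported coordinates, the cubital index is a ratio of two adjacent vein-segment lengths. A hedged sketch follows; the landmark indices `a`, `b`, `c` are placeholders, since the actual assignment depends on the Figure 1b numbering, which we do not reproduce here:

```python
import numpy as np

def cubital_index(landmarks, a=0, b=1, c=2):
    """Ratio of the lengths of two vein segments sharing landmark b.
    The indices a, b, c are HYPOTHETICAL placeholders, not the paper's
    actual landmark assignment for the cubital index."""
    p = np.asarray(landmarks, float)
    return float(np.linalg.norm(p[a] - p[b]) / np.linalg.norm(p[b] - p[c]))
```

The same pattern of distances and angles between exported landmark coordinates covers the hantel index and discoidal shift angle used by classical wing morphometry.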

It is increasingly recognized that sustainable beekeeping requires the use of native subspecies, as they are better adapted to local environments and show superior performance when compared with exotic subspecies [49]. The problem is that the genetic integrity of many honey bee subspecies is threatened after many generations of importation of exotic queens [47]. A. m. mellifera is the best example of such a situation, as in large tracts of its native distribution, this subspecies is severely introgressed or even on the brink of extinction. This has led to increasing demand for native subspecies, especially for A. m. mellifera, and consequently for identification tools that can be used to certify the origin of the queens. Such certification is required for beekeepers for (i) moving their colonies to conservation areas, (ii) monitoring the efficiency of isolated mating stations, and (iii) receiving subventions according to local or European legislation. Additionally, beekeepers may be able to increase the market value of their stock by certifying the queens' origin. Highly accurate subspecies identification implies measuring 36 morphological traits [5] or genotyping genome-wide molecular markers [28]. However, these methods are not accessible to most beekeepers and queen breeders, as the former is too laborious and time-consuming and the latter is still too expensive. DeepWings© offers an alternative solution for the above applications, which often do not require the accuracy levels provided by those morphological or molecular methods.

Conclusions
The software DeepWings©, developed here for the identification of honey bees by wing geometric morphometrics, combines a CNN as the wing detector, a deep learning U-Net as the landmarks segmenter, and an SVM as the subspecies classifier. This multi-step solution was revealed to be better suited than an end-to-end solution for our problem for three main reasons. First, by using models of low complexity (U-Net and SVM), it was possible to train the system with the small wing collection used by Ruttner [1] to identify the honey bee subspecies. Second, it allowed the neural model to search for the features (landmarks) that are used by the gold-standard method in honey bee classification based on wings: geometric morphometrics. Third, by introducing the Procrustes method between the landmark detection step and the classification step, the classifier could be trained with greater robustness, as the wing invariances (translation, rotation, and scale) do not need to be learned. Despite the apparent complexity, the system showed good speed performance, requiring 14 s to process 10 images, which is the minimum number recommended for colony-level identification [5].
While there is another wing geometric morphometrics tool (IdentiFly) available for honey bee classification, DeepWings© is the first to do so in a fully automated manner. The scientific novelty and greatest contribution of our solution to honey bee classification is related to the capability of DeepWings© to segment the wings from images containing varying artifacts, varying numbers of wings, and wings with different orientations, and then segment the landmarks with high precision.
Since the wing shape patterns differ slightly among subspecies, particularly when they belong to the same lineage, it became critical to use a segmenter that would allow high precision in detecting the 19 landmarks. This goal was achieved by using the U-Net, which showed good performance even when the images were very noisy.
The use of SVM in the classification was revealed to be a good solution with suitable generalization, given the weak separation among subspecies (particularly those of African ancestry) and the low number of images available for classification training (26 subspecies represented by only 1864 images). Despite these limitations, classification accuracy was 86.6% for the 26 subspecies and increased to 95.8% when the classifier was trained with only five subspecies. Higher accuracy would have been expected had the dataset used in the classification training been larger. However, despite the low number of wing images contained in the Oberursel collection and their often-low quality, we explicitly chose to train our system with the dataset that was originally used by Ruttner [1] in the delineation of the A. mellifera subspecies.
DeepWings© is available as free software for use in colony identification for multiple purposes (e.g., monitoring isolated mating stations, selection of queens in conservation apiaries) by beekeepers, queen breeders, and even scientists. In addition to classification, DeepWings© provides the coordinates of the 19 landmarks, and these can be processed by other software, facilitating data exchange between different scientific studies and research teams.

Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/walterBSG/Beeapp-landmark-detection, accessed on 26 April 2022.