Detection and Recognition of Pollen Grains in Multilabel Microscopic Images

Analysis of pollen material obtained from the Hirst-type apparatus, which is a tedious and labor-intensive process, is usually performed by hand under a microscope by specialists in palynology. This research evaluated the automatic analysis of pollen material based on digital microscopic photos. A deep neural network called YOLO was used to analyze microscopic images containing the reference grains of three taxa typical of Central and Eastern Europe. YOLO networks perform recognition and detection jointly; hence, there is no need to segment the image before classification. The obtained results were compared to other deep learning object detection methods, i.e., Faster R-CNN and RetinaNet. YOLO outperformed the other methods, giving a mean average precision (mAP@.5:.95) between 86.8% and 92.4% on the test sets included in the study. Among the difficulties related to correct classification of the research material, the following should be noted: the significant similarity of the grains of the analyzed taxa, the possibility of their simultaneous occurrence in one image, and the mutual overlapping of objects.


Introduction
According to the White Book on Allergy, approximately 30-40% of the world's population suffer from an allergic disease, and the incidence of this illness is constantly increasing [1]. One of the main causes of allergies is the pollen grains of wind-pollinated plants. The presence of pollen grains in the air is seasonal and related to the flowering time. Pollen seasons are highly variable in individual years, especially for trees blooming in early spring, when the weather conditions are unstable. This entails the necessity of constant local monitoring of pollen, which provides relevant information to allergists and their patients about the risk of allergenic pollen.
The standard method for pollen monitoring is the Hirst-design volumetric spore trap [2]. Currently, two brands of this equipment are available on the market: Lanzoni s.r.l. (Bologna, Italy) and Burkard Manufacturing Co. Ltd. (Rickmansworth, Great Britain). Samplers work continuously and provide hourly and daily data [3,4]. They have a drum moving at 2 mm/h inside the trap. A transparent adhesive-coated tape is wound around the drum [4]. The air with bioaerosol particles is sucked in through a narrow orifice and then directed to the sticky surface. The airflow is 10 L/min, which is close to the volume of air inhaled by an adult during breathing. The tape is replaced at exactly the same time after a week's exposure. Then, in the laboratory, the tape with the glued biological material is cut into sections corresponding to 24 h, thus providing the basis for microscopic preparations. The qualitative and quantitative analysis of the collected material is then performed manually under a microscope.

images per type. This set covers pollen from trees and grasses sampled in Graz (Austria) using a novel cyclone-based particle collector [20].
In our research, YOLOv5 was applied for the detection of pollen grains of three taxa (Alnus, Betula, and Corylus) in microscopic images. These strongly allergenic taxa are common in Central and Eastern Europe. The results from models obtained on the basis of two YOLOv5 releases, as well as different dataset variants, were compared. To eliminate bias in the model training process, we created our dataset from reference images, which contained grains of only one taxon. This solution allowed us to limit the workload of the palynologist. The detection results indicate the high accuracy of the generated models.

Materials and Methods
The most frequent species in Poland were selected for this research: Betula verrucosa Ehrh. (syn. B. pendula Roth), Corylus avellana L., and Alnus glutinosa (L.) Gaertn. [21]. Reference materials were prepared by spreading pollen grains directly from the catkin onto a microscope slide. The preparations were then mounted in glycerinated gelatin.
Slides with pollen grains were analyzed using an Eclipse E400 light microscope (Nikon, Tokyo, Japan) at a magnification of 600×. Photographs of pollen grains were taken using the HDCE-x5 microscope camera.

Dataset Preparation
The available microscopic image database was split into a training set of 265 images, a validation set of 114 images, and a test set of 49 images. Examples of training photos for each taxon are presented in Figure 1. Sixteen images with too many overlapping objects and many defocused grains were excluded from these sets; they were reserved for an additional test checking the quality of the obtained models on pictures that are challenging to recognize.

Data Preprocessing and Labeling
The annotated database of the microscopic photos of pollen grains was prepared following relevant guidelines [22], and made publicly available as ABCPollenOD. The microscopic images were cropped to show only the whole grains without grain fragments. Object labeling for detection consists of marking the object's location using a bounding box and marking the class it belongs to. The dataset was annotated twice: (1) all pollen grains belonging to the classes related to the studied taxa (DBAll) were labeled; (2) only pollen grains identified by a palynologist (DBVisible) without any doubts were marked. All co-authors verified the annotations to provide reliable ground-truth bounding boxes.
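Bounding-box annotations of the kind described above are commonly stored in the YOLO text format: one file per image, one line per object, with the class index followed by the box center and size normalized to the image dimensions. A minimal sketch of the pixel-to-normalized conversion (the helper name and the example image size are illustrative, not taken from the paper):

```python
def to_yolo_label(cls_id, box, img_w, img_h):
    """Convert a pixel-space bounding box (x_min, y_min, x_max, y_max)
    into a YOLO-format line: class x_center y_center width height,
    all coordinates normalized to [0, 1]."""
    x_min, y_min, x_max, y_max = box
    x_c = (x_min + x_max) / 2 / img_w
    y_c = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{cls_id} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# A grain of class 1 occupying pixels (100, 150)-(300, 350) in a
# hypothetical 1280x960 photo:
print(to_yolo_label(1, (100, 150, 300, 350), 1280, 960))
```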

Investigated Models
YOLO is a general-purpose detector with the ability to detect a variety of objects simultaneously. It divides the image into a grid of cells and predicts bounding boxes, confidence scores for these boxes, and class probabilities for each cell. In [23], Redmon et al. introduced the first version of the YOLO technique, in which each grid cell predicts only two boxes and one set of class probabilities. The second version of YOLO was introduced in [24]. A new classification model, Darknet-19, is used as its base. YOLOv2 is better and faster than its predecessor; the main improvement is the use of anchor boxes. In 2018, Redmon and Farhadi introduced the third version of YOLO in their paper [25]. This version is more accurate than the earlier ones but slightly larger. YOLOv3 was the last version created by Redmon. It consists of 75 convolutional layers without fully connected or pooling layers, which significantly reduces the model size and weight. Darknet-53 was used this time as the backbone architecture.
In 2020, three versions of YOLO were developed by different creators: YOLOv4 by Bochkovskiy et al., presented in [26]; YOLOv5, released by Jocher and the Ultralytics company [27] as a PyTorch implementation of YOLOv3; and PP-YOLO, created by Long et al. [28]. YOLOv4 has the same backbone as YOLOv3 but introduces the concepts of the bag of freebies and the bag of specials. The bag of freebies contains, among others, data augmentation techniques such as CutMix (cutting and mixing multiple images), MixUp (random mixing of images), and Mosaic data augmentation. An example from the bag of specials is non-maximum suppression (NMS), which removes redundant boxes when multiple bounding boxes are predicted for grouped objects. In YOLOv5, some technical changes have been made, e.g., better data augmentation, improved loss calculations, and auto-learning of anchor boxes. The Mosaic Dataloader, a concept developed by Ultralytics and first featured in YOLOv4, is used for model training. YOLOv5 models pretrained on the COCO database are freely available in several releases. PP-YOLO is an object detector based on YOLOv3 and PaddlePaddle [29], where ResNet [30] was used as the backbone and data augmentation was achieved with MixUp.
In the structure of the YOLO network, only convolutional layers are used, which makes it a fully convolutional neural network. The main parts of YOLO are presented in Figure 2. The input image features are compressed through a feature extractor (backbone). The detection neck is a feature aggregator that combines and mixes features formed in the backbone and then forwards them to the detection head [31].
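In the grid-based formulation described above, each cell predicts box centers relative to the cell and box sizes relative to the whole image. A toy sketch of mapping one such prediction back to pixel coordinates (the grid and image sizes are illustrative, following the original YOLO formulation rather than this paper's configuration):

```python
def cell_to_image_box(pred, row, col, grid_size, img_size):
    """Map a cell-relative prediction (x, y, w, h) back to pixel coordinates.
    x, y are offsets of the box center within its grid cell; w, h are
    fractions of the whole image, as in the original YOLO formulation."""
    x, y, w, h = pred
    x_c = (col + x) / grid_size * img_size   # box center, pixels
    y_c = (row + y) / grid_size * img_size
    bw, bh = w * img_size, h * img_size      # box size, pixels
    return (x_c - bw / 2, y_c - bh / 2, x_c + bw / 2, y_c + bh / 2)

# A box predicted at the center of cell (row 3, col 4) on a 7x7 grid
# over a 448x448 image:
box = cell_to_image_box((0.5, 0.5, 0.25, 0.25), row=3, col=4,
                        grid_size=7, img_size=448)
```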

Data augmentation methods help to improve model accuracy. In YOLOv5, the methods that cut and mix images containing objects of different classes produce multiclass examples, even if the input database consists of single-class samples. Therefore, YOLO with data augmentation is an appropriate method for handling microscopic images taken from reference biological material, and no specialist needs to be involved in the annotation process.
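The effect of such cut-and-mix augmentation on single-class reference images can be illustrated with a simplified 2x2 mosaic. This is only a sketch of the idea: YOLOv5's Mosaic Dataloader additionally jitters the mosaic center, rescales the tiles, and remaps the bounding boxes, none of which is shown here.

```python
import numpy as np

def simple_mosaic(imgs, labels):
    """Tile four single-class images into one 2x2 mosaic and merge their
    class-label sets, yielding a multiclass training example.
    Simplified: box coordinates are omitted, and the mosaic center is
    fixed instead of randomly jittered as in YOLOv5."""
    assert len(imgs) == 4
    top = np.hstack(imgs[:2])
    bottom = np.hstack(imgs[2:])
    mosaic = np.vstack([top, bottom])
    merged = sorted(set(l for ls in labels for l in ls))
    return mosaic, merged

# Four 64x64 grayscale tiles, each holding one taxon class (0, 1, or 2):
tiles = [np.zeros((64, 64), dtype=np.uint8)] * 4
img, classes = simple_mosaic(tiles, [[0], [0], [1], [2]])
# img.shape == (128, 128); classes == [0, 1, 2]
```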
Two releases of YOLOv5 were chosen: YOLOv5s (small) and YOLOv5l (large). They differ in the number of layers and parameters and in the values of the initial hyperparameters. Both networks were pre-trained for 300 epochs on the COCO dataset [32] and are freely available. For each release, the following four models were built by fine-tuning the initial models with the two types of database labeling: (1) ModelVis, trained and validated on the DBVisible set; (2) ModelAll, trained and validated on the DBAll set; (3) ModelAllVis, trained on DBAll and validated on DBVisible; (4) ModelVisAll, trained on DBVisible and validated on DBAll.

Training Procedure
The training of the investigated models was performed for at most 500 epochs. In several cases, training was stopped earlier, as no improvement had been observed over the last 100 epochs. The following performance measures were calculated to evaluate the model in each training epoch: precision, recall, mAP@.5, and mAP@.5:.95.
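The stopping rule above (halt when the best fitness has not improved for 100 epochs) is a standard patience criterion. A minimal sketch of the logic, with the function name and fitness curve chosen for illustration:

```python
def train_with_patience(fitness_per_epoch, max_epochs=500, patience=100):
    """Return the epoch at which training stops: either max_epochs is
    reached or fitness has not improved for `patience` epochs."""
    best, best_epoch = float("-inf"), 0
    for epoch, fit in enumerate(fitness_per_epoch[:max_epochs], start=1):
        if fit > best:
            best, best_epoch = fit, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # early stop: no improvement for `patience` epochs
    return min(len(fitness_per_epoch), max_epochs)

# Fitness plateaus after epoch 3, so training stops 100 epochs later:
curve = [0.5, 0.6, 0.7] + [0.7] * 400
```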
The precision is a set-based evaluation metric: the ratio of true positives to the total number of objects assigned by the system to the target class, calculated for each class separately from the true and predicted values. The recall is the proportion of correctly classified instances of the target class to the total number of objects of this class in the evaluation set, again per class. The precision-recall curve visualizes both measures, and the average precision (AP) is calculated as the area under this curve for each class separately.
In the Pascal VOC object detection challenge [31], the mean average precision is used for model evaluation. This metric takes into account how well the predicted bounding boxes fit the actual locations of the objects, measured by the intersection over union (IoU). The mAP@.5 measure is the mean of AP over all classes, where @.5 means that only predicted objects whose bounding boxes reach an IoU above 0.5 with the ground truth are counted as positives.
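The IoU threshold behind the @.5 notation can be made concrete with a short sketch that computes the overlap of two axis-aligned boxes:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# A prediction shifted by a quarter of the box width relative to the ground
# truth: accepted as a true positive at the 0.5 threshold, rejected at 0.75.
print(iou((0, 0, 100, 100), (25, 0, 125, 100)))  # 0.6
```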
In the COCO challenge, which targets the detection of all objects in the COCO dataset, mAP@.5:.95 is the primary evaluation metric. It averages AP over ten equally spaced IoU thresholds, from 0.5 to 0.95 with a step of 0.05.
The evaluation of our models was based on the fitness measure, a weighted mean of mAP@.5:.95 and mAP@.5 (with weights 0.9 and 0.1, respectively) [31]. The entire training procedure was repeated three times to assess the stability of the results. The final results are presented as the arithmetic mean and standard deviation of mAP@.5:.95 over the three repetitions.
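The threshold grid and the fitness combination described above can be sketched as follows (the numeric inputs in the example are illustrative, not results from the paper):

```python
def coco_iou_thresholds():
    """Ten IoU thresholds 0.50, 0.55, ..., 0.95 averaged by mAP@.5:.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]

def fitness(map50, map50_95):
    """Weighted mean used for model selection: 0.1 * mAP@.5 + 0.9 * mAP@.5:.95."""
    return 0.1 * map50 + 0.9 * map50_95

print(coco_iou_thresholds()[0], coco_iou_thresholds()[-1])  # 0.5 0.95
print(round(fitness(map50=0.95, map50_95=0.90), 3))         # 0.905
```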

Test Datasets
Besides the test sets derived from DBAll and DBVisible (testAll and testVisible, respectively), testing was also done on testDiff, which contains the images previously excluded from the database due to the large number of overlapping and blurred objects. Additionally, the testMix set of 25 images was prepared from parts of original test-set photos to obtain sample images with grains of various taxa. This dataset allowed us to check whether models trained on one-class examples from reference materials can recognize objects in multi-labeled pictures.
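A simplified stand-in for this composition step can be written with NumPy. The crop sizes and pixel values below are illustrative; the actual testMix images were assembled manually from photo fragments:

```python
import numpy as np

def compose_mix(crops):
    """Paste crops taken from single-taxon photos side by side to build
    a synthetic multi-taxon test image (a simplified stand-in for the
    manual composition used for testMix)."""
    height = min(c.shape[0] for c in crops)
    return np.hstack([c[:height] for c in crops])

# One crop per taxon (Alnus, Betula, Corylus), e.g. 200x150 RGB patches:
crops = [np.full((200, 150, 3), v, dtype=np.uint8) for v in (50, 120, 200)]
mix = compose_mix(crops)
# mix.shape == (200, 450, 3)
```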

Other Detection Networks Used for Comparison
The Faster R-CNN and RetinaNet deep neural networks were applied for comparison with the YOLO results. These detectors were trained and tested on the DBAll set. For each network, we chose only the single model with the highest mAP@.5:.95 to check whether the results of these two networks could exceed the YOLO results.

RetinaNet
The one-stage RetinaNet network architecture uses a Feature Pyramid Network (FPN) backbone on top of a feedforward ResNet architecture to generate a multi-scale convolutional feature pyramid. RetinaNet attaches two subnetworks: one for classifying anchor boxes and one for regressing from anchor boxes to ground-truth object boxes. The focal loss function enables RetinaNet to achieve accuracy at the level of two-stage detectors such as Faster R-CNN with FPN while running at faster speeds [33].
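The focal loss mentioned above down-weights well-classified examples through a modulating factor (1 - p_t)^gamma, so training focuses on hard examples. A one-example sketch, using the commonly cited default values alpha = 0.25 and gamma = 2 (defaults from the RetinaNet paper, not values reported in this study):

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss for a single example: -alpha * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# The modulating factor (1 - p_t)**gamma shrinks the loss of easy examples:
easy = focal_loss(0.9)  # factor (1 - 0.9)**2 = 0.01
hard = focal_loss(0.1)  # factor (1 - 0.1)**2 = 0.81
```

With gamma = 0 the expression reduces to the ordinary alpha-weighted cross-entropy, which makes the role of the modulating factor easy to see.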

Faster R-CNN
Faster R-CNN (region-based convolutional neural network) was the first network to combine features for region proposal with object classification [14]. It is composed of two modules. The first module is a deep, fully convolutional network that proposes regions, and the second module is the Fast R-CNN detector [13].
Although YOLOv2 gives comparatively low recall and more localization errors than Faster R-CNN [24], we decided to apply the YOLO network to our task due to the following advantages: fewer background mistakes than Fast R-CNN [23] and a generalized object representation [24]. Moreover, one of the main advantages of YOLO is the high speed of the inference process, although speed was not critical in this task.
All calculations were performed using the free cloud notebook environment Google Colaboratory [34].


Results
The training procedure was run three times and yielded twenty-four models. The average values and standard deviation of mAP@.5:.95 are presented in Figure 3. Very satisfactory average results of 88-92% were obtained, except for the test on testDiff, with an average recognition of 74-81%. The evaluation of the 24 models on testAll, testVisible, and testMix ranged from 86.8% to 92.4%, and more than half of them reached 90%. Detailed results for all models are shown in Table 1.
The comparison of the YOLOv5l and YOLOv5s models shows that both releases give similar results. The YOLOv5s models are much smaller, can be built faster, and do not require high-performance computational resources.
Models built on different datasets give similar results when evaluated on testAll, testVisible, and testMix. An example of an image with correct and incorrect recognitions from testMix is presented in Figure 4. As indicated by the recognition results on the testDiff set, the models trained on DBAll (ModelAll and ModelAllVis) exhibit much better performance. Figure 5 presents an example image from this set together with the detection results of two models built from release YOLOv5s: ModelAll (Figure 5A) and ModelVisAll (Figure 5B). Blurred grains are ignored by ModelVisAll and correctly indicated by ModelAll. The presented example also shows that ModelAll more often indicates objects that are not pollen grains. However, it is worth noting that these images are challenging even for specialists.
The measure mAP@.5:.95 validates the classification together with the fit of the bounding box to the detected object. In the case of monitoring, it is crucial to count the occurrences of all pollen grains of a given taxon, but their precise location is not important. The prediction results are expressed by the performance measures described above, namely precision, recall, mAP@.5, and mAP@.5:.95. The average precision and recall calculated from the three repetitions separately for the testAll, testVisible, and testMix datasets are presented in Table 2. Depending on the model, the average precision ranged from 91.7% to 97.8%, while the average recall ranged from 89.7% to 98.9%. Confusion matrices present the recognition of each class separately. In Table 3, the averaged results of all twenty-four models tested on the testMix dataset are shown. The Alnus and Corylus pollen grains are rarely incorrectly detected; in particular, Alnus grains were never misclassified as Betula.
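Per-class rates of the kind reported in Table 3 come from row-normalizing a confusion matrix. A sketch with illustrative counts (not the paper's actual values):

```python
import numpy as np

def per_class_recall(cm):
    """Row-normalize a confusion matrix (rows = true class, columns =
    predicted class) so each row shows the fraction of that class
    assigned to each label."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

# Illustrative counts only (not the paper's Table 3); class order:
# Alnus, Betula, Corylus.
cm = [[98, 0, 2],
      [9, 81, 10],
      [1, 3, 96]]
rates = per_class_recall(cm)
# rates[1] -> [0.09, 0.81, 0.10]: birch partly confused with alder and hazel
```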
The weakest recognition accuracy was noted for the birch pollen grains: nearly 10% were classified as alder and another 10% as hazel. It is easy to notice that the mAP@.5:.95 values obtained with YOLOv5 on the testAll set exceed those obtained with RetinaNet and Faster R-CNN (Table 4). The results on the other test sets were similar; therefore, we did not carry out any further tests on these detectors.

Discussion
Various monitoring methods produce different results, and it is not easy to decide which method is the most reliable. The traditional volumetric method used in Poland assumes the analysis of certain sections of the collected material under a microscope and normalization of the obtained value to the number of grains contained in 1 m³ of air. The quantitative analysis is affected by error associated with the randomness of the samples and the palynologist's experience. Therefore, 100% recall is not required for the automatic recognition system, because specialists do not count a grain if they are not sure to which taxon it belongs. This is why 80% detection accuracy on the monitored material seems satisfactory. Our research is based on detection from reference material; hence, we expect better results.
This study is a continuation of our previous work that attempted to create classification models based on deep learning [11]. The previous study was focused only on the identification of a taxon, which is crucial in pollen monitoring. The current research is the next step towards creating an automated taxon recognition system from material collected by Hirst-type traps.
In [20], a system for the automatic creation of a digital database [19] containing microscopic photos in the form of video sequences was proposed. That research was also based on reference material; however, the slides were prepared differently than in our study. The authors proposed a new detection system working "in the wild" with high accuracy. Our research focused on improving the monitoring system currently applied in monitoring stations in Poland based on volumetric spore traps; therefore, previous and new results can be compared, as can results between different locations. This is especially important because the results from Hirst-type traps and automatic real-time systems are significantly but not closely correlated: in [35], the correlation coefficient for the automatic pollen monitor (PoMo) was in the range of 0.53-0.55.
In [19], the authors describe the creation of an image database in the form of video sequences by recording reference material for 16 types of pollen. The main problem indicated by the authors is the grouping of pollen grains into agglomerates formed during the injection of pollen samples into their system. This is a severe problem because overlapping grains impede their detection. The experimental results from their study specify the number of grains detected by the system; nevertheless, in our opinion, a comparison with the ground truth is lacking.

Conclusions
Depending on the model, we achieved average precision ranging from 91.7% to 97.8% for the YOLO detector tested on sets derived from DBAll and DBVisible. The average recall values were within the range of 89.7-98.9%. This result is highly satisfactory, especially since the pollen grains of two of the studied taxa have a similar structure. The discrimination between birch and hazel pollen grains is highly problematic, especially when some grains are out of focus. Alder pollen grains can be distinguished more easily: they have five pores, whereas birch and hazel pollen grains have three. Moreover, Alnus grains have characteristic arci, thickened bands of exine arching from one porus to another.
In this study, we achieved similar detection results for YOLOv5 models built on different training sets. In particular, the differences in mAP@.5:.95 between models trained on all grains and on visible ones are minimal. This allows us to conclude that not all grains have to be annotated in the training set used to create the detector; in particular, hard-to-recognize grains do not have to be labeled. Additionally, for comparison, the best result for the other networks (Faster R-CNN and RetinaNet) was an mAP@.5:.95 of 82%, which is much lower than the result obtained from any YOLOv5 run tested on testAll.
It is worth noting that, despite the small number of training samples, the quality of the obtained models is very satisfying. Our research shows that YOLO can recognize multi-labeled images even if there are only single-labeled examples in the training set. This significantly simplifies the preparation of the database for pollen grain detection, because one can use reference material without involving a palynologist in the annotation process. We expect that taxa can be properly recognized in original monitoring samples when YOLO is trained on single-labeled reference material with photos of the same quality.
In the near future, we want to test our detectors on fresh biological material derived from the Hirst-type trap. We will also consider using the dataset available in [19].