A Plankton Detection Method Based on Neural Networks and Digital Holographic Imaging

Detecting marine plankton by means of digital holographic microscopy (DHM) has been successfully deployed in recent decades; however, most previous studies neglected the identification of the position, shape, and size of plankton, which negates some of the advantages of DHM. Therefore, an image fusion procedure was added between the reconstruction of the initial holograms and the final identification, so that all the plankton in a volume of seawater can be presented clearly in a single image. A new image fusion method, digital holographic microscopy-fully convolutional networks (DHM-FCN), is proposed, based on an improved fully convolutional network (FCN). The DHM-FCN model runs 20 times faster than traditional image fusion methods and suppresses noise in the holograms. All plankton in a 2 mm thick water body could be clearly represented in the fusion image; the edges of the plankton in the DHM-FCN fusion image are continuous and clear, without speckle noise inside the cells. A YOLOv4 neural network model was established for plankton identification and localization. A mean average precision (mAP) of 97.69% was obtained for five species: Alexandrium tamarense, Chattonella marina, Mesodinium rubrum, Scrippsiella trochoidea, and Prorocentrum lima. The results of this study provide a fast image fusion method and a visual method to detect organisms in water.


Introduction
Plankton are a crucial component of the global ecology and play essential roles in carbon and nutrient cycling and in food chains [1,2]. Marine plankton are vital for assessing water and aquatic ecosystem quality because of their short lifespans and strong sensitivity to environmental conditions [3]. Mesodinium rubrum is a phototrophic ciliate that hosts cryptophyte symbionts and is well known for its ability to form dense harmful algae blooms (HABs) worldwide [4]. Chattonella blooms have been associated with serious economic damage to the Japanese aquaculture industry since 1972 [5]. Therefore, a robust method for detecting plankton is of great importance for monitoring the marine ecological environment.
Several technologies for plankton monitoring have been established, such as traditional microscope technologies [6][7][8], the FlowCAM tool [9][10][11], and fluorescence analysis [12,13], among others. Traditional microscope technologies based on computer vision have been popular in recent years; however, when photographing plankton, the field of view is limited and only the cells in the focal plane are imaged clearly [8]. FlowCAM can analyze particles or cells in a moving fluid. However, the accuracy of detection is greatly affected by undesirable particles, such as air bubbles entering the pump, which are also detected as plankton [11]. Moreover, the structure of the plankton may be destroyed when entering the pump.

Figure 1. In Step I, OP represents the object plane, HP the hologram plane, IP the image plane, and FD the focus distance. In Step II, the inputs of the model are the two reconstructed holograms; the output of the model is the label graph.

The Optical System of DHM
A coaxial optical path based on a Mach-Zehnder interferometer structure was set up to obtain the best interference and the highest interference contrast. The object and reference beams of the structure were strictly coaxial. Figure 2a shows a diagram of the DHM optical system and Figure 2b shows the setup of the experimental device. After passing through the expander, the laser (wavelength λ = 532 nm) is divided into an object beam and a reference beam by a beam splitter (BS). The cuvettes with the sample and reference liquids were set at the C1 and C2 positions, respectively. The beams passed through the cuvettes and microscope objectives (MO) and interfered after passing through the BS. To obtain a holographic image of the plankton, the interference pattern was recorded using a CMOS camera (AVT Manta G-419B PoE, 2048 × 2048 pixels, 5.5 µm × 5.5 µm pixel pitch). The purpose of the cuvette with the reference liquid is to minimize the impact of the liquid on imaging. The magnification of this optical system is 32.8 times, and the optical path length of the cuvette is 2 mm.
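From the stated system parameters, the effective sampling in the object plane follows directly; the short calculation below is a sanity check using only the magnification and pixel pitch given above:

```python
# Effective object-plane sampling of the DHM system described above.
magnification = 32.8    # optical magnification of the system
pixel_pitch_um = 5.5    # CMOS pixel pitch in micrometres
n_pixels = 2048         # sensor width in pixels

# One camera pixel corresponds to pixel_pitch / magnification in the object plane.
object_pixel_um = pixel_pitch_um / magnification   # ~0.168 um per pixel
field_of_view_um = n_pixels * object_pixel_um      # ~343 um lateral field of view
```

So each camera pixel samples roughly 0.17 µm in the object plane over a lateral field of view of about 343 µm, which is consistent with imaging micro-plankton cells.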

Image Process
The procedures for plankton detection are divided into three key steps.
Step I consists of the hologram acquisition and reconstruction, which is shown in Figure 1. The holograms are captured by a CMOS camera that records the amplitude and phase information of the object's light field. A total of 1771 original holograms were acquired using the setup introduced in Section 2.2, including 335 for ATEC, 451 for CMSH, 415 for JAMR, 251 for PLGD, and 319 for STNJ. The image planes of the different focus distances were recovered from the original holograms using reconstruction algorithms. The holograms of the five plankton species are shown in Figure 3.


Reconstruction
A hologram records the complex wave-front of the water body on the CMOS camera, and the intensity and phase of the cells in it can be recovered via numerical reconstruction of the hologram. Each reconstructed hologram recovers one object plane according to the focus distance, which is the actual optical path length from the object plane to the CMOS plane. Various methods can be used to perform the reconstruction, such as the convolution method (CM), the Fresnel transformation method (FTM), and the angular spectrum method (ASM) [22,23]. The size of the reconstructed holograms obtained by CM is consistent with the original holograms; however, CM takes a long time to reconstruct and can cause spectrum under-sampling. FTM is a rapid method for reconstruction; however, the size of the reconstructed holograms varies with the reconstruction distance, which results in poor reconstructed hologram quality. ASM was applied in this study due to its strong suppression of the under-sampling problem, its fast speed, and the consistent size of the images. ASM was employed to reconstruct the complex amplitude U(x, y, z) of the object, which is expressed as:

U(x, y, z) = F−1{Filter{F{H(x, y)}} · G(f_x, f_y, z)}    (1)

where H(x, y) is the recorded hologram, F and F−1 denote the Fourier transform and the inverse Fourier transform, respectively, Filter{·} is the frequency filter used to obtain the +1 order spectrum or the −1 order spectrum of the object, λ is the wavelength of the light source, z is the numerical propagation distance, and G(f_x, f_y, z) is the optical transfer function in the frequency domain corresponding to the propagation distance. The code for ASM was written by our team according to Equation (1). As can be seen in Figure 1, reconstructed holograms with different focus distances were obtained from one hologram. It takes 0.2858 s to obtain a reconstructed image on a computer with a GeForce GTX 1660 graphics card, 8.00 GB of RAM, and a 6-core i5-9400F 2.90 GHz CPU.
According to the different focal distances, the reconstructed holograms are shown in Figure 4. Figure 4a shows the original holograms obtained from the optical system. In Figure 4b, plankton cell A was reconstructed clearly. In Figure 4c, plankton cell B was in focus after reconstruction.
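The propagation step of Equation (1) can be sketched in a few lines of numpy. This is an illustrative implementation, not the authors' code: the function name `asm_reconstruct` is ours, the standard free-space angular-spectrum transfer function is assumed for G, and the Filter{·} step that isolates the +1 order of an off-axis hologram is omitted for brevity.

```python
import numpy as np

def asm_reconstruct(hologram, wavelength, pixel_size, z):
    """Angular-spectrum propagation of a hologram to distance z (a sketch).

    The Filter{.} step that extracts the +1 order of an off-axis hologram
    is omitted here; this propagates the full recorded field.
    """
    n, m = hologram.shape
    fx = np.fft.fftfreq(m, d=pixel_size)   # spatial frequencies along columns
    fy = np.fft.fftfreq(n, d=pixel_size)   # spatial frequencies along rows
    FX, FY = np.meshgrid(fx, fy)
    spectrum = np.fft.fft2(hologram)       # F{H(x, y)}
    # Free-space transfer function G(fx, fy, z); evanescent waves suppressed.
    arg = 1.0 - (wavelength * FX) ** 2 - (wavelength * FY) ** 2
    G = np.where(arg >= 0,
                 np.exp(1j * 2 * np.pi / wavelength * z * np.sqrt(np.maximum(arg, 0.0))),
                 0)
    return np.fft.ifft2(spectrum * G)      # F^-1{ spectrum * G }
```

A convenient self-test: with z = 0 the transfer function reduces to unity, so the input field is returned unchanged, and the output size always matches the input, which is the ASM property the text relies on.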

Proposed Method for Plankton Detection
In Step II, a new image fusion method, 'DHM-FCN', is proposed to fuse two reconstructed holograms of different object planes at one time. The fusion image contains all the in-focus cells in the imaged water. In the last step, the images in the dataset are analyzed by K-means clustering to obtain the sizes of the prior bounding boxes [24]. The detection model is trained with the bounding boxes and the dataset. The images are detected by YOLOv4, and in this way, graphical results on the categories, density, positions, shapes, and sizes of the plankton in the imaged water are obtained. These steps are shown in Figure 1.

Image Fusion
Image fusion combines information from multi-focus images of the same scene. The result of image fusion is a single image, which contains a more accurate description of the scene than any of the individual source images. This fused image is more useful and suitable for human vision, machine perception, or further image processing tasks [25]. For hologram processing, image fusion combines the reconstructed holograms of different object planes and brings all observed plankton cells into focus to be represented in a fusion image. Image fusion for reconstructed holograms was carried out by various algorithms, such as wavelet transform, which fuse the source images in the wavelet domain according to fusion rules [26]. In this paper, a new image fusion method, 'DHM-FCN', was proposed, and the DHM-FCN method was compared to the methods based on wavelet transformation with different rules and the pulse-coupled neural networks (PCNN) method [27].
The improved neural network is based on an FCN, which takes inputs of arbitrary size and produces correspondingly-sized outputs with efficient inference and learning. In this study, the inputs of the network are two reconstructed holograms with different focus distances from one original hologram. The output of the neural network is the fusion weight matrix of the inputs. The weights of the in-focus plankton regions in the first reconstructed hologram were set to 0 and the weights of the in-focus plankton regions in the second reconstructed hologram were set to 1. The other regions were background, and their weight values were set to 0.5. The diffraction fringes in the background of different reconstructed images differ, so the background weight values are important for smoothing the high-frequency diffraction fringes in the background. Taking the weight matrix as the pixel value matrix of a grayscale image, a visual representation of the weight matrix can be obtained, called a label graph, as shown in Figure 5. In order to speed up the algorithm, a simple structure was adopted for the neural network. The trainable parameters of the traditional FCN model for weight matrix regression are compared with those of the DHM-FCN model in Table 1. The simplified structure of DHM-FCN greatly increased the speed of the algorithm and reduced the time required for training. Considering the limitations of the device and the speed of the algorithm, the input of this experiment was uniformly scaled to 512 × 512. The structure diagram of the neural network is shown in Figure 1.
Each layer of data is a three-dimensional array of size h × w × d, where h and w are spatial dimensions and d is the feature or channel dimension. The first layer is the input, with a pixel size of 512 × 512 and two channels. The convolution layers are used to extract the image features. The functions of the pooling layers are to select the features extracted from the convolution layer, reduce the feature dimension, and avoid overfitting. Deconvolution layers are used to restore the size of the feature map and achieve the regression of weight values. The skip architecture defined in DHM-FCN uses eltwise+ layers that combine different feature maps to better regress the weight matrix. In the network, up-sampling layers enable weight value regression and learning in nets with subsampled pooling. The last layer is output, which is the weight matrix of the input images used for the reconstructed holograms' fusion.
The fused image F(x, y) is expressed as:

F(x, y) = A(x, y) ⊙ (O − P(x, y)) + B(x, y) ⊙ P(x, y)    (2)

where ⊙ denotes element-wise multiplication, O denotes a matrix in which all elements are 1, A(x, y) and B(x, y) are the first and second reconstructed holograms, respectively, and P(x, y) is the weight matrix of A(x, y) and B(x, y) obtained from the DHM-FCN. The structure of the DHM-FCN is shown in Table 2.

Training of the DHM-FCN model: In this work, each set of training data contained two reconstructed holograms and the corresponding manual weight matrix. The two reconstructed holograms in each group corresponded to different reconstruction distances of the same hologram. An optimizer implementing the Adam algorithm was used. The batch size was set to 1 with an initial learning rate of 0.001. The loss function was the mean square error:

Loss = (1 / (mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} (z_ij − y_ij)²    (3)

where m and n denote the scales of the weight matrix, and z_ij and y_ij are the predicted weight value and the real weight value, respectively. With this loss function, the DHM-FCN model achieved the regression of the weight values in the weight matrix.
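The fusion rule and the training loss are both simple element-wise operations; a minimal numpy sketch follows (function names are ours, not from the paper):

```python
import numpy as np

def fuse(a, b, p):
    """F = A .* (O - P) + B .* P: element-wise blend of two reconstructed
    holograms a and b using a weight matrix p with values in [0, 1]."""
    return a * (1.0 - p) + b * p

def mse_loss(pred, target):
    """Mean square error over an m x n weight matrix."""
    return float(np.mean((pred - target) ** 2))
```

A weight of 0 keeps a pixel from the first hologram, 1 takes it from the second, and the background value 0.5 averages the two, which is what smooths the differing diffraction fringes of the two reconstructions.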

Object Detection
In the past several years, computer vision, and object detection in particular, has boomed with the development of computing hardware. There are two types of object detection algorithms based on deep learning: two-stage and one-stage algorithms [28,29]. The two-stage algorithms first generate candidate regions containing the objects to be detected after feature extraction, and then perform classification and localization regression through classifiers. Two-stage algorithms include regional convolutional neural networks (R-CNN) [30], SPP-Net [31], fast regional convolutional neural networks (Fast R-CNN) [32], faster regional convolutional neural networks (Faster R-CNN) [29], etc. Such algorithms are characterized by high accuracy but slow speed. The one-stage algorithms do not generate candidate regions but directly classify and locate objects after feature extraction; they run faster than the two-stage algorithms. Many researchers have improved the one-stage algorithms, which now perform well on object detection tasks in various fields. Examples of one-stage neural networks include SSD [28], YOLO [33], and YOLOv4 [21]. The original YOLO model was developed by Redmon et al. in 2016 [33], and many have subsequently improved on it [34,35]. The model used in this paper was YOLOv4, an object detector designed for production systems and optimized for parallel computation and fast operation, developed by Bochkovskiy et al. in 2020 [21]. YOLOv4 can be trained and used with a conventional GPU with 8-16 GB of VRAM.
YOLOv4 is divided into three parts: the backbone, the neck, and the head. The backbone of the network is responsible for extracting features from the image, which largely determines the performance of the detector. Generally, the backbone is a convolutional neural network (CNN) that has been pre-trained on the ImageNet [36] or COCO [37] datasets. The head comprises the network layers after the backbone, randomly initialized during training, that perform the localization the backbone CNN cannot do on its own. Between the backbone and the head, some network layers are added to collect feature maps from different stages; these are usually called the neck.
YOLOv4 uses CSPDarknet53 [38] as the backbone, which is pre-trained on ImageNet, SPP [31] additional module and PANet [39] path-aggregation as the neck, and YOLOv3 (anchor-based) head, which is used to predict the classes and bounding boxes of objects.
Acquisition of prior bounding box: A prior bounding box is a bounding box with different sizes and different aspect ratios, which is preset in the image in advance. The detectors predict whether the box contains plankton and its coordinates. Prior bounding boxes allow the model to learn more easily. Setting prior bounding boxes of different scales increases the probability of a box that has a good match for the target object appearing. It is crucial to select the number and size of bounding boxes. In this paper, K-means clustering is used instead of manually designing anchors [24]. By clustering the bounding box of the training dataset, a set of bounding boxes that are more suitable for the dataset are automatically generated, which makes the detection effect of the network better. The K-means algorithm divides the data into k categories, where k needs to be preset. The length and width of plankton in the training dataset were used as initial data to experiment with K-means.
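The anchor-generation step can be illustrated with a few lines of numpy. The sketch below uses plain Euclidean k-means on (width, height) pairs, as the text describes; note that some YOLO implementations use an IOU-based distance instead, and the function name is ours:

```python
import numpy as np

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster bounding-box (width, height) pairs into k prior boxes
    using plain Euclidean k-means (an illustrative sketch)."""
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), size=k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest center.
        d = np.linalg.norm(wh[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned boxes.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = wh[labels == j].mean(axis=0)
    return centers[np.argsort(centers.prod(axis=1))]  # sorted by box area
```

The returned centers, sorted by area, can be assigned to the detector's output scales from small to large.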
Training of the YOLOv4 model: To improve the speed of the detection algorithm, the input of YOLOv4 was scaled to 608 × 608. Instead of training the model from randomly initialized weights, transfer learning was applied to speed up and optimize learning: the parameters of a well-trained model were taken as the initial values of our own model's parameters. The loss was divided into three parts: location, confidence, and category loss. The loss function adopted in this study was that proposed by Bochkovskiy et al. [21].
Evaluation criteria: Recall, average precision (AP), and the F1 score were used to evaluate the performance of the neural networks in this paper. Recall is expressed as:

Recall = TP / (TP + FN)    (4)

where TP is the number of positive samples that were correctly predicted and FN is the number of positive samples that were predicted to be negative. IOU refers to the ratio of the overlapping area to the combined area of two bounding boxes:

IOU = area(TB ∩ PB) / area(TB ∪ PB)    (5)

where TB means truth boxes and PB means predicted boxes. AP is an evaluation index that comprehensively considers recall and precision:

AP = ∫₀¹ p(t) dr(t)    (6)

where t indicates that when the IOU is larger than t, the prediction is assumed to have correctly detected the target, and p(t) is the precision when the recall is r(t). The F1 score is an evaluation index combining precision and recall, defined as:

F1 = 2 × Precision × Recall / (Precision + Recall)    (7)

The location of a cell was evaluated by the focus distance and the coordinates of the object detection bounding boxes. The size of a cell was evaluated from the size of its bounding box and the optical system parameters:

Size = BoxSize × (IR / Sc) × PS / Mag    (8)

where IR is the scale of the original hologram, PS is the size of the pixel, Mag is the magnification of the optical system, and Sc is the scale of the detector's input.
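These metrics are straightforward to compute; a small sketch follows. The (x1, y1, x2, y2) box format and the exact form of the size conversion are our assumptions, stated here for illustration only:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    return tp / (tp + fn)

def f1_score(precision, rec):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * rec / (precision + rec)

def cell_size_um(box_px, ir=2048, sc=608, ps=5.5, mag=32.8):
    """Physical cell size (um) from a box side measured in detector-input
    pixels: rescale to the original hologram (ir / sc), then to the
    object plane (ps / mag). The formula's form is our reconstruction."""
    return box_px * (ir / sc) * ps / mag
```

For example, a box spanning the full 608-pixel detector input maps back to the full 2048-pixel hologram and hence to the full ~343 µm field of view.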

Image Fusion Method
Data augmentation: The data for image fusion were reconstructed holograms. The performance of a neural network depends heavily on the size of its training dataset, and large datasets can increase the performance of the network. To obtain as many images as possible, data augmentation was applied to enlarge the dataset. In this study, there were a total of 2464 reconstructed holograms, forming 1232 input pairs. Brightness adjustment, horizontal flip, vertical flip, and diagonal flip were used for image augmentation. The dataset was divided into 80% for training the model, 10% for validation, and 10% for testing.
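The four augmentations listed above are one-liners in numpy; the sketch below assumes intensities normalised to [0, 1], and the brightness factor is purely illustrative:

```python
import numpy as np

def augment(image, brightness=1.2):
    """Generate the flip/brightness variants listed above for one image
    (intensities assumed in [0, 1]; the brightness factor is illustrative)."""
    return {
        "horizontal_flip": np.fliplr(image),
        "vertical_flip": np.flipud(image),
        "diagonal_flip": np.flipud(np.fliplr(image)),  # 180-degree rotation
        "brightness": np.clip(image * brightness, 0.0, 1.0),
    }
```

Applied to each of the 1232 hologram pairs (with the same transform applied to both holograms and the label graph), this multiplies the dataset size several-fold.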

Comparison of Different Image Fusion Methods
In this study, several image fusion methods were compared in terms of their processing speed and fusion effects. The inputs of all the algorithms are shown in Figure 4b,c. The results of the different methods are shown in Figure 6: Figure 6a–d show the results of PCNN, the regional wavelet transformation method (RW) [27], the pixel wavelet transformation method (PW) [19], and the method proposed in this paper (DHM-FCN), respectively. The weight matrix obtained by DHM-FCN is shown in Figure 5. PW considers the coefficients of each pixel individually, whereas RW divides the sub-images of the original image into different regions and determines the overall coefficient of each region according to its regional characteristics.
The time required by the different algorithms to process image fusion on the same computer is listed in Table 3. The structural similarity (SSIM), the image correlation coefficient (Cor), and the peak signal-to-noise ratio (PSNR) between the result and the input images were used to evaluate the performance of the algorithms. All the image fusion methods mentioned above fuse two reconstructed holograms at a time. The density of plankton is generally low in seawater, and it is rare for a hologram to contain more than two cells. However, during occurrences of harmful algae blooms, the density of plankton increases. In that case, two reconstructed holograms are fused first, and the result is then fused with the third reconstructed hologram, and so on, until the fusion image contains all the in-focus cells in the imaged water.
In the reconstructed holograms, the regions of blurred cells contained high-frequency diffraction fringes around plankton. Since the fusion of the high-frequency part by the WT method takes the maximum value, the pixels of the in-focus cells in the fused image are from the corresponding clearly reconstructed image and the pixels surrounding the cells are from the blurred reconstructed image. Speckle noise was caused by the fact that some pixels inside the clear plankton were taken from the blurred reconstructed image. This resulted in high-frequency diffraction fringes around the focused plankton and speckle noise in the fusion image, which degraded the quality of the result.
The weighted calculation of the reconstructed image with the weight matrix smoothed the background noise. DHM-FCN overcomes the problems of speckle noise and diffraction fringes in the wavelet transform method and discontinuous edge in PCNN. It is particularly important for microplankton detection, such as STNJ and ATEC, which are small and similar in morphology. If speckle noise is generated in the plankton, a lot of information can be lost, resulting in inaccurate identification.
DHM-FCN can quickly generate high-quality fusion images, which lays the foundation for the next step in object detection. Visually, DHM-FCN can effectively suppress background noise compared with other methods and the original images, and obtain more continuous cell boundaries. Statistically speaking, the large SSIM, Cor, and PSNR of DHM-FCN mean that this method recovers the plankton information from the input images efficaciously. DHM-FCN represents a significant improvement in the performance of fusion according to the evaluation metrics. DHM-FCN is 20 times faster than PW and RW and 5 times faster than PCNN. In this study, we adopted the DHM-FCN method to fuse the reconstructed images.
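The Cor and PSNR figures of merit used in this comparison can be reproduced with short routines; a sketch is below (SSIM involves local windows and is omitted for brevity; the peak value is an assumption for intensities normalised to [0, 1]):

```python
import numpy as np

def correlation(a, b):
    """Pearson correlation coefficient (Cor) between two images."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming intensities in [0, peak]."""
    mse = np.mean((np.asarray(reference, float) - np.asarray(test, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Higher Cor and PSNR between the fusion image and the in-focus regions of the inputs indicate that more of the plankton information was carried over into the fused result.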

Object Detection
The experimental results of K-means are presented in Figure 7. Setting more bounding boxes improved the performance of the model to a certain extent; however, the complexity also increased. To balance the accuracy and complexity of the model, the number of prior bounding boxes was selected according to the clustering results.

For data augmentation, the mosaic method used in this paper was a new technique proposed in YOLOv4. This method randomly uses four images for scaling, distribution, and stitching, which greatly enriches the detection dataset. In particular, random scaling adds many small targets, making the network more robust. The input data were the fusion images and the reconstructed holograms that contained only one cell; this differs from the dataset of DHM-FCN. The size of the original dataset was 1771 images. The dataset was divided into 80% for training the model, 10% for validation, and 10% for testing.
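The mosaic idea can be sketched in a much-simplified form. In the sketch below, the canvas size, the quadrant split at a random centre, and the nearest-neighbour resizing are our choices for illustration; real implementations also transform the bounding-box annotations along with the images:

```python
import numpy as np

def mosaic(images, size=608, seed=0):
    """Stitch four images into one size x size canvas split at a random centre
    (a simplified sketch of mosaic augmentation; labels are not handled)."""
    rng = np.random.default_rng(seed)
    cx = int(rng.integers(size // 4, 3 * size // 4))
    cy = int(rng.integers(size // 4, 3 * size // 4))
    canvas = np.zeros((size, size), dtype=float)
    quadrants = [(0, cy, 0, cx), (0, cy, cx, size),
                 (cy, size, 0, cx), (cy, size, cx, size)]
    for img, (r0, r1, c0, c1) in zip(images, quadrants):
        h, w = r1 - r0, c1 - c0
        # Nearest-neighbour resize of img into its quadrant.
        rows = np.arange(h) * img.shape[0] // h
        cols = np.arange(w) * img.shape[1] // w
        canvas[r0:r1, c0:c1] = img[np.ix_(rows, cols)]
    return canvas
```

Because the split point is random, each quadrant is scaled differently on every call, which is what produces the additional small targets mentioned above.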
The results of the plankton detection are shown in Table 4. The mAP of the test dataset was 97.69%, and the results of plankton detection are shown in Figure 8.
During hologram processing, the intensity of the reconstructed images varied with the reconstruction distance. In this study, YOLOv4 proved robust and efficient for images with different contrasts and intensities.

Conclusions
The large depth of field of DHM imaging makes it possible to inspect a water body of a certain thickness in a single shot. However, current detection methods discard the 3D information of holographic imaging and can only identify the species and measure the concentration of plankton.