Real-Time 3D Tracking of Multi-Particle in the Wide-Field Illumination Based on Deep Learning

In diverse research areas, such as mechanical measurements with holographic optical tweezers, studies of colloidal particle motion, cell tracking, and drug delivery, the localization and analysis of particle motion are of paramount importance. Algorithms ranging from conventional numerical methods to advanced deep-learning networks have brought substantial progress in particle orientation analysis. However, the scarcity of datasets has hindered the application of deep learning to particle tracking. In this work, we present an effective method for generating synthetic datasets for this domain; networks trained on these datasets remain robust and precise when applied to real-world 3D particle tracking data. We developed a real-time 3D particle positioning network based on the CenterNet network. In our experiments, the network achieved a horizontal positioning error of 0.0478 μm and a z-axis positioning error of 0.1990 μm. It can track particles of different sizes near the focal plane in real time with high precision. In addition, we have made all datasets produced during this investigation accessible.


Introduction
Particle tracking is an effective approach to measuring and analyzing the motion of tiny objects, ranging from single molecules to cells. It finds wide-ranging applications in fields such as colloid physics, medical treatments, biological science, pharmaceutical research, and cellular behavior studies. For instance, it can be applied in holographic optical tweezers to derive the Mean Squared Displacement (MSD) curve, calibrate the optical stiffness [1,2], or analyze the movements of self-propelled particles [3]. In the biological sciences, particle localization is used to observe how viruses invade cells [4,5] and to analyze the diffusion dynamics of nanoparticles near cell membranes [6]. It can also be used to count bacterial colonies [7]. In drug applications, the reliability of carriers can be assessed by tracking the carriers directly [8].
Traditional numerical calculation methods have demonstrated notable effectiveness in the domain of particle tracking [9][10][11][12]. Parthasarathy et al. [9] showed that the intensity of an imaged particle is radially symmetric around its center. Based on this principle, they designed a rapid and precise particle tracking algorithm capable of sub-pixel positioning accuracy. Similarly, Kashkanova et al. [10] introduced an image transformation technique to accentuate features possessing a high degree of radial symmetry. However, traditional numerical calculations come with numerous drawbacks, such as heavy computational demands, the inability to track multiple particles simultaneously, and the requirement for pre-set parameters. To overcome these challenges, researchers have begun to merge deep learning with particle localization, developing several high-performing tracking algorithms. Using Mie scattering theory to fit holograms is a common method for obtaining the three-dimensional information of particles. However, this method is computationally expensive and can only calculate the information for one particle at a time. More details can be found in the Supplementary Materials.
Nearly every method encounters the challenge of missing datasets and the complexity of optical systems. To address this issue, we chose wide-field illumination instead of a highly coherent laser source and developed a simple and efficient method for generating datasets. Acquiring holograms using wide-field illumination offers several advantages over single-wavelength irradiation. It is more straightforward to operate and can distinguish between particles located above and below the focus, which is hard to achieve with a single-wavelength laser source [24]. A single-wavelength laser source generates strong interference, which is why many researchers use a system based on a highly coherent laser source for tracking. Such a system can perform well when the observing environment is ideal. However, the same property also magnifies the structural defects of devices. For example, when tracking particles in microchannels (channel sizes ranging from several micrometers to hundreds of micrometers) to analyze drug delivery processes, Casimir force measurements, or the group behavior of multiple particles, the fine structure of the microchannels and the defects inside them add unwanted interference noise to the background generated by the laser source. This noise is unavoidable and limits the tracking performance. Wide-field illumination, in contrast, significantly reduces background noise, expanding the application scenarios of our system. Furthermore, datasets are integral to network training, yet most relevant works neither detail how their datasets were processed nor disclose them. Hence, this paper presents a straightforward, efficient, and low-cost 3D multi-particle tracking system. We used artificial datasets to train networks and then applied these networks to real-world situations. We provide a detailed description of the dataset creation process, which effectively fills a gap in this field regarding dataset creation.
Furthermore, experimental findings have confirmed that networks trained on synthetic datasets maintain their effectiveness when applied to real data. Obtaining a large-scale dataset with accurate and detailed annotations takes time and effort; our proposed method offers a simple and effective solution.
Sensors 2024, 24, 2583
Drawing inspiration from the CenterNet network, we approached particle localization as a special case of keypoint detection and integrated the concept of a Feature Pyramid Network (FPN) [25] to propose a novel 3D localization network. Our results demonstrate that this network achieves real-time, sub-micron planar and axial positioning accuracy on both artificial and real datasets. Figure 1a depicts the workflow for analyzing particle holography using deep learning. The trained network outputs heat maps, offset maps, and depth maps. The heat maps and offset maps together make it possible to calculate the horizontal position of the particles in each image (Figure 1b).
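As an illustration of how such output maps can be turned into coordinates, the following is a minimal sketch of generic CenterNet-style decoding, not the authors' implementation. The function name `decode_particles`, the output stride of 4, the score threshold of 0.3, and the array layouts are all assumptions made for the example.

```python
import numpy as np


def decode_particles(heatmap, offset, depth, stride=4, threshold=0.3):
    """Turn CenterNet-style output maps into (x, y, z) particle positions.

    heatmap: (H, W) center likelihoods in [0, 1]
    offset:  (2, H, W) sub-cell corrections for x and y
    depth:   (H, W) predicted z value at each cell
    stride and threshold are illustrative assumptions, not values from the paper.
    """
    h, w = heatmap.shape
    # Pad with -inf so border cells can be compared with a full 3 x 3 window.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(3) for dx in range(3)])
    # A cell is a peak if it equals the maximum of its 3 x 3 neighbourhood
    # and its score exceeds the threshold.
    is_peak = (heatmap >= windows.max(axis=0)) & (heatmap > threshold)
    rows, cols = np.nonzero(is_peak)
    particles = []
    for r, c in zip(rows, cols):
        x = (c + offset[0, r, c]) * stride  # back to input-image pixels
        y = (r + offset[1, r, c]) * stride
        particles.append((float(x), float(y), float(depth[r, c])))
    return particles


# Toy example: one synthetic peak at row 10, column 20.
heat = np.zeros((32, 32))
heat[10, 20] = 0.9
off = np.zeros((2, 32, 32))
off[0, 10, 20], off[1, 10, 20] = 0.25, 0.5
dep = np.full((32, 32), -1.5)
result = decode_particles(heat, off, dep)
print(result)
```

The heat map selects the grid cell, the offset map restores the sub-cell precision lost to downsampling, and the depth map is simply read out at the detected peak.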

Methodology

Production of Datasets
To train this network, we needed to prepare a substantial quantity of data with our optical tracking system. Manually labeling a large volume of images and locating each particle individually proved challenging and time-consuming. The dataset production method we propose solves this problem well.
A critical element of this process was the calibration of each particle. Low labeling accuracy degrades both the predictive capability of the network and its convergence during training. In this experiment, we employed a high-precision stage to calibrate the particles at each location and capture the feature images. Because the polystyrene beads, with radii of 1 µm and 2 µm, were affixed to the cover glass, we could neglect displacement caused by Brownian motion. We scanned the beads along the z-axis from 10 µm below to 10 µm above the focal plane, dividing this range into 400 segments of 50 nm each.
We then captured the feature images of the 1-µm and 2-µm beads at each position (Figure 2a). Following this procedure, we could calibrate 800 images of beads at differing positions. As shown in Figure 2b, because each feature image contained only one object, there were virtually no instances of inaccurate localization [9,10]. The feature images were then randomly spliced into a single image (Figure 2e) [26]. The differences between the synthesized and original images are random and therefore do not mislead the network into making incorrect judgments during training (Figure 2d). This approach enabled us to generate many bead images, each with well-calibrated x, y, and z positions. Figure 2c shows some synthetic images.
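The splicing step can be sketched as follows. This is a simplified illustration under stated assumptions rather than the procedure actually used: the function `splice_synthetic_image`, the uniform background value, the rejection of overlapping placements, and the choice of the patch center as the bead center are all hypothetical, and random arrays stand in for measured feature images.

```python
import numpy as np


def splice_synthetic_image(patches, canvas_size=1000, n_particles=5,
                           background=0.5, rng=None, max_tries=100):
    """Paste single-bead feature images at random, non-overlapping positions.

    patches: list of (image, z_um, radius_um) tuples, where image is a 2D
             array holding one calibrated bead (assumed centered).
    Returns the synthetic image and a list of (x_px, y_px, z_um, radius_um)
    labels, with x and y taken at the center of each pasted patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    canvas = np.full((canvas_size, canvas_size), background, dtype=float)
    occupied = np.zeros((canvas_size, canvas_size), dtype=bool)
    labels = []
    for _ in range(n_particles):
        image, z_um, radius_um = patches[rng.integers(len(patches))]
        ph, pw = image.shape
        for _ in range(max_tries):
            top = int(rng.integers(0, canvas_size - ph + 1))
            left = int(rng.integers(0, canvas_size - pw + 1))
            # Accept the position only if it does not overlap earlier patches.
            if not occupied[top:top + ph, left:left + pw].any():
                canvas[top:top + ph, left:left + pw] = image
                occupied[top:top + ph, left:left + pw] = True
                labels.append((left + pw / 2, top + ph / 2, z_um, radius_um))
                break
    return canvas, labels


# Toy usage with random arrays standing in for measured feature images.
rng = np.random.default_rng(0)
toy_patches = [(rng.random((64, 64)), z, r)
               for z in (-5.0, 0.0, 5.0) for r in (1.0, 2.0)]
synthetic, labels = splice_synthetic_image(toy_patches, rng=rng)
print(synthetic.shape, len(labels))
```

Because every pasted patch carries its stage-calibrated depth and its paste position, each synthetic image comes with labels for all particles at no extra annotation cost.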

Structure of the Network
Deep learning plays a crucial role in the field of object detection. Researchers have not only proposed numerous excellent network architectures but also introduced innovative algorithms such as the Single Shot MultiBox Detector (SSD) [27] and the Feature Pyramid Network (FPN). Figure 3a presents the flowchart of our network. Because our network builds on the concept of the CenterNet network, we refer to it as CenterXFNet for convenience. It consists of three primary components: the Backbone, Neck, and Head modules.
For the Backbone network, we selected ResNet50, a widely used and well-established choice among backbone networks. The Neck module primarily serves the purpose of information fusion. A conventional CenterNet network performs only three standard upsampling operations during the Neck phase. Our images, however, have distinct characteristics: the edges of the particle pattern are softly blurred, and images at different depths differ only subtly. Softly blurred edges make localization more difficult, and minor differences between images at different depths make depth prediction difficult for the network. These characteristics call for a better approach to blending features. Our proposed Neck module draws on the principles of FPN, as illustrated in Figure 3a. It mainly comprises several ELAN and SPP + CSP modules [16], as shown in Figure 3b,c. Tables 1 and 2 show the relevant parameter settings in the Neck module. This design significantly enhances the informational density of the subsequent features. Moreover, the module can also be integrated into other network architectures, potentially improving their performance in object detection and localization tasks.

Sample Preparation
We chose 1-µm and 2-µm polystyrene beads to prepare the standard calibration solution. We centrifuged (1200 rpm) the beads and transferred them into a sodium chloride solution (1 M). We then deposited 120 µL of this solution into the spacer, which was placed in an oven (70 °C) overnight. This process was designed to ensure that the beads would adhere to the surface of the cover glass. Following this, we thoroughly and gently cleaned the cover glass to eliminate any beads that had not bonded. Finally, we filled the spacer with deionized water and carefully sealed it.

Tracking System Setup
We chose wide-field illumination (Figure 4) to generate the dataset. A nano piezo z-stage (MCL), used in conjunction with the x-y stage (Nikon), calibrates the spatial coordinates. Given the high accuracy required along the z-axis, each step measures 50 nm along this axis and 1 µm in the x-y plane. Movement is regulated by a PC-controlled stage, and a camera (Basler acA5472-17µmMED) captures several images at each position for calibration. The video is then processed by a PC to track the 3D positions of multiple particles in real time. The tracking software used in the experiments is self-developed.

Training and Evaluation
We created a dataset comprising 12,000 images with our data production method. The training set consisted of 8400 images, while the evaluation set contained 3600 images. The dataset included two types of particles, with radii of 1 µm and 2 µm, positioned along the z-axis within a range of +10 µm to −10 µm from the focal plane. The image size varied between 950 and 1050 pixels, with each pixel corresponding to approximately 23.3 nm.
We used two types of test datasets: artificial and real. The artificial dataset comprised 3352 targets generated using the same method as the training dataset. The real dataset consisted of 2976 objects captured by the microscope during the experiment. For further training details, please refer to the Supplementary Materials.
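The 8400/3600 division corresponds to a 70/30 split of the 12,000 images. A minimal sketch of a reproducible split is given below; the random shuffling, the seed, and the helper name `split_dataset` are assumptions for illustration, since the text does not state how the images were assigned to each set.

```python
import random


def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into training and evaluation sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical image identifiers 0..11999 stand in for the image files.
image_ids = range(12_000)
train_ids, eval_ids = split_dataset(image_ids)
print(len(train_ids), len(eval_ids))  # 8400 3600
```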

Data Tracking Analysis
We evaluated the performance of our network using both artificial and real test datasets, and the corresponding results are presented in Figure 5. All three networks achieve sub-micron positioning accuracy on both the artificial and the real datasets. Specifically, the CenterNet network outperforms the CenterXFNet network on the artificial dataset, with the lowest horizontal and vertical positioning errors recorded at 0.0512 µm and 0.1740 µm, respectively. Meanwhile, the CenterXFNet-ResNet50 network exhibits significantly better positioning capability on the real dataset than the other two networks, with the lowest horizontal and vertical positioning errors measured at 0.0478 µm and 0.1990 µm, respectively (Figure 5a). There are substantial disparities between real and artificial data, which account for the differing positioning performance of the networks across the two datasets. The strong performance of CenterXFNet on the real dataset suggests that, in contrast with the traditional CenterNet network, our proposed network learned features that are genuinely useful for particle positioning during training rather than merely fitting ineffectual features present in the artificial dataset.
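For concreteness, errors of this kind can be computed as in the sketch below. It is an illustration only: taking the horizontal error as the Euclidean x-y distance is an assumption (the exact definition is not spelled out here), the function name `localization_errors` is hypothetical, and the sample coordinates are made up. The pixel size of 23.3 nm is the value stated for the dataset.

```python
import numpy as np

PIXEL_SIZE_UM = 0.0233  # 23.3 nm per pixel, as stated for the dataset


def localization_errors(pred_xy_px, true_xy_px, pred_z_um, true_z_um):
    """Return mean horizontal and vertical absolute errors in micrometers.

    Horizontal error is taken here as the Euclidean x-y distance, which is
    an assumption made for this sketch.
    """
    pred_xy = np.asarray(pred_xy_px, dtype=float)
    true_xy = np.asarray(true_xy_px, dtype=float)
    horizontal_um = np.linalg.norm(pred_xy - true_xy, axis=1) * PIXEL_SIZE_UM
    vertical_um = np.abs(np.asarray(pred_z_um, dtype=float)
                         - np.asarray(true_z_um, dtype=float))
    return float(horizontal_um.mean()), float(vertical_um.mean())


# Two made-up particles: x-y misses of 5 px and 0 px, z misses of 0.2 and 0.1 um.
h_err, v_err = localization_errors(
    pred_xy_px=[[103.0, 204.0], [50.0, 60.0]],
    true_xy_px=[[100.0, 200.0], [50.0, 60.0]],
    pred_z_um=[1.2, -3.1],
    true_z_um=[1.0, -3.0],
)
print(h_err, v_err)
```

At 23.3 nm per pixel, a horizontal error of 0.0478 µm corresponds to roughly two pixels in the original image.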
Figure 5a illustrates the significant impact of particle size on positioning accuracy, particularly in the vertical direction. The network exhibits higher prediction accuracy for large particles than for small particles. This result is logical, as achieving precise positioning for small objects is a significant challenge in object detection.
Figure 5b,c shows the distributions of the horizontal and vertical errors in the real and synthetic test datasets. It is worth emphasizing that particle depth does not significantly affect prediction accuracy. In most research papers, obtaining depth information for particles above or below the focal plane is challenging, as the particle patterns observed in other experiments show no significant differences. However, the wide-field illumination technique we employed enables us to observe the differences between these particle patterns easily.
Figure 6a displays the performance of CenterXFNet-ResNet50. This network demonstrates outstanding positioning capability in both artificial and real images. The particle tracking video can be found in the referenced document (Video S1) for a more comprehensive understanding.
We also compared the CenterXFNet network with the Mask-RCNN network. We trained each network separately for 200 epochs. Figure 6b shows the comparative results of the two networks on the artificial dataset. Mask-RCNN demonstrates higher vertical prediction accuracy than the CenterXFNet network, whereas the opposite holds for horizontal positioning (Figure 6b). On the real dataset, however, Mask-RCNN's positioning ability is so poor that it cannot be meaningfully evaluated (Figure 6c,d). We believe that, similar to the traditional CenterNet, Mask-RCNN was misled by irrelevant information in the artificial dataset during the learning process and failed to learn the correct features. This further indicates that the CenterXFNet we proposed has superior feature learning capabilities. Additionally, the Mask-RCNN network struggles to meet real-time detection requirements.
To verify the reliability and reproducibility of our network, we conducted a series of experiments (Tables 3 and 4). In the horizontal direction, the CenterXFNet network without ELAN shows better positioning performance (Table 3). However, the CenterXFNet network demonstrates more accurate localization in the vertical direction, especially for large particles (Table 4). The SPP + CSP and ELAN modules play different roles in predicting the three-dimensional information of beads. The ELAN module plays a very positive role in predicting the horizontal position of beads. However, the SPP + CSP and ELAN modules are both essential for z-axis prediction. In our application, we value inference ability in the vertical direction more, so we chose to combine the two modules.

Real-Time Tracking
We deployed the trained network for real-time tracking and compared its performance on large and small particles (Figure 7). To test the tracking system's performance, the high-density particle mixture solution was allowed to settle within spacers for an adequate amount of time, causing the particles to deposit near the bottom of the cover glass and then stick to its surface. In this way, the displacement of the stage can be considered the same as that of the beads, excluding the influence of the Brownian motion of the beads themselves. We subsequently moved the electronic stage to heights of −5 µm, −1 µm, 0 µm, 1 µm, 5 µm, and 10 µm (relative to the focal plane) and conducted real-time tracking, assessing the system's reliability by comparing the tracking results along the z-axis with the movements of the stage.
Figure 7b-d shows the tracking results of three different networks. The CenterXFNet-ResNet50 network clearly performs best: across the range from −5 µm to 10 µm, most of the relative tracking data of bead movements fall within a range of less than 10% of the bead size itself. We note that in the extensive real-time particle tracking experiments, the particle prediction results fluctuated because of particle collisions. In most cases, the networks with our proposed structure added in the Neck phase showed better predictions than the original networks. Comparison results can be found in Figures S3 and S4.
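One way to quantify this kind of comparison is sketched below. It assumes that the criterion is the absolute deviation of the tracked z value from the stage height, measured against 10% of the bead size; this reading, the function name `fraction_within_tolerance`, and the sample values are illustrative assumptions, not measured data.

```python
import numpy as np


def fraction_within_tolerance(tracked_z_um, stage_z_um, bead_size_um,
                              tolerance=0.10):
    """Fraction of tracked z values within tolerance * bead size of the stage height."""
    deviation = np.abs(np.asarray(tracked_z_um, dtype=float) - stage_z_um)
    return float(np.mean(deviation <= tolerance * bead_size_um))


# Made-up tracking samples for a 4-um bead held at a stage height of 5 um.
samples = [5.1, 4.8, 5.3, 4.5, 5.0]
print(fraction_within_tolerance(samples, stage_z_um=5.0, bead_size_um=4.0))
```

Repeating this for each stage height gives a single number per height that can be compared across networks.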

Conclusions
In our network, we incorporate the concept of keypoint detection, achieving sub-micron precision in both horizontal position and vertical depth. The network is designed as an end-to-end system with a simplified architecture, which facilitates debugging and enhancement. We propose a method for dataset generation that is both rapid and straightforward. We also propose using wide-field illumination instead of single-wavelength illumination, which reduces costs and decreases noise, expanding the applicability of our method. Importantly, we demonstrate that networks trained on these synthetically produced datasets are also effective when applied to real data and real-time tracking, which can provide guidance for colloid biophysics and for drug delivery research in medical treatment.
Using Mie scattering theory to obtain the three-dimensional positions of particles has several drawbacks, including significant computational complexity, the inability to track multiple particles simultaneously, and the need for predefined parameters [28]. In contrast, our proposed algorithm enables real-time multi-particle tracking without prior knowledge. The successful implementation of our algorithm suggests that the application of Mie scattering theory should be reassessed. It also inspires us to explore the potential integration of deep learning with other fields. For instance, applying the Rayleigh-Sommerfeld formula for light field reconstruction is computationally expensive and requires the optimal target points to be filtered manually [28]. This encourages us to investigate whether deep-learning methods can effectively address these limitations and expand their application scope.
At present, our network is capable of detecting spherical particles. With appropriate training, however, the detector should also be able to localize particles of other shapes, including rods, cones, or irregular shapes.
A large-scale dataset with accurate and detailed calibration information is indispensable for addressing the above-mentioned issue. However, obtaining such image data is troublesome and challenging. In the absence of a large-scale dataset, we should consider employing unsupervised learning methods to broaden the applicability of the locator to various scenarios. The emergence of large-scale networks such as CLIP [29] also opens the possibility of using one-shot and few-shot learning methods. The simplicity of our network leaves room for further optimization to achieve higher precision in particle positioning. Moreover, further simplifying the network and deploying it to an edge computing device could significantly enhance its efficiency.

Figure 1. (a) Schematic representation of particle holograms analyzed using deep learning (Red square: Big particle; Blue square: Small particle). (b) Illustration of how the particles' centers are calculated.

Figure 2. Schematic diagram of the process of creating artificial datasets. (a) Experimental images with different depths. The red and blue boxes indicate the small and large beads, respectively. (b) Feature images at different depths, with the red and blue boxes representing the small and large beads. The focal plane is at 50 µm. (c) Multiple particles with different depths in a synthetic image. (d) Comparison between artificial and original images, showing small differences characterized by randomization. (e) Schematic overview of the image fusion process.

Figure 3. Schematic overview of the Estimator network. (a) Schematic diagram of network flow. A hologram is scaled to a standard size of 1024 × 1024 pixels before being fed into the network. (b,c) The schema of ELAN and SPP + CSP; 1 × 1 and 3 × 3 denote the size of the convolution kernel; c1 and c2 denote the number of channels of the feature maps.

Figure 4. The optical path of the tracking system.

Figure 5. (a) The absolute error of horizontal and vertical positions in artificial data and real data. The red line indicates the standard deviation of multiple calculations. Horizontal and vertical error distributions in (b) real and (c) synthetic test datasets, respectively.

Figure 7. (a) Hologram of a big particle (4-µm PS bead) and a small particle (2-µm PS bead). The depth positions of the particles are ordered from left to right, from above to below the focal plane. Real-time tracking and the tracking data distribution along the z-axis for (b) CenterXFNet-ResNet50, (c) CenterXFNet-ResNet101, and (d) CenterXFNet-DLA34, respectively. The left column shows the tracking results for the big particle. The right column shows the tracking results for the small particle.

Table 4. Vertical localization accuracy for different training parameters and network structures.

Table 3. Horizontal localization accuracy for different training parameters and network structures.