# Single-Photon Tracking for High-Speed Vision

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. Motion Detection

The bit-planes $L_i$ ($i = 1, 2, \ldots, n$) will generally feature considerable photon noise, making it difficult to detect motion (or other intensity changes in the image) at the level of individual jots. We therefore use a kernel (or “cubicle”) of size $\{N_x, N_y, N_t\}$, designed to be small enough to capture spatial and temporal variations in the image scene, to compose aggregated “test” frames $I_k$ ($k = 1, 2, \ldots, n/N_t$). The kernel is applied in an overlapping manner in space and in a non-overlapping manner in time. For the analysis that follows, we shall presume that jots have a uniform photon detection efficiency $\eta$ and that dark counts are negligible compared with true photon detections. Assuming uniform illumination across a cubicle, and following the analysis of Reference [13], the value of each resulting pixel will be a binomial count from $M = N_x \times N_y \times N_t$ trials with success probability $P_{x,y,k} = 1 - e^{-\eta H}$, where $H$ is the quanta exposure (the mean number of photon arrivals per jot during the bit-plane exposure time $\tau$) corresponding to the cubicle. Thus, confidence bounds can be attached to each pixel $I_{x,y,k}$ as to the “true” underlying photon arrival rate. More precisely, we calculate an (approximate) confidence interval on the estimated underlying success probability, based on the pixel value $m$ of $I_{x,y,k}$, using the Agresti-Coull method [16]:
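The aggregation and interval steps can be sketched in code as follows. This is a minimal illustration rather than the authors' implementation: the function names, the use of “valid” spatial windows only (no border padding), and the default critical value z = 1.96 (a 95% level) are all assumptions. The interval is the standard Agresti-Coull form, with adjusted trial count $M + z^2$ and adjusted proportion $(m + z^2/2)/(M + z^2)$. Plain Python lists are used to keep the sketch free of dependencies.

```python
import math


def aggregate(bitplanes, nx, ny, nt):
    """Sum binary bit-planes over cubicles of size {nx, ny, nt}.

    Windows overlap in space (stride 1, 'valid' positions only, an assumption)
    and do not overlap in time. bitplanes[t][y][x] holds 0 or 1.
    """
    height, width = len(bitplanes[0]), len(bitplanes[0][0])
    frames = []
    for k in range(len(bitplanes) // nt):
        # Collapse the nt bit-planes of this time block into one count map.
        block = bitplanes[k * nt:(k + 1) * nt]
        summed = [[sum(bp[y][x] for bp in block) for x in range(width)]
                  for y in range(height)]
        # Slide the nx-by-ny spatial window over the count map.
        frames.append([[sum(summed[y + dy][x + dx]
                            for dy in range(ny) for dx in range(nx))
                        for x in range(width - nx + 1)]
                       for y in range(height - ny + 1)])
    return frames


def agresti_coull(m, M, z=1.96):
    """Approximate confidence interval for a binomial proportion (m of M)."""
    n_adj = M + z * z
    p_adj = (m + z * z / 2.0) / n_adj
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def exposure_from_probability(p, eta=1.0):
    """Invert P = 1 - exp(-eta * H) to obtain the quanta exposure H."""
    return -math.log(1.0 - p) / eta


# Usage: four 2x3 bit-planes, aggregated with a {2, 2, 2} cubicle.
bitplanes = [[[1, 0, 1], [0, 1, 0]] for _ in range(4)]
test_frames = aggregate(bitplanes, nx=2, ny=2, nt=2)
lower, upper = agresti_coull(test_frames[0][0][0], M=2 * 2 * 2)
```

Bounds on the success probability can be mapped to bounds on $H$ with `exposure_from_probability`, since the mapping is monotonic.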

We then compute difference frames $D_k$, mapping statistically significant changes in pixel values between “test” frames $I_k$:

Alternatively, $D_k$ may be computed by comparing $I_k$ against a fixed reference frame, say $I_0$, representing the background of the scene, if such a frame is available.
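A minimal sketch of the difference-frame step is given below. The thresholding rule is an assumption made for illustration: a pixel is set to +1 when its confidence interval lies entirely above that of the comparison frame, to −1 when it lies entirely below, and to 0 when the intervals overlap. The lag parameter `d` mirrors the value of d quoted in the figure captions, and its interpretation as a frame lag is likewise assumed. All function names are hypothetical.

```python
import math


def agresti_coull(m, M, z=1.96):
    """Approximate confidence interval for a binomial proportion (m of M)."""
    n_adj = M + z * z
    p_adj = (m + z * z / 2.0) / n_adj
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def difference_frame(current, earlier, M, z=1.96):
    """Flag pixels whose intervals are disjoint (assumed criterion).

    Returns +1 where `current` is significantly brighter than `earlier`,
    -1 where it is significantly darker, and 0 elsewhere.
    """
    out = []
    for row_c, row_e in zip(current, earlier):
        out_row = []
        for m_c, m_e in zip(row_c, row_e):
            lo_c, hi_c = agresti_coull(m_c, M, z)
            lo_e, hi_e = agresti_coull(m_e, M, z)
            if lo_c > hi_e:
                out_row.append(1)
            elif hi_c < lo_e:
                out_row.append(-1)
            else:
                out_row.append(0)
        out.append(out_row)
    return out


def difference_frames(test_frames, M, d=1, z=1.96):
    """Compare each test frame with the one d frames earlier."""
    return [difference_frame(test_frames[k], test_frames[k - d], M, z)
            for k in range(d, len(test_frames))]
```

Comparison against a fixed background frame corresponds to calling `difference_frame(test_frames[k], reference, M)` with the same `reference` for every `k`.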

#### 2.2. Clustering and Object Tracking

The $D_k$ frames will thus show clouds of points (pixels of value 1 or −1) corresponding to moving objects in the scene. We can separate individual objects by applying Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [17] to these points (using prior knowledge of the likely size of moving objects), which also identifies and filters out “outlier” points in $D_k$ resulting from photon noise. This enables regions of interest (bounding boxes, with a level of padding) to be determined around detected object motions in $D_k$. We then take each region of interest in turn (index $j$) and assume that the detected objects may be modeled as planar objects in three-dimensional (3D) space. This can be a reasonable approximation even for non-planar objects if they are at a sufficient distance from the camera and any rotation is restricted to around the depth axis. Under this assumption, we estimate the transformation $T_{j,k}$ between successive $D_k$ frames to quantify the motion (Figure 2). The transformation matrices $T_{j,k}$ may be computed iteratively, by optimizing a similarity metric (e.g., via a gradient descent approach, as in Matlab’s imregtform function [18]). Transformations of varying degrees of freedom may be assumed, such as a rigid transformation (accounting for linear motion and rotation), a similarity transformation (also including scaling, i.e., the object moving closer to or further away from the camera) or a projective transformation (if there is a change in perspective). For a similarity transformation, $T_{j,k}$ is of the form:
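For reference, one common parameterization of a 2D similarity transformation is shown below. It is given as an assumption, since the symbols $s$ (scale), $\theta$ (rotation angle) and $t_x$, $t_y$ (translation) are introduced here for illustration, and the matrix is written for homogeneous column vectors $(x, y, 1)^{T}$:

$$
T_{j,k} =
\begin{pmatrix}
s\cos\theta & -s\sin\theta & t_x \\
s\sin\theta & s\cos\theta & t_y \\
0 & 0 & 1
\end{pmatrix}
$$

Some toolboxes store the transpose of this matrix and apply it to row vectors, so the layout should be checked against the library in use. A rigid transformation corresponds to the special case $s = 1$.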

An alternative, centroid-based estimation of $T_{j,k}$ is possible (when $D_k$ is computed with respect to a fixed reference frame). We ignore the polarity of the pixel values in $D_k$ and, for each object $j$, compute the centroid $\{X_j, Y_j\}$ of points within the corresponding bounding box. We then estimate the translation of the object by calculating the change in $\{X_j, Y_j\}$ between successive $D_k$ frames. Scaling is obtained by calculating the mean Euclidean distance of points from $\{X_j, Y_j\}$ (again tracking how this changes from frame to frame). For estimating rotation, we consider the spatial variance of the points in $x$ and $y$ ($V_x$, $V_y$) as, using basic trigonometry, it can be shown (for a general point cloud) that $V_x - V_y = C\cos(2\theta + \varphi)$, where $C$ and $\varphi$ are constants and $\theta$ is the orientation with respect to the Cartesian axes. We note that tracking (and in particular centroiding) based on maps of changed pixel values is a very common approach in machine vision [19]. The difference here is the acquisition of photon count data (giving rise to the pixel thresholding criteria in Equation (2)), and the requirement for the full two-dimensional (2D) motion (including, for example, the orientation) of the object to be tracked for reconstruction purposes.
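The point-cloud statistics described above can be sketched as follows. The function names are hypothetical. The variance relation follows from rotating a centred cloud with second moments $\sigma_{uu}$, $\sigma_{vv}$, $\sigma_{uv}$ by $\theta$, which gives $V_x - V_y = (\sigma_{uu} - \sigma_{vv})\cos 2\theta - 2\sigma_{uv}\sin 2\theta$, i.e., $C\cos(2\theta + \varphi)$ with $C\cos\varphi = \sigma_{uu} - \sigma_{vv}$ and $C\sin\varphi = 2\sigma_{uv}$. Recovering $\theta$ itself requires $C$ and $\varphi$ to be known (for example from earlier frames), which this sketch does not attempt.

```python
import math


def cloud_statistics(points):
    """Centroid, mean distance from the centroid, and V_x - V_y of a cloud."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_dist = sum(math.hypot(x - cx, y - cy) for x, y in points) / n
    vx = sum((x - cx) ** 2 for x, _ in points) / n
    vy = sum((y - cy) ** 2 for _, y in points) / n
    return (cx, cy), mean_dist, vx - vy


def motion_between(prev_points, curr_points):
    """Translation and relative scale between two clouds of the same object."""
    (px, py), prev_dist, _ = cloud_statistics(prev_points)
    (cx, cy), curr_dist, _ = cloud_statistics(curr_points)
    return (cx - px, cy - py), curr_dist / prev_dist


def rotate(points, theta):
    """Rotate points about the origin by theta radians (test helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]
```

Because only first and second moments of the cloud are used, the cost per object is linear in the number of changed pixels.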

[…]$_k$ (or when a newly emerging motion is identified), between which the boxes are adapted according to the estimated motion of the enclosed object.
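The clustering and bounding-box step of this section can be illustrated with a compact, dependency-free DBSCAN. This is a sketch under stated assumptions, not the implementation used for the results: the neighbourhood radius `eps`, the minimum neighbour count `min_pts`, the padding value and all function names are illustrative choices, and the brute-force neighbour search is only suitable for small point clouds.

```python
def nonzero_points(diff_frame):
    """Collect (x, y) coordinates of changed pixels, ignoring polarity."""
    return [(x, y) for y, row in enumerate(diff_frame)
            for x, value in enumerate(row) if value != 0]


def region_query(points, i, eps):
    """Indices of all points within distance eps of point i (inclusive)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if (x - xi) ** 2 + (y - yi) ** 2 <= eps * eps]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)          # None marks an unvisited point
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1                 # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep growing
    return labels


def bounding_boxes(points, labels, padding=2):
    """Padded (x_min, y_min, x_max, y_max) box for each cluster."""
    boxes = {}
    for (x, y), label in zip(points, labels):
        if label < 0:
            continue                        # skip outliers
        x0, y0, x1, y1 = boxes.get(label, (x, y, x, y))
        boxes[label] = (min(x0, x), min(y0, y), max(x1, x), max(y1, y))
    return {c: (x0 - padding, y0 - padding, x1 + padding, y1 + padding)
            for c, (x0, y0, x1, y1) in boxes.items()}
```

In practice `eps` and `min_pts` would be chosen from prior knowledge of the likely object size, as noted earlier in this section.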

#### 2.3. Reconstruction of Objects and Background

[…] $G_i$, at the native resolution of the camera, with minimal motion artefacts.

#### 2.4. Practical Implications

The choice of cubicle size $\{N_x, N_y, N_t\}$ for generating the test frames is influenced by several factors. Spatially, the size of the cubicle should not exceed the size of salient features on the object, otherwise these will be averaged out, making it harder to ascertain the motion of the object. Similarly, the level of aggregation in time ($N_t$) should be set so as to capture the motion with adequate temporal detail. At the same time, the total number of jots aggregated ($N_x \times N_y \times N_t$) must be sufficiently large, given the level of contrast between the object and the background (or within the object), to allow pixel changes arising from the motion to be detected with high statistical certainty. This can be assessed numerically, by determining the aggregation at which a given change in the underlying photon rate $H$ (or normalized exposure) of test frame pixels is flagged (as per Equation (2)) with a certain sensitivity (for simplicity, we assume here a photon detection efficiency of $\eta$ = 100%). Figure 4 shows the results for a change from $H_1$ to $H_2$, with a range of photon rates, from 0.03 to 3 mean detected photons per jot per exposure, being considered in each case. As expected, the closer $H_1$ and $H_2$ are, the larger the aggregation required to discriminate between them. In the case where there is prior information on the object, the results can serve as a guide for selecting $N_t$ (once the spatial aggregation $N_x \times N_y$ has been fixed). One can also envisage an iterative means of choosing $N_t$, whereby a low level of temporal aggregation is first carried out, which is then progressively increased until motion is detected in the scene.
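The numerical assessment can be approximated with a small Monte Carlo sketch. The assumptions are: aggregated pixels are drawn as binomial counts with success probability $1 - e^{-\eta H}$ (the model of Section 2.1), a change is counted as flagged when the two Agresti-Coull intervals are disjoint, and the run count is kept far below the 10,000 pairs quoted for Figure 4 so that the example runs quickly. Function names and defaults are hypothetical.

```python
import math
import random


def agresti_coull(m, M, z=1.96):
    """Approximate confidence interval for a binomial proportion (m of M)."""
    n_adj = M + z * z
    p_adj = (m + z * z / 2.0) / n_adj
    half = z * math.sqrt(p_adj * (1.0 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)


def binomial_draw(rng, trials, p):
    """Draw one binomial count as a sum of Bernoulli trials."""
    return sum(1 for _ in range(trials) if rng.random() < p)


def sensitivity(h1, h2, nx, ny, nt, runs=500, eta=1.0, z=1.96, seed=0):
    """Fraction of simulated pixel pairs in which h1 -> h2 is flagged."""
    rng = random.Random(seed)
    M = nx * ny * nt
    p1 = 1.0 - math.exp(-eta * h1)
    p2 = 1.0 - math.exp(-eta * h2)
    flagged = 0
    for _ in range(runs):
        lo1, hi1 = agresti_coull(binomial_draw(rng, M, p1), M, z)
        lo2, hi2 = agresti_coull(binomial_draw(rng, M, p2), M, z)
        if lo2 > hi1 or hi2 < lo1:
            flagged += 1
    return flagged / runs


def smallest_nt(h1, h2, nx=8, ny=8, target=0.9, max_power=6, **kwargs):
    """Smallest power-of-two nt whose sensitivity exceeds target, else None."""
    for power in range(max_power + 1):
        nt = 2 ** power
        if sensitivity(h1, h2, nx, ny, nt, **kwargs) > target:
            return nt
    return None
```

The loop in `smallest_nt` mirrors the iterative selection idea: temporal aggregation starts low and is doubled until the change becomes detectable with the requested sensitivity.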

## 3. Results

#### 3.1. Simulated Data

[…] $H_{\mathrm{car}}$ and $H_{\mathrm{back}}$, respectively. An extensive set of image sequences was then generated to cover a range of $H_{\mathrm{car}}$ and $H_{\mathrm{back}}$. Next, the tracking-reconstruction algorithm of Section 2 was applied to each sequence, and the quality of the results assessed. Object transformations were estimated using the monomodal option of Matlab’s imregtform function, with the assumption of a similarity transformation, and the settings MaximumIterations = 300, MinimumStepLength = $5 \times 10^{-6}$, MaximumStepLength = $5 \times 10^{-3}$, RelaxationFactor = 0.9. Figure 5b plots the 2D correlation coefficient $R$ between the reconstructed and the still car, as a function of $H_{\mathrm{car}}$ and $H_{\mathrm{back}}$. Good correspondence is seen with the results of Figure 4, in that for combinations of $H_{\mathrm{car}}$ and $H_{\mathrm{back}}$ that are distinguishable according to said figure (given the cubicle size $\{N_x, N_y, N_t\}$ = {8, 8, 8} used here), good tracking is indeed obtained, leading to a reconstruction with $R$ > 0.9. Example images are given in Figure 6 for the case of $H_{\mathrm{car}}$ = 0.35 and $H_{\mathrm{back}}$ = 0.15. Figure 6a shows a single bit-plane; summing the sequence of bit-planes leads to the image in Figure 6b, where the car is unrecognizable due to motion blur. Running the algorithm leads to the reconstructed image of Figure 6c, which compares well to the image of the still car (Figure 6d), with $R$ = 0.943.
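The quality metric can be sketched as below, on the assumption that the 2D correlation coefficient $R$ follows the usual Pearson definition applied to all pixels of two equally sized images. The function name is illustrative.

```python
import math


def corr2(a, b):
    """Pearson correlation coefficient between two equally sized 2D images."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same number of pixels")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    # Covariance and variances, all unnormalized (the factors cancel in R).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    var_a = sum((x - mean_a) ** 2 for x in flat_a)
    var_b = sum((y - mean_b) ** 2 for y in flat_b)
    return cov / math.sqrt(var_a * var_b)
```

Because the means are subtracted and the result is normalized by both variances, $R$ is insensitive to a uniform gain or offset between the reconstructed and reference images, so it measures structural agreement rather than absolute brightness.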

#### 3.2. Fan and Car Sequence

#### 3.3. Table Tennis Ball

#### 3.4. Camera Shake Compensation

[…] $\{N_x, N_y, N_t\}$ = {1, 1, 250}, ensuring a suitably high level of bit-depth for the purposes of re-alignment (although the long sum in time invariably leads to motion blur, re-alignment is still possible). The compensated image frames shown in Figure 10 (panels (c) and (f)) are seen to be noticeably sharper than the uncompensated (test) frames (panels (b) and (e)). This is backed up by a clear increase in the indicated 2D correlation coefficients $R$, calculated with respect to the reference images (panels (a) and (d)) obtained with the camera still. $R$ is normalized using the correlation of image frames in still conditions to account for the inherent frame-to-frame variability due to photon shot noise.

## 4. Discussion

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix A

**Figure A1.** Results of plane test: (**a**) single raw bit-plane (100 μs); (**b**) sum of N = 250 bit-planes (still frame from video rate image sequence); (**c**) reconstructed image of plane (from N = 250 bit-planes); (**d**) image frame from final output sequence obtained using tracking scheme. A cubicle of $\{N_x, N_y, N_t\}$ = {8, 8, 32} was applied to create the test frames (not shown here), and difference frames were computed with d = 2.

## References

1. Fossum, E.R.; Ma, J.; Masoodian, S.; Anzagira, L.; Zizza, R. The quanta image sensor: Every photon counts. Sensors **2016**, 16, 1260.
2. Masoodian, S.; Ma, J.; Starkey, D.; Wang, T.J.; Yamashita, Y.; Fossum, E.R. Room temperature 1040fps, 1 megapixel photon-counting image sensor with 1.1 um pixel pitch. Proc. SPIE **2017**, 10212, 102120H.
3. Chen, B.; Perona, P. Vision without the Image. Sensors **2016**, 16, 484.
4. Chan, S.H.; Elgendy, O.A.; Wang, X. Images from Bits: Non-Iterative Image Reconstruction for Quanta Image Sensors. Sensors **2016**, 16, 1961.
5. Gyongy, I.; Dutton, N.; Parmesan, L.; Davies, A.; Saleeb, R.; Duncan, R.; Rickman, C.; Dalgarno, P.; Henderson, R.K. Bit-plane processing techniques for low-light, high speed imaging with a SPAD-based QIS. In Proceedings of the 2015 International Image Sensor Workshop, Vaals, The Netherlands, 8–11 June 2015.
6. Gyongy, I.; Davies, A.; Dutton, N.A.; Duncan, R.R.; Rickman, C.; Henderson, R.K.; Dalgarno, P.A. Smart-aggregation imaging for single molecule localisation with SPAD cameras. Sci. Rep. **2016**, 6, 37349.
7. Fossum, E.R. Modeling the performance of single-bit and multi-bit quanta image sensors. IEEE J. Electron Devices Soc. **2013**, 1, 166–174.
8. Elgendy, O.A.; Chan, S.H. Optimal Threshold Design for Quanta Image Sensor. arXiv **2017**, arXiv:10.1109/TCI.2017.2781185.
9. Bascle, B.; Blake, A.; Zisserman, A. Motion deblurring and super-resolution from an image sequence. In Proceedings of the 4th ECCV ’96 European Conference on Computer Vision, Cambridge, UK, 15–18 April 1996; pp. 571–582.
10. Nayar, S.K.; Ben-Ezra, M. Motion-based motion deblurring. IEEE Trans. Pattern Anal. Mach. Intell. **2004**, 26, 689–698.
11. Lucas, B.D.; Kanade, T. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the 1981 DARPA Image Understanding Workshop, Washington, DC, USA, 23 April 1981; pp. 121–130.
12. Barron, J.L.; Fleet, D.J.; Beauchemin, S.S. Performance of optical flow techniques. Int. J. Comput. Vis. **1994**, 12, 43–77.
13. Aull, B. Geiger-Mode Avalanche Photodiode Arrays Integrated to All-Digital CMOS Circuits. Sensors **2016**, 16, 495.
14. La Rosa, F.; Virzì, M.C.; Bonaccorso, F.; Branciforte, M. Optical Image Stabilization (OIS). Available online: www.st.com/resource/en/white_paper/ois_white_paper.pdf (accessed on 31 October 2017).
15. Gyongy, I.; Al Abbas, T.; Dutton, N.A.; Henderson, R.K. Object Tracking and Reconstruction with a Quanta Image Sensor. In Proceedings of the 2017 International Image Sensor Workshop, Hiroshima, Japan, 30 May–2 June 2017; pp. 242–245.
16. Agresti, A.; Coull, B.A. Approximate is better than “exact” for interval estimation of binomial proportions. Am. Stat. **1998**, 52, 119–126.
17. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231.
18. Imregtform-Mathworks. Available online: https://uk.mathworks.com/help/images/ref/imregtform.html (accessed on 31 October 2017).
19. Myler, H.R. Fundamentals of Machine Vision; SPIE Press: Bellingham, WA, USA, 1999; p. 87.
20. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. **2011**, 33, 898–916.
21. Chan, T.F.; Vese, L.A. Active contours without edges. IEEE Trans. Image Process. **2001**, 10, 266–277.
22. Dutton, N.A.; Parmesan, L.; Holmes, A.J.; Grant, L.A.; Henderson, R.K. 320 × 240 oversampled digital single photon counting image sensor. In Proceedings of the 2014 Symposium on VLSI Circuits Digest of Technical Papers, Honolulu, HI, USA, 10–13 June 2014.
23. DBSCAN Algorithm-Yarpiz. Available online: http://yarpiz.com/255/ypml110-dbscan-clustering (accessed on 31 October 2017).
24. Hseih, B.C.; Khawam, S.; Ioannis, N.; Muir, M.; Le, K.; Siddiqui, H.; Goma, S.; Lin, R.J.; Chang, C.H.; Liu, C.; et al. A 3D Stacked Programmable Image Processing Engine in a 40 nm Logic Process with a Detector Array in a 45nm CMOS Image Sensor Technologies. In Proceedings of the 2017 International Image Sensor Workshop, Hiroshima, Japan, 30 May–2 June 2017; pp. 4–7.
25. Nose, A.; Yamazaki, T.; Katayama, H.; Uehara, S.; Kobayashi, M.; Shida, S.; Odahara, M.; Takamiya, K.; Hisamatsu, Y.; Matsumoto, S.; et al. A 1ms High-Speed Vision Chip with 3D-Stacked 140GOPS Column-Parallel PEs for Diverse Sensing Applications. In Proceedings of the 2017 International Image Sensor Workshop, Hiroshima, Japan, 30 May–2 June 2017; pp. 360–363.
26. Takahashi, T.; Kaji, Y.; Tsukuda, Y.; Futami, S.; Hanzawa, K.; Yamauchi, T.; Wong, P.W.; Brady, F.; Holden, P.; Ayers, T.; et al. A 4.1 Mpix 280fps stacked CMOS image sensor with array-parallel ADC architecture for region control. In Proceedings of the 2017 Symposium on VLSI Circuits, Kyoto, Japan, 5–8 June 2017; pp. C244–C245.
27. Masoodian, S.; Ma, J.; Starkey, D.; Yamashita, Y.; Fossum, E.R. A 1Mjot 1040fps 0.22 e-rms Stacked BSI Quanta Image Sensor with Cluster-Parallel Readout. In Proceedings of the 2017 International Image Sensor Workshop, Hiroshima, Japan, 30 May–2 June 2017; pp. 230–233.
28. Inside iPhone 8: Apple’s A11 Bionic Introduces 5 New Custom Silicon Engines. Available online: http://appleinsider.com/articles/17/09/23/inside-iphone-8-apples-a11-bionic-introduces-5-new-custom-silicon-engines (accessed on 31 October 2017).

**Figure 1.** Synthetic Quanta Image Sensor (QIS) images of a car, with a maximum photon rate of 0.2 photons/pixel/bit-plane, typical of low light conditions: (**a**) Single bit-plane exposure (10 µs); (**b**) Sum of 50 bit-plane exposures of a car moving at 300 km/h; (**c**) Image (b) after motion deblurring using the Wiener filter (deconvwnr function in Matlab, with the assumption of a noise-power-to-signal-power ratio of 0.05); (**d**) Sum of 50 bit-planes for static car.

**Figure 2.** Illustration of the clustering/tracking approach: computing difference frames from the test frames gives clouds of points indicating moving objects. These clouds are then clustered and bounding boxes are established. The estimated transformations between corresponding point clouds on successive difference frames give the trajectory of the relevant object.

**Figure 3.** Block diagram indicating the steps in the tracking-reconstruction technique. The input is a sequence of bit-planes $L_i$ capturing a high-speed object. The scheme tracks the motion of this object (as defined by $T_i$) and outputs a higher bit-depth image sequence, $G_i$.

**Figure 4.** Heat map of the temporal aggregation $N_t$ required (in terms of powers of two) to detect a change of photon rate from $H_1$ to $H_2$ with >90% sensitivity. The photon rates are specified in terms of the mean detected photons per jot over a sub-exposure (bit-plane) and are presented on logarithmic scales. A spatial aggregation of 8 × 8 is assumed, with each data point on the heat map being obtained using Monte Carlo simulations (under the assumption of Poisson statistics) by generating 10,000 pairs of aggregated pixel realizations (from $H_1$ and $H_2$) at different levels of $N_t$, and applying Equation (2).

**Figure 5.** Summarized results from synthetic data set: (**a**) Two-dimensional (2D) car model (mask) used to generate data set. The car moves along a circular arc, with an angle to the vertical of $0.002\,i^2 + 0.2\,i$ and a radius of r = 250 jots (where $i$ is the index of the bit-plane), and grows in size at a rate of 0.5% per bit-plane. Shown are the initial (i = 1) and final (i = 72) positions of the car; (**b**) Heat map of the 2D correlation coefficient $R$ between the reconstructed and still car. The coefficient $R$ is calculated by taking the noise-free image sequence (as in panel a), and calculating the correlation between the image of the car (at i = 1) and the sum of the re-aligned images (i = 1, ..., 72), based on the trajectory extracted from the synthesized (i.e., randomized with photon noise) bit-plane sequence. Test cases where tracking was not obtained are indicated by $R$ = 0. The tracking-reconstruction scheme was run with the following parameters: $\{N_x, N_y, N_t\}$ = {8, 8, 8} to create the test frames, d = 1 for producing the difference frames.

**Figure 6.** Results for synthetic data with $H_{\mathrm{car}}$ = 0.35 and $H_{\mathrm{back}}$ = 0.15: (**a**) Single bit-plane (i = 1); (**b**) Sum of bit-planes (i = 1, …, 72) of moving car; (**c**) Reconstructed image of moving car (R = 0.926); (**d**) Sum of bit-planes for still car. All images show a region of interest of 230 × 90 from the full 400 × 240 array size.

**Figure 7.** Results for synthetic data with both car and background featuring identical dotted patterns with photon rates $H_1$ = 0.35 (in dot) and $H_2$ = 0.05 (elsewhere): (**a**) Bit-plane at time i = 1; (**b**) Bit-plane at time i = 36; (**c**) Reconstructed vehicle (shown over a region of interest of 230 × 90). The quality of the reconstruction is R = 0.960.

**Figure 8.** Results of fan and car test: (**a**) Single bit-plane exposure (100 μs); (**b**) Sum of N = 250 bit-planes (still frame from video rate image sequence, hot pixel compensation has been applied); (**c**) Test frame created by aggregating bit-planes using a kernel of size $\{N_x, N_y, N_t\}$ = {8, 8, 16}; (**d**) Difference frame computed from two test frames (d = 2); (**e**) Result of DBSCAN clustering using [23]; (**f**) Reconstructed images of car and fan (from N = 250 bit-planes); (**g**) Reconstructed background with object trajectories, over one rotation of the fan, overlaid (each division representing 16 bit-planes = 1.6 ms); (**h**) Frame from final output sequence. We note the much-improved sharpness compared with image (**b**).

**Figure 9.** Results of time-gated imaging of fast-falling (≈10 m/s) table tennis ball, in terms of consecutive bit-plane exposures (global shutter, 100 μs), obtained using (**a**) a long time gate; (**b**) a short time gate; and (**c**) an image of the table tennis ball as reconstructed from 32 short gate bit-planes through the application of the tracking scheme. A Nikon f/1.4 50 mm objective was used.

**Figure 10.** Results of camera shake test, in terms of video-rate image frames (each composed from N = 250 bit-planes): (**a**) truck, still camera; (**b**) truck, shaken camera (R = 0.79); (**c**) compensated version of image (b) (R = 0.99); (**d**) skyline, still camera; (**e**) skyline, shaken camera (R = 0.75); (**f**) compensated version of image (e) (R = 0.96). R is the 2D correlation coefficient with respect to the image from the still camera (obtained with a tripod).

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Gyongy, I.; Dutton, N.A.W.; Henderson, R.K.
Single-Photon Tracking for High-Speed Vision. *Sensors* **2018**, *18*, 323.
https://doi.org/10.3390/s18020323
