Time-Domain Fluorescence Lifetime Imaging Techniques Suitable for Solid-State Imaging Sensor Arrays

We have successfully demonstrated video-rate CMOS single-photon avalanche diode (SPAD)-based cameras for fluorescence lifetime imaging microscopy (FLIM) by applying innovative FLIM algorithms. We also review and compare several time-domain techniques and solid-state FLIM systems, and adapt the proposed algorithms for massive CMOS SPAD-based arrays and hardware implementations. The theoretical error equations are derived and their performances are demonstrated on the data obtained from 0.13 μm CMOS SPAD arrays and the multiple-decay data obtained from scanning PMT systems. In vivo two photon fluorescence lifetime imaging data of FITC-albumin labeled vasculature of a P22 rat carcinosarcoma (BD9 rat window chamber) are used to test how different algorithms perform on bi-decay data. The proposed techniques are capable of producing lifetime images with enough contrast.

applications such as single molecule detection [23,24], fluorescence lifetime measurements [13,20], optical range finding, optical fiber fault detection, and portable explosives sensing [25]. Recent developments of CMOS SPADs have shown significant improvements in the dead time [26], dark count [27], technology migration to advanced process [28] and pixel miniaturization [29], and quantum efficiency in the longer wavelength region [30]. It is expected high resolution CMOS SPAD arrays for ranging applications [31] will soon be applied to FLIM applications.
Another bottleneck for high-speed lifetime imaging is lifetime calculation. Common FLIM systems usually use iterative linear or non-linear least square methods (LSM), such as Marquardt-Levenberg algorithms, to extract the lifetimes. Although this approach is accurate and suitable for analyzing multi-exponential decays, it is computationally time consuming which makes it unsuitable for real-time applications. It is desirable, therefore, to develop non-iterative simple algorithms to speed up the lifetime calculations while maintaining enough imaging quality. Compared with the LSM, iterative-free gating methods only require two time bins for single-exponential decays [9,13,21], four bins for bi-exponential decays [32,33] or eight bins for multi-exponential decays [34]. The hardware complexity is greatly reduced and the speed is much higher. There are different acquisition schemes for the gating methods. Figure 1(a) shows the traditional sequential acquisition in a pixel, where at least two sub-images are recorded sequentially at different delayed windows with respect to the excited laser pulses to extract the lifetime. The block 'counter' can contain front-end circuits, analog-to-digital converters and accumulators in conventional imaging systems or simply inverters and digital buffers in the latest CMOS SPAD systems. Chang and Mycek applied four time-gates to analyze single-exponential decay data [35]. This approach is slow and sensitive to motion artifacts unless the samples are stationary, and the recorded sub-images are uncorrelated. If a full fluorescence emission histogram is needed for detailed examinations, it will take a significant amount of time to record a large number of sub-images with different delay times [36]. Unlike the sequential acquisition scheme, some researchers applied 'single-shot' acquisition methods by applying multi-channel segmented gated optical intensifiers (GOI) [37] or focusing optically delayed images on different sections of the detector [38,39] to collect sub-images simultaneously. Video-rate FLIM has been demonstrated in GOI FLIM systems. This trades off spatial resolution for imaging speed. Similar to the sequential acquisition scheme, the detectors enabled by the gating signals only work 'part-time'. The background level for the two detectors could be different, and the two recorded events are independent (uncorrelated). Figure 1(c) shows a different acquisition scheme from above. The detector works fully, but with several gated counters collecting photons in different timing windows. The gated counter accumulates only when the leading edge of the detector output is inside its enable period. The gating signals for controlling the counters can be non-overlapping [36][37][38][39][40] or overlapping [41]. Gerritsen et al. applied several non-overlapping gated counters to confocal scanning systems [40]. The system can deal with multiple-decay data. The overlapping gating acquisition, denoted as ORLD, was first proposed by Chan et al. [41] and it was proven to be efficient by running Monte Carlo simulations, but the authors only suggested splitting the PMT signal and sending to two gated counters without actually carrying out the experiments probably because of the complexity and the cost of reconfiguring the existing acquisition systems. Later, Elangovan et al. proposed an overlapping multi-gated algorithm to analyze double-exponential decays and applied Monte Carlo simulations to find suitable experiment settings [32]. For the latest developed CMOS SPAD arrays, however, the ORLD is a good option considering the system complexity and photon efficiency (the photon events collected in the gated counters are correlated if the overlapped proportion is significant). We will for the first time derive the theoretical error equations and test the ORLD methods on the data obtained from a 0.13 μm CMOS SPAD array.
Stoppa's research team combined the acquisition schemes (b) and (c) by binning 4 and 25 CMOS SPADs, respectively, in a 'super' pixel and using gated counters to collect the photon events [11,12] as in Figure 1(d). Similar to the second gating scheme, this technique trades the spatial resolution for imaging speed. It provides faster acquisition in some sensing applications such as high throughput screening. They used non-overlapping time gates to collect the photons and used nonlinear LSM to extract the lifetimes. Figure 1(e) shows a pixel with time-correlated single-photon counting (TCSPC) circuitry. The detector works fully, with a time-to-digital converter (TDC) to record the time tag of each photon. Compared with the hardware for time-gating acquisition, the complexity of the TCSPC systems is much higher. The latest commercial multi-channel systems only contain several PMTs and still require scanning. Recently we have demonstrated a 1024-channel, a 20 k-channel and a 100 Mphoton/s multi-channel FLIM systems with 0.13 μm CMOS SPAD plus in-pixel TDC (10-bit TCSPC) arrays [14][15][16]20]. The parallelism of the proposed systems can be fully exploited by implementing lifetime calculations in hardware [42,43]. Similar to the above-mentioned 'single-shot' systems, the imaging speed of the proposed cameras can operate over 100 f/s.

Time-Domain FLIM Data Analysis
For simplicity, we consider a single-exponential decay with a negligible impulse response function (IRF). Such assumptions allow a proper comparison of various fitting algorithms. Moreover, a single-exponential decay model is still enough to contrast different types of fluorophores. We will discuss the accuracy of FLIM algorithms and compare the performance of various algorithms on bi-exponential decays. Assume that the fluorescence decay function f(t) = Aexp(-t/τ)u(t) with τ being the lifetime and u(t) the step function.

Gating Techniques
Gating algorithms have been widely used in real-time FLIM systems owing to their simplicity and compactness in hardware implementations. The simplest 2-gate RLD (RLD-2) only uses two time gates of the same width. We adopt the F value introduced in [44] to quantify the performance of a FLIM algorithm. The F value is defined as F = N c derived from [21] and [44] as: where x = exp(-h/τ) with h being the gate width. The two gates can be generalized to be overlapping with unequal widths as shown in Figure 2, denoted as GRLD hereafter, where N 1 (= N 1" + N ov ) and N 2 (= N ov + N 2" ) are the total counts of the first (0 < t < h) and second (Sh < t < Rh) gates respectively. Chan et al. [41] used Monte Carlo simulations to get an optimized setting (S = 0.25, R = 12.25) and mentioned that the PMT signal could be split and sent to two gated integrators, but the experiments were not actually carried out. Their "optimized setting" were not based on the theoretical derivations, and we will present theoretically optimized settings that are different from their suggestions. We will also compare its performance with other time-domain algorithms. From the discussions in Section 1, the GRLD can be applied to different gating schemes shown in Figure 1(a-d). We divide these acquisition schemes into two categories. The first category is uncorrelated, where the overlapping parts are uncorrelated, denoted as UGRLD hereafter, because they are recorded in different snapshots or in different gated detectors. The photon events N 1 and N 2 are mutually independent and therefore uncorrelated. The second category is correlated, denoted as CGRLD hereafter, because the overlapping counts, N ov , collected by different gated integrators are from the same photon events. The F-values for the UGRLD and CGRLD can be derived directly from Equation (A5) in the Appendix (using similar derivation methods and notations with [45]) as: Equation (2) allows us to determine an optimized setting without resorting to Monte Carlo simulations. Moreover, setting g(x) = 0 in Equation (A2), we can obtain: where both S −1 and (R-S) are integers. It is easy to build a look-up table for Equation (3) on FPGAs and make the calculations much faster as in [15]. Some researchers used mult-gate RLD (RLD-M) to calculate the lifetimes [35,36,42]: where N j is the count number in the jth time bin (t j < t < t j+1 ), j = 0, …, M-1. Similar to Equation (A2), an average lifetime is obtained from Equation (4) if the fluorescent emission is a multi-exponential decay. The theoretical error equation, Equation (4), only exists when the bins are equally spaced, and it was for the first time derived in [42] considering also the impact of timing jitters and later derived in [35] for a special case (M = 4) incorporating filtering techniques. Neglecting the impact of the TDC jitters, the F-value can be re-written from Equation (25) in [42] as: where h is the bin width. Figure 3(a,b) shows the F-value curves in terms of the ratio of the lifetime over the measurement window for the previously reported and our center-of-mass method (CMM) [15], integration for lifetime extraction method (IEM) [42], Equations (1), (2) and (5). The 256-gate maximum likelihood estimation (MLE) offers the best precision with a very wide optimized window. RLD-2 has its best performance at T ~ 4.8τ (or h = 2.4τ) [21], and it usually requires to know the lifetime range of the samples in advance for tuning a proper gate width. Theoretical (solid lines) and Monte Carlo simulation (marked with circles or crosses) results of the UGRLD and CGRLD show good agreement with Equation (2). For extracting single-decay lifetimes, the performance of RLD-M is only better than the other gating techniques in the region T ~τ and increasing M over eight does not improve the precision any further. This conclusion contradicts the comment stating that the RLD-M is more precise than RLD-2 made in [35]. In their experiments, they applied four time gates on single-exponential decays. The advantage of using the multiple time gate technique, however, is its capability to image multi-exponential decays [11,34] using linear or nonlinear LSMs or to image bi-exponential decays using the fast algorithm proposed by Sharman and Periasamy [32,33]. The performance of UGRLD is worse than that of CGRLD. For the same acquisition time, the gated integrator acquisition schemes shown in Figure 1(c,d) provide a better precision than those in Figure 1(a,b). Chan et al. suggested an optimized setting (S = 0.25, R = 12.25) [41], where the best resolving window is roughly from T ~ 12 to 100τ with the duty cycle of their lifetime measurements being comparatively low. Figure 3(c,d) show two F-value plots with (R-S) fixed and with S fixed respectively. For R = S + 3, a smaller S results in a wider optimized window allowing to resolve smaller lifetimes, but the minimum F value occurs at about S = 0.5. It is obvious that a bigger R with a proper S (around 0.2 to 0.7 considering the lifetime resolvability range) allows resolving lifetimes much less than the measurement window. The lower bound of the resolvable lifetime range, however, is limited by the full width at half maximum of the IRF, so choosing a much bigger R is not realistic. Instead of using the setting suggested by Chan et al., we suggest an optimized setting (S = 0.2 and R = 3.2) from Equation (2) to provide a comparable performance with RLD-2 in T < 5τ while maintaining an enough resolvability up to T ~ 50τ. For a laser repetition rate of 80 MHz, the minimum resolvable lifetime is about 200 ~ 300 ps, in the same order of the timing jitters of the latest developed 0.13 μm CMOS SPADs [22]. The minimum F-value of the CGRLD is about 1.2 and its simplicity in hardware implementation is promising for massive sensor arrays. In Figure 3(b), the IEM-7 (7-gate IEM) [42] and CMM-256 (256-gate CMM) [15] are non-iterative algorithms suitable for TCSPC systems.

Non-Iterative Algorithms Suitable for TCSPC Systems
Although the gating methods introduced above can provide fast lifetime preview and are easy to implement, scientists still prefer to keep the raw timing data for detailed examinations on what really happens in the samples. For gating methods, recording raw timing data is not easy. To obtain a full histogram, a large number of snap shots at different time delays are acquired sequentially. It is very inefficient. To increase the raw data collecting speed, the number of gated integrators can be increased as in [34], but at a cost of system complexity. To enhance the raw data collection rate, we incorporated on-chip high-resolution TDCs into the CMOS SPAD arrays and built 1024-channel and 20k-channel "miniaturized PMT+TCSPC" prototypes [16]. The prototypes can operate in the RAW MODE for collecting the raw data and in the COARSE MODE for high-speed lifetime imaging preview. Due to the slow speed of the widely used iterative based least square methods, several time-domain FLIM algorithms suitable for TCSPC systems (Figure 1(e)) have been proposed to improve the imaging speed [42,43]. We proposed an innovative background-insensitive algorithm suitable for our SPAD prototypes, called IEM. The calculated lifetimes are: where C = [1/3, 4/3, 2/3, …, 4/3, 1/3] from the Simpson's integration rule [42], h is the TDC bin width, and N j is the count number in the jth time bin. The precision and accuracy are detailed in [42]. Compared with commonly used RLD-2 and MLE, IEM is much less sensitive to the background noise [46] Moreover, the lifetime calculations are very simple and suitable for on-FPGA or on-chip implementations. On-FPGA real-time wide-field lifetime imaging has been successfully demonstrated on microfluidic mixing [14]. Figure 3(b) also shows an F-value curve for the 7-gate IEM with a total count of 1024 (IEM is a biased estimator and the quantization error which comes to effect at τ < 0.1T (for M = 7) is included in the F-value curve as in [42]. Without considering the impact of the background noise, the error performance of IEM-7 is similar to RLD-2. The effect of increasing M in IEM is similar to increasing R in GRLD; both cause their F-value curves to shift towards the smaller τ/T region. To further enhance the lifetime imaging speed on our SPAD prototypes, we proposed another hardware implementation algorithm based on center-of-mass methods. The calculated lifetime defined in [43] is re-written as: where i D is the 10-bit TDC output of the ith captured photon and Ω(·) is a 1D look-up table for calibrating the error term in Equation (7). Its precision and accuracy Equations were derived in [15], and its F-value curve (shown in Figure 3(b)) considering the accuracy term for 256-gate CMM with N c = 1,024 showing great photon effectiveness with a wide optimization range. We applied Equation (7) to achieve high-speed lifetime sensing [20,43] and imaging [15]. The lifetime calculations are mostly carried out on FPGAs. To demonstrate its performance, a widefield microscope was adapted to accommodate the SPAD array. A dilute solution of 10 µm fluorescent beads (G1000, Duke Scientific, Palo Alto, CA, USA) was used to simulate cells flowing through the system. The fluorescent emission was captured by the SPAD camera and the delay of every captured photon with respect to the laser pulses was calculated by the in-pixel TDC. The 10-bit CMM lifetime codes were generated from the FPGA through Equation (8), and the lifetimes were easily calculated with the characterization data of the TDC array. Figure 4(a) shows CMM codes at a video frame rate of 50 fps, and Figure 4(b) shows a single frame of the estimated lifetime video. The size of the bead is about 10 μm, and the speed of beads is 200 μm/s. The experimental settings are optimized for the TDC sub-array with the resolution h = 78 ps, Rows 1−16 [22], and the lifetime codes obtained from the other part (h = 160 ps, Rows 17−32) are normalized. During one frame time the bead moves approximately 4 μm, 40% of its width, producing non-trivial motion blur. The average lifetime is around 2 ns, in a good agreement with the lifetime obtained by using the burst integrated fluorescence lifetime (BIFL) measurement with a Becker & Hickl card (SPC850). The microscope was based on a Nikon Eclipse Ti-E inverted microscope platform. Illumination was via a polarization maintaining fibre to the Nikon TIRF attachment using a 4 W supercontinuum laser (Fianium SC400) to provide picosecond laser pulses at 20 MHz. Excitation wavelengths are selected using bandpass filters prior to launching into the fibre. In this instance a 488 nm bandpass filter was used (470 ± 22 nm, Chroma Inc, Bellows Falls, VT, USA) unless stated otherwise. Fluorescence lifetime images were acquired with a Nikon 40× Plan Fluor Objective (0.75 NA) using the 0.13 μm CMOS 32 × 32 SPAD+TDC chip [22] arranged in the image plane of the microscope side port. The microfluidic system consisted primarily of a Mitos T-Junction Chip (100 µm channel) connected via 1/16" Teflon tubing to a Mitos P-pump (The Dolomite Centre, Royston, UK). The instrument response function (IRF) was measured and found to have a FWHM of 0.6 ns. The analog version of Equation (7), denoted as analog mean-delay (AMD) hereafter, has been reported earlier using stand-alone components for lifetime sensing applications [47]. Recently, an AMD instrumentation considering the impact of IRFs and limited measurement window problems has been carried out almost at the same time with our digital CMM by Kim [48,49]. Their algorithms can work efficiently with low-cost single-channel PMTs, and it has been demonstrated in a confocal scanning microscope [50]. Padilla-Parra et al. also used Equation (7) to calculate the minimal fraction of interacting donor (MFD) in florescence resonance energy transfer (FRET) experiments [51,52]. The measurement window of the MFD, however, should be much larger than the largest lifetime component; otherwise the contrast capability would be lost, similar to the problems revealed in [15] and [49]. By automatically calibrating such problems, the MFD can be easily applied to our SPAD systems for FRET applications as well. Recently, a Bayesian FLIM analysis method was proposed, and it outperformed the commonly used MLE and LSM especially on noisy single-exponential data sets [53,54]. This algorithm is promising especially in single-cell cytometry. The Bayesian methods for multi-exponential decays are under development.

Error Performance of FLIM Algorithms on Data Collected by CMOS SPAD Arrays
To compare the error performance of different FLIM algorithms, widefield lifetime images of a uniform aqueous solution of Rhodamine B were taken in different acquisition time. The imaging system was set up on a Nikon TE2000U inverted microscope. The excitation source was a PicoQuant pulsed diode laser with a wavelength of 470 nm coupled through the epi-fluorescence port of the microscope using a Nikon B-2A filter cube. The laser pulse rate is 20 MHz and the maximum power reaching the back focal plane of the objective is about 90 µW. The sample was imaged onto the 32 × 32 SPAD camera directly attached to one of the camera ports using a 20× objective (Nikon Plan Apo, 20×, NA 0.45). The TDC resolution is set to be 78 ps with the on-chip phase-locked loop enabled [22]. Two measurement windows of Mh ~ 4.1τ and Mh ~ 17τ were chosen to calculate lifetimes. Figure 5(a,b) show the precision curves in terms of photon counts for different algorithms. The ideal curve is the theoretical precision of MLE (or CMM) without considering system non-idealities. CMM and MLE behave almost the same in both plots, and they are about 0.5 ~ 1.5 dB away from the ideal curve. The first window Mh/τ ~ 4.1 is optimized for RLD-2, where RLD-2, IEM, and CGRLD (S/R = 0.2/3.2) work equally well. Figure 5(b) shows the reason why we need a fast FLIM algorithm other than RLD-2. The CGRLD with S/R = 0.25/12.25 still performs worse than the CGRLD with S/R = 0.2/3.2 making it inefficient in photon collecting due to low duty cycle operations (optimized window > 17τ). The conclusions drawn from Figure 5(a,b) are in good agreement with the theoretical Figure 3(a,d).

Contrast Capability of Algorithms on Bi-Exponential Decays
Assume a bi-exponential decay f(t) = A 1 exp(-t/τ 1 ) + A 2 exp(-t/τ 2 ) with τ 1 , τ 2 being the lifetimes and A 1 , A 2 being the pre-scalars. Although the above FLIM algorithms are for single-exponential data, it is worth comparing the average lifetimes produced for bi-decay data: The average lifetimes defined by IEM [42], RLD-2 [21], CMM [15], and CGRLD are: , ,  T  T  T   T  T  T  T  T   T   T  T  T where x 1 = exp(-h/τ 1 ), x 2 = exp(-h/τ 2 ), Ω(·) is the 1-D look up table described in [15], and h = T/M for IEM/CMM and h = T/(S +R) for CGRLD. From Equation (10), the average lifetime obtained by IEM will be almost the same with the commonly used Equation (9) when M >> 1. The conclusion will be the same for CGRLD for R >> 1. Figure 6 shows the average lifetimes versus A 1 /(A 1 + A 2 ) obtained by Equations (  To demonstrate how the above-mentioned single-exponential FLIM algorithms perform on bi-exponential data, we apply them to lifetime imaging data obtained by multi-photon FLIM. The experimental procedure is given in full elsewhere [55]. Briefly, early generation transplants of the P22 rat carcinosarcoma were grown in transparent window chambers, surgically implanted into the fascial layer of a dorsal skin flap of male BD9 rats [56,57]. For imaging of the blood vessel architecture, the molecular probe FITC conjugated Bovine serum albumin was injected intravenously as a blood poolcontrast agent. After 100 minutes the FITC-BSA was observed to diffuse from the vasculature and the FLIM image taken. The intensity image in Figure 7(a) shows very little contrast between the vessels. and the tumor bulk. Compared with the intensity image, the lifetime images Figure 7(b-f) obtained by 150-bin CMM, 51-bin IEM, Equation (9), RLD-2, and CGRLD show much better contrast.   concentrate here on the derivation of σ x . With the variations σN j in the recorded counts N j (j = 1, 2), the calculated x will contain a variation σ x with respective to its mean value x accordingly, ˆx x x σ = + .