MATRIX16: A 16-Channel Low-Power TDC ASIC with 8 ps Time Resolution

Abstract: This paper presents a highly configurable 16-channel TDC ASIC designed in a commercial 180 nm technology with the following features: time-of-flight and time-over-threshold measurements, 8.6 ps LSB, 7.7 ps jitter, 5.6 ps linearity error, up to 5 MHz of sustained input rate per channel, 9.1 mW of power consumption per channel, and an area of 4.57 mm². The main contributions of this work are the novel design of the clock interpolation circuitry based on a resistive interpolation mesh circuit and the capability to operate at different supply voltages and operating frequencies, thus providing a compromise between TDC resolution and power consumption.


Introduction
Time-of-Flight (ToF) measurement is one of the major challenges in high-energy physics experiments [1], medical imaging [2], mass spectrometry [3], and Laser Imaging Detection and Ranging (LiDAR) [4], among others. Precise timing measurements allow computing the distance that a particle traveled and thus identifying tracks, performing coincidence measurements, or determining the distance to objects. On the other hand, Time-over-Threshold (ToT) provides the pulse width information, which has many applications: measuring the deposited energy of the detected particles [5] or applying time-walk corrections [6], among others.
Our research group has been working for years on fast-timing ASIC designs for Positron Emission Tomography (PET) applications [7,8]. The HR-FlexToT ASIC provides very good timing performance: 60 ps Single-Photon Time Resolution (SPTR) (using a Hamamatsu S13360-3050 MPPC: 3 × 3 mm², 50 µm cell) and low power consumption (<3.5 mW/ch) [8]. The outputs of this chip are Continuous-Time Binary-Valued (CTBV), so external equipment is required to perform fine timing measurements. The objective of the MATRIX16 ASIC is to digitize these outputs with the lowest power consumption possible (<10 mW per channel), to minimize scalability issues when building large PET systems with thousands of channels. Assuming that modern SiPMs offer a timing resolution better than 100 ps [9], the TDC resolution should be better than 20 ps so as not to degrade the timing performance substantially.
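The 20 ps target can be checked with a short calculation. The sketch below assumes that the detector and TDC contributions are independent and therefore add in quadrature; this model and the function names are illustrative assumptions, not taken from the cited works.

```python
import math


def combined_resolution_ps(detector_ps: float, tdc_ps: float) -> float:
    """Total timing resolution, assuming independent contributions
    that add in quadrature (an assumption of this sketch)."""
    return math.hypot(detector_ps, tdc_ps)


def degradation_ps(detector_ps: float, tdc_ps: float) -> float:
    """Resolution lost with respect to the detector alone."""
    return combined_resolution_ps(detector_ps, tdc_ps) - detector_ps


if __name__ == "__main__":
    # 100 ps detector resolution read out by a 20 ps TDC
    total = combined_resolution_ps(100.0, 20.0)
    added = degradation_ps(100.0, 20.0)
    print(f"total = {total:.1f} ps, degradation = {added:.1f} ps")
```

Under this assumption, a 20 ps TDC adds about 2 ps to a 100 ps detector, that is, a degradation of roughly 2%.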

TDC Working Principle
A TDC is a device that converts the arrival time of a binary input pulse event into its digital representation. In ToF applications, the internal TDC counters start counting synchronously, and the rising (or falling) edge of the incoming pulse latches the internal counter value (absolute time measurement). In ToF+ToT applications, both rising and falling edges are captured, so that the pulse width can also be computed. In start/stop TDCs, a time interval between two events is measured (relative time measurement).
Time digitization is typically performed by two counter levels: coarse and fine. The coarse counter counts the number of periods of the system clock, and the number of bits of this counter determines the dynamic range of the TDC. The resolution is typically in the ns range. The fine counter stage interpolates the system clock, and therefore, the resolution is scaled down to the picosecond range. This second level is one of the most critical parts of the design, and there are many ways to implement it, depending on the application requirements, technology, cost, scalability, etc.
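As a minimal sketch of the two-level scheme, the functions below combine a coarse and a fine code into a timestamp and derive the pulse width from two timestamps. The function names, the wraparound handling, and the example values are assumptions made for illustration; they do not describe the encoding of any particular chip.

```python
def timestamp_ps(coarse: int, fine: int,
                 clock_period_ps: float, fine_lsb_ps: float) -> float:
    """Combine the coarse count (whole clock periods) and the fine
    count (interpolated phase) into one timestamp in picoseconds."""
    return coarse * clock_period_ps + fine * fine_lsb_ps


def pulse_width_ps(rise_ps: float, fall_ps: float,
                   coarse_bits: int, clock_period_ps: float) -> float:
    """Time-over-threshold from two timestamps, allowing for a single
    rollover of the coarse counter between both edges."""
    dynamic_range_ps = (1 << coarse_bits) * clock_period_ps
    return (fall_ps - rise_ps) % dynamic_range_ps


if __name__ == "__main__":
    # Hypothetical values: 1250 ps clock (800 MHz), 10 ps fine LSB
    rise = timestamp_ps(3, 25, 1250.0, 10.0)
    fall = timestamp_ps(5, 10, 1250.0, 10.0)
    print(rise, fall, pulse_width_ps(rise, fall, 10, 1250.0))
```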

State-of-the-Art TDCs
Currently, there are two trends in TDC designs: FPGA based and ASIC based. FPGA TDCs use the fastest delay element in the device (typically the carry logic circuitry) as a Tapped Delay Line (TDL), while ASIC TDCs can be customized for a given purpose.
In [10], the main contribution of the author was a bin realignment method and a dual-sampling method of a TDL implemented on an FPGA (two channels), aiming to reach the limit of Xilinx UltraScale FPGA delay granularity. The achieved resolution was 3.9 ps, and the dead time was only 4 ns. In [11], the authors proposed using the FPGA routing resources (1024 paths) as delay elements instead of using the traditional TDL method, achieving a 7.4 ps time bin, a DNL of 0.74 LSB, an INL of 1.57 LSB, and 0.92 LSB of jitter. The reported power consumption was 23 mW (single channel). Another alternative to reduce the bin size in FPGAs (as well as in ASICs) is to combine the information from multiple TDLs, leading to a stochastic TDC [12,13]. In this technique, the bin size scales down with √N_TDL, while the power consumption scales almost linearly with N_TDL. In [12], a TDC bin size of 1.15 ps (N_TDL = 20) and a 3.5 ps single-shot precision were achieved. Moreover, the author proposed a temperature offset cancellation to compensate for bin size variations caused by temperature drifts. Lastly, it is important to remark that, of the cited FPGA TDC works, only [11] reported the power consumption, which suggests that this feature is not competitive on FPGAs.
The most common implementations of the fine interpolation stage in ASIC-based TDCs can be divided into three groups:
• Flash: This consists of a clock delay line where each stage is sampled by a flip-flop controlled by the input hit edge. Flip-flop outputs are then encoded into a binary counter. The number of stages must be enough to cover, at least, a half period of the reference clock. This implementation is dead time free and suitable for applications with high conversion rates. However, TDC resolution is limited by the minimum delay element, which depends on the CMOS process technology. In [14], subdelay was achieved by interpolating consecutive delay stages with N resistors in between. In [15], an array of adjustable scaled load capacitors was used to achieve subdelay;
• Vernier: This aims to improve the TDC resolution beyond the minimum delay element. In this case, two delay lines oscillate at periods t1 and t2, with an initial phase shift φ0 corresponding to the fine interpolation delay to be measured [16][17][18]. Thus, the faster delay line catches the slower one after φ0/ΔT periods, where ΔT = t2 − t1 is the TDC resolution. The number of logic resources tends to be lower with respect to the Flash implementation, but the dead time (= t_Clk²/ΔT) dramatically increases as ΔT scales down or the dynamic range increases. To expand the range without penalizing the resolution, Reference [19] proposed a tapped 2D Vernier ring TDC, achieving a 1 ps timing resolution;
• Time Difference Amplification (TDA): The pulse corresponding to the time difference between the input hit edge and the reference clock (≤1 clock period) is amplified by an analog time stretcher, and hence, the resulting pulse can be converted with a lower resolution TDC [20]. The main challenges of this implementation are the linearity and the dead time, which constrain the amplification factor.

TDC Implementation Choice
The choice of the ASIC TDC fine interpolation stage architecture mainly depends on the conversion rate (maximum allowable dead time), resolution (bin size), power consumption, and technology node. The Flash architecture was chosen in this work since the pulse width of the incoming signal was in the few ns range, and each hit edge required a timing measurement (ToT). With a Vernier architecture, even using two independent conversion stages (one for each hit edge), the maximum acceptable dead time (100 ns) would imply oscillating at more than 1 GHz in order to achieve a 10 ps time bin, which is challenging in 180 nm technology.
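The 1 GHz figure follows from the Vernier dead time expression t_Clk²/ΔT given earlier. The sketch below simply inverts that expression; the function names are illustrative.

```python
import math


def vernier_dead_time_ps(clock_period_ps: float, bin_ps: float) -> float:
    """Worst-case Vernier dead time, t_clk^2 / delta_T."""
    return clock_period_ps ** 2 / bin_ps


def max_clock_period_ps(dead_time_ps: float, bin_ps: float) -> float:
    """Longest oscillation period that still meets the dead time."""
    return math.sqrt(dead_time_ps * bin_ps)


if __name__ == "__main__":
    period = max_clock_period_ps(dead_time_ps=100_000.0, bin_ps=10.0)
    # 1000 ps period corresponds to 1 GHz
    print(f"period <= {period:.0f} ps, f >= {1000.0 / period:.2f} GHz")
```

A 100 ns dead time and a 10 ps bin therefore require a period of at most 1 ns, that is, an oscillation frequency of at least 1 GHz.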

Overview
In this work, we present a 16-channel TDC ASIC prototype that provides ToF and ToT measurements. This chip is an evolution of MATRIX4 TDC ASIC [21], a four-channel TDC that provides ToF measurements using a patented technology [22]. The main contribution of this work is the Resistive Interpolation Mesh Circuit (RIMC), an improved Flash TDC architecture that allows improving the TDC resolution beyond the minimum delay element by using a combination of resistive interpolation and stochastic interpolation.
This paper is organized as follows: in Section 2, the building blocks of the chip are described; Section 3 describes the experimental setup; Section 4 shows the chip measurement results; Section 5 compares the ASIC performance of this work with state-of-the-art TDCs; and finally, in Section 6, the conclusions are drawn.

Building Blocks
MATRIX16 TDC receives 16 hit signals from a given frontend and converts each input pulse into two short pulses, one per edge. These short pulses latch the internal value of a coarse counter and the state of an array of coupled ring oscillators. The first gives the number of integer clock periods, while the second interpolates the clock phase. The captured data are then encoded, synchronized, buffered, and finally, serially transmitted with an LVDS driver. The block diagram of MATRIX16 is described in Figure 1, and the chip floor plan is shown in Figure 2.
As seen in the floor plan ( Figure 2), the TDC core consists of a group of four clusters and the SPI slave interface block, which allows modifying the ASIC configuration via software. Each cluster manages four channels, comprising the following building blocks:

Edge Detector
The aim of this block is to convert an input hit into two narrow pulses and thus measure both the rising and falling edges of this input hit. To do so, the input hit signal is XNORed with a copy of itself delayed by a very short time (∼300 ps). The output of this operation is a very short (∼300 ps) active low pulse every time an edge occurs at the TIME input. This signal is buffered and then sent to the Time Capture Matrix (TCM) and the Coarse Counter, where it triggers the TDC conversion.
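The XNOR operation can be illustrated with a discrete-time behavioral model. This is only a sketch: the sampled representation, the delay expressed in samples, and the assumption that the input was steady before the first sample are simplifications, not a description of the circuit.

```python
def xnor_edge_detector(samples: list[int], delay: int) -> list[int]:
    """XNOR of a binary signal with a delayed copy of itself.

    The output is 1 while both copies agree and drops to 0 for
    `delay` samples after every rising or falling edge.
    """
    out = []
    for i, value in enumerate(samples):
        # Before the first `delay` samples, assume a steady input.
        delayed = samples[i - delay] if i >= delay else samples[0]
        out.append(1 if value == delayed else 0)
    return out


if __name__ == "__main__":
    hit = [0, 0, 1, 1, 1, 1, 0, 0, 0]
    print(xnor_edge_detector(hit, delay=2))
```

In this example, the single input pulse produces two active low pulses, one per edge, each as wide as the delay.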
Moreover, this block implements a filter that allows the user to ignore those TIME pulses narrower than a certain pulse width (programmable) and, in this way, avoid very short pulses produced by dark noise and afterpulsing on the SiPMs [23], which may produce readout errors. The decision of whether to discard the event or not is made by the Backend Readout block, and therefore, this circuitry does not add any timing uncertainty to the input hit.

RIMC
The circuit shown in Figure 3 is a novel clock synthesizer composed of a ring oscillator array coupled by means of resistors, thus providing 56 clock phases of the system clock. These phases are organized into seven rows by eight columns of Delay Elements (DEs). Note that oscillation is achieved by inserting an odd number of rows and connecting the outputs of the last DEs to the inputs of the first DEs. One of the benefits of this architecture is the mesh structure, which partially mitigates any local effect (mismatch) of process variations, since the neighbors will absorb part of the variations of a given node.
The DE (see Figure 3b) contains a current-starved inverter, which fixes the row width to 1/14 of the system clock period (from 119 ps in ULP mode to 78 ps in HP mode) by means of the Phase-Locked Loop (PLL) Control Voltage (VCTL), while the resistor introduces a 20 ps subgate delay between adjacent columns (from left to right). The typical end-to-end delay between the first and the last column nodes of a given row is 160 ps, since there are eight columns of 20 ps each (the first column on the left is used as a dummy). This delay is fixed, and it only depends on the manufacturing process conditions. The value of the resistor that couples adjacent ring oscillators is selected in such a way that there are always two DEs switching in consecutive rows (one rising edge and one falling edge) when operating in the typical mode (800 MHz). This avoids any clock duty cycle mismatch between adjacent rows, and it allows the TDC to obtain time bins smaller than the 20 ps subdelay when combining the phase information of the measurements. Figure 4a,b shows the chronograms for the ULP and HP modes, respectively. It can be seen that the higher the RIMC oscillation frequency is, the higher the row overlapping and the smaller the bin size are. Conversely, when the RIMC frequency is reduced, the row overlapping decreases, and the bin size increases accordingly. This adjustment can also be used to compensate for subgate delay variations produced by changes in the RIMC resistor values due to wafer-to-wafer and run-to-run variations during the manufacturing process.
Figure 4. Chronogram of the RIMC nodes (sorted by rows) in ULP and HP modes. The phase delay between columns within the same row is static, whereas the delay between rows is dynamic (it depends on the oscillation frequency).
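The 1/14 factor can be understood from the ring structure: with seven rows, one oscillation period contains a rising and a falling transition through every row. The sketch below assumes exactly this relation (period = 2 × 7 row delays); the function name is illustrative.

```python
RIMC_ROWS = 7  # number of rows in the ring, as described above


def row_delay_ps(oscillation_hz: float) -> float:
    """Row delay, assuming one period equals 2 * RIMC_ROWS row delays."""
    period_ps = 1e12 / oscillation_hz
    return period_ps / (2 * RIMC_ROWS)


if __name__ == "__main__":
    for label, freq in (("TYP, 800 MHz", 800e6), ("HP, 920 MHz", 920e6)):
        print(f"{label}: {row_delay_ps(freq):.1f} ps per row")
```

At 920 MHz this gives about 78 ps per row, in agreement with the HP value quoted above; the 119 ps quoted for ULP mode would correspond to roughly 600 MHz under the same assumption.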

PLL
The system clock is obtained from the RIMC, acting as a Voltage-Controlled Oscillator (VCO) from the PLL point of view. The PLL block (see Figure 5) consists of a Phase-Frequency Detector (PFD), a Charge Pump (CP), and the Clock Manager (CM). The PFD generates charge and discharge pulses proportional to the clock phase shift between the external reference clock (CLK_REF) and an internal feedback clock (CLK_FB) [24]. These charge and discharge signals drive the gates of two transistors acting as current sources. The pll_Icp bit allows modifying the delivered current to the intVctl node, which is connected to an RC circuit acting as a low-pass filter. C1 is a 3 bit switched capacitor, which allows a tuning range from 4 to 32 pF. The intVctl drives an operational amplifier acting as a unity gain buffer. This buffer will drive the VCTL node of the four RIMCs in the ASIC. Finally, the CM allows selecting the operating frequencies for the following clocks: feedback (PLL M factor), Backend Readout, Serializer, and ASIC output, which can operate either in SDR (Single Data Rate) or DDR (Double Data Rate) mode.

Time Capture Matrix
Both edges of the TIME<15:0> inputs are converted into short pulses by the Edge Detectors. As seen in Figure 6, these pulses latch full custom flip-flops optimized for fast timing (mismatch variability optimization and 50% duty cycle of the input data). The output of these flip-flops will contain the phases of the clock matrix coming from the RIMC.

Coarse Counter
This block complements the fine counter and provides a 10 bit counter based on a ripple carry adder. The block layout was implemented in full custom mode to optimize the critical path delays (to ensure reliability when counting at 920 MHz in HP mode) and also to optimize power consumption and area. The counter provides between 1.11 (HP) and 1.71 (ULP) microseconds of dynamic range for both ToF and ToT. An external system, such as a microcontroller or FPGA, can easily extend the ToF dynamic range to an arbitrary value.
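The dynamic range follows directly from the counter width and the counting frequency. A minimal check, with an illustrative function name:

```python
def dynamic_range_us(counter_bits: int, clock_hz: float) -> float:
    """Time span covered by a free-running counter before rollover."""
    return (1 << counter_bits) / clock_hz * 1e6


if __name__ == "__main__":
    # 10 bit counter clocked at 920 MHz (HP mode)
    print(f"{dynamic_range_us(10, 920e6):.2f} us")
```

For the 10 bit counter at 920 MHz, this yields about 1.11 µs, matching the HP figure.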

Backend Readout
As seen in Figure 7, this block receives the digital representation of the incoming hits from both fine and coarse counters, then encodes, aligns, and filters (if necessary) the data, stores the events, and finally, sends the data to the Serializer. This block can operate at two frequencies (100 and 200 MHz in typical operating mode) depending on the required throughput.
The fine encoder block receives seven (one per row) encoded columns with the state of the RIMC when the hit occurred. The first nonzero column determines the offset (8 LSB per row) of the fine counter, and then, all the nonzero column values are summed, yielding a counter that ranges from 4 to 130 LSB. This combination of several row hits (stochastic TDC) makes it possible to compute TDC fine bins much smaller than the nominal subgate delay (20 ps) and supports different operating modes, so that users can optimize the trade-off between timing resolution and power consumption in each application.
The coarse counter alignment block allows synchronizing both coarse and fine counters (asynchronous). It receives the 10 bit coarse counter measurement and the (LSB_CHANGE) alignment bit. This bit contains the clock phase of the coarse counter when the capture was performed, and it is compared with the fine counter. If a mismatch is detected (fine counter close to full scale and LSB_CHANGE=0), the coarse counter value is decreased by 1 LSB, and hence, the counters are synchronized.
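The alignment rule can be sketched as follows. The text does not quantify what "close to full scale" means, so the `margin` parameter, the full-scale value passed in the example, and the modulo wraparound of the 10 bit counter are assumptions of this sketch.

```python
COARSE_BITS = 10  # coarse counter width


def align_coarse(coarse: int, fine: int, lsb_change: int,
                 fine_full_scale: int, margin: int) -> int:
    """Return the coarse value synchronized with the fine counter.

    If the fine counter is within `margin` codes of full scale
    (hypothetical threshold) and LSB_CHANGE is 0, the coarse counter
    has already advanced, so it is decreased by 1 LSB.
    """
    near_full_scale = fine >= fine_full_scale - margin
    if near_full_scale and lsb_change == 0:
        return (coarse - 1) % (1 << COARSE_BITS)
    return coarse


if __name__ == "__main__":
    print(align_coarse(coarse=5, fine=128, lsb_change=0,
                       fine_full_scale=130, margin=8))
```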
The event builder receives the aligned fine and coarse counters and the Edge Detector TIME_FILTERED signal (see Section 2.1) after being synchronized. Once both the rising and falling edges are captured, the event is ready to be sent, and the data are stored into a 16-event FIFO. One event consists of 5 bytes: channel identifier, coarse and fine ToF/ToT, and debug bits.
Finally, the event transmitter block converts events into bytes and manages the data transmission protocol: it adds the Start-of-Packet and End-of-Packet bytes before and after the event transmission and the Idle byte when there is no activity. A chronogram example can be seen in Figure 8.

Serializer
This block performs the 8:1 parallel-to-serial conversion and transmission either in SDR or DDR mode. Serialized bits are driven by a Low-Voltage Differential Signaling (LVDS) driver with an adjustable differential mode current. Data transmission was successful at 920 Mbps in HP mode, even with the minimum differential current (0.35 mA, 0.65 mW power consumption).

Methods
This section provides an overview of the experimental setups employed to evaluate the performance of the MATRIX16 ASIC. The control and Data Acquisition (DAQ) system was based on two Printed Circuit Boards (PCBs): The first one was the motherboard, which had an Intel MAX 10 FPGA, a USB interface, voltage regulators, and interface connectors. The second PCB (mezzanine) contained the ASIC socket and the corresponding power regulators, which can be bypassed when an external power supply is used to characterize the ASIC using different supply voltages. Both boards were coupled by means of an LSHM connector, as seen in Figure 9. The FPGA controls the ASIC via SPI and acquires data from the Serializer outputs, then performs the communication with a host PC via the USB protocol.
The test bench to calibrate the ASIC and perform jitter measurements is depicted in Figure 10. A very stable clock is generated by a Pulse Pattern Generator (PPG) (Agilent 81110A), which produces a 100 MHz reference clock (CLK_REF) for the typical operating mode. This frequency varied according to the operating mode under test. The MATRIX16 ASIC has an external trigger pin that can be internally redirected to each of the 16 ASIC channels, hence simplifying the measurement setup. This external trigger input can be connected to either Trigger_rnd (from the FPGA) or Trigger_syn (from the PPG). The power supply (Agilent E3631A) configuration and current consumption measurements, as well as the PPG control, were also automated using the GPIB protocol.
For the calibration measurements, the FPGA generates Trigger_rnd, which is uncorrelated with the CLK_REF generated by the PPG. This allowed analyzing the statistical behavior of static Process, Voltage, and Temperature (PVT) variations, thus obtaining calibration tables and computing linearity for each ASIC. For the jitter measurements, both CLK_REF and Trigger_syn are provided by the PPG. This generator produces a pulse in phase with the clock, whose delay can be electronically controlled via GPIB.

Linearity Test
The purpose of this test was twofold: On the one hand, it characterized the effects of static variability (temperature, IR drops, process, and mismatch), both within-die and die-to-die [25]. On the other hand, it provided calibration data, which help to reduce the linearity error of the TDC and therefore improve the timing resolution.
Calibration was performed by means of a code density test [26]. This test consisted of producing a very large number (200 k in this case) of random pulse hits following a uniform distribution at the TDC input channels. Such a number of repetitions would reduce statistical fluctuations due to dynamic effects such as jitter. The binary code corresponding to wider TDC bins (slower stages) would occur more often than the narrower ones due to the uniform distribution of the incoming hits. TDC bin sizes can be obtained by normalizing the number of hits of each TDC bin to the RIMC oscillation period (see Equation (1)), since the sum of the hits for all codes is equivalent to the total number of hits.
Finally, we can obtain the Differential Nonlinearity (DNL) and the Integral Nonlinearity (INL) of each TDC channel (see Equations (2) and (3)), which would show the statistical impact of the mismatch on our TDC.
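The code density procedure can be sketched in a few lines. The functions below use the textbook definitions (bin width proportional to the hit count, DNL relative to the mean bin width, INL as the running sum of the DNL); whether these match the exact form of Equations (1)-(3) is an assumption, and the function names and example histogram are illustrative.

```python
def bin_widths_ps(hits: list[int], period_ps: float) -> list[float]:
    """Bin widths from a code density histogram: each bin takes a
    share of the oscillation period proportional to its hit count."""
    total = sum(hits)
    return [period_ps * n / total for n in hits]


def dnl_inl(widths_ps: list[float]) -> tuple[list[float], list[float]]:
    """DNL and INL in LSB units, taking the mean bin width as 1 LSB."""
    lsb = sum(widths_ps) / len(widths_ps)
    dnl = [(w - lsb) / lsb for w in widths_ps]
    inl, acc = [], 0.0
    for d in dnl:
        acc += d          # INL is the cumulative sum of the DNL
        inl.append(acc)
    return dnl, inl


if __name__ == "__main__":
    # Hypothetical 4-code histogram over an 800 ps period
    widths = bin_widths_ps([100, 300, 200, 200], 800.0)
    print(widths, dnl_inl(widths))
```

Wider bins collect more hits, so the first bin in this example (100 hits) comes out at half the mean width and the second (300 hits) at one and a half times the mean.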

Jitter
Jitter was measured by injecting N synchronous pulse shots (20 k in the current test) with the PPG and measuring the standard deviation of the ToF measurement. This procedure was repeated in 5 ps steps within the full dynamic range of the fine counter. The objective of the sweep was to obtain a more representative sampling of the jitter within the whole fine counter transfer function than a single measurement in an arbitrary phase. The jitter of a given channel was obtained by computing the quadratic mean of the jitter measurements.
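The jitter computation can be sketched as follows: one standard deviation per delay step, then the quadratic mean over all steps. The use of the population standard deviation is an assumption (the text does not specify the estimator), and the function names are illustrative.

```python
import math
import statistics


def step_jitter_ps(tof_samples_ps: list[float]) -> float:
    """Standard deviation of the ToF codes recorded at one delay step."""
    return statistics.pstdev(tof_samples_ps)


def channel_jitter_ps(step_sigmas_ps: list[float]) -> float:
    """Quadratic mean (RMS) of the per-step jitter values."""
    mean_square = sum(s * s for s in step_sigmas_ps) / len(step_sigmas_ps)
    return math.sqrt(mean_square)


if __name__ == "__main__":
    # Hypothetical per-step sigmas from a delay sweep
    print(f"{channel_jitter_ps([3.0, 4.0]):.2f} ps")
```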

Experimental Results
This section shows the linearity, jitter, and power consumption measurements of the MATRIX16 chip prototype in different operating modes. The voltage and frequency settings under each operating mode (profile) are detailed in Table 1. HP mode pursues the maximum chip performance (timing resolution), while ULP mode focuses on optimizing power consumption. The intermediate modes (TYP and LP) try to reach a trade-off between power and performance.
The number of chip samples for the characterization was 15, leading to a population of 240 channels. The typical values shown in the plot legends in Figures 11-14 (σ_Typ) correspond to the quadratic mean of the 240 channels for each measurement.
Figure 11 shows the DNL standard deviation distribution for the 240 channels, showing that the DNL standard deviation was typically around 2/3 of its corresponding fine counter LSB in all the operation modes. Figure 12 shows the maximum INL, which was typically around 2 LSB (3 LSB in the worst case). Figure 13 shows the maximum bin width that we could obtain from each TDC channel. This measurement allowed determining what the single-shot precision was in the worst scenario: around 3.5 LSB. It is important to mention that once the corrections obtained from calibration data were applied, the DNL's standard deviation was reduced to 2-4 ps, and the maximum INL became negligible.
Figure 14 shows the ToF jitter standard deviation distribution for the 240 channels. It can be seen that for the same power supply voltage (TYP and HP modes), jitter linearly increased with the RIMC period, while it dramatically increased as the RIMC power supply scaled down (LP and ULP modes).
Table 2 shows the typical power consumption for each ASIC power domain and performance profile, at a 100 kHz conversion rate. Most of the power consumption did not depend significantly on the ASIC conversion rate since the RIMC, which was always on, took around 60 to 70% of the power budget. Other circuits, such as the LVDS Serializer output lines, were also always on. The total ASIC power consumption was 46.5 mW (2.9 mW/ch) in ULP, 80.4 mW (5.0 mW/ch) in LP, 131 mW (8.2 mW/ch) in TYP, and 146 mW (9.1 mW/ch) in HP mode.

Discussion
As seen in the state-of-the-art review, FPGA and ASIC TDCs can offer similar performance in terms of TDC bin size and resolution. The advantages of FPGAs are a faster development time, lower prototyping cost, and higher flexibility. However, the power consumption requirements (<10 mW/ch) are too stringent for FPGA TDC designs, where silicon is not optimized for such purposes. The unit price and chip area are also key limiting factors for building large PET systems with thousands of TDC channels and high channel density requirements. Moreover, ASIC TDCs can be integrated into the same substrate where the sensor frontend readout circuitry is implemented, leading to very compact System-on-Chip (SoC) solutions [27,28] or digital SiPMs [29].
There are many multilevel TDC ASIC implementations in the literature. Each implementation type aims to optimize a given specification, and this makes it difficult to draw a fair comparison of the proposals. For this reason, Table 3 restricts the performance comparison to recent flash TDC implementations and our work. The Figure-of-Merit (FoM), defined in Table 3c, allows benchmarking the different proposals, where the minimum FoM corresponds to the TDC with the best combination between timing resolution and power consumption.
MATRIX16 not only increases channel density with respect to MATRIX4 [21], but also integrates new functionalities. The most important one is the ToT measurement, which increases the number of target applications for this ASIC. Event filtering by the pulse width reduces the number of dark count pulses to be processed by the readout system, and 4:1 multiplexed data links allow transmitting ToF+ToT information from 16 channels with the same number of Serializer links as MATRIX4. Moreover, the Backend Readout data encoding improvements slightly improved the timing resolution (from 10.1 to 9.5 ps, without calibration), and the low-power digital design techniques reduced power consumption (from 11.3 to 9.1 mW per channel). It is important to highlight that the ASIC presented in this work achieved a <10 ps timing resolution with less than 10 mW of power consumption per channel, which was one of the major constraints in the choice of the TDC architecture.
The most similar work to our proposal is PicoTDC [30], the flash TDC ASIC with the best FoM (69 fJ/conv) and timing resolution (3.4 ps without calibration), where resistive interpolation was also used to achieve subdelay elements. The main differences between PicoTDC and our proposal were the resistive mesh topology and the technology node: 65 nm in PicoTDC vs. 180 nm in MATRIX16. Even with this gap between the technologies, which penalizes the minimum TDC bin size and power consumption, the FoM achieved in this work (HP mode, 86 fJ/conv) was close to that of PicoTDC. Our design therefore clearly has room for improvement if implemented in a more advanced technology node.
The work in [15] is also remarkable: it achieved 9.8 ps of resolution with 12 mW/ch (FoM of 119 fJ/conv), which is especially difficult in a 350 nm technology, with slower and more power-demanding transistors.
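The FoM values quoted above can be approximately reproduced with a simple product. The definition used here (per-channel power multiplied by the uncalibrated timing resolution) is an assumption, since the formal definition is given in Table 3c; it is consistent with the 86 fJ/conv quoted for HP mode, but small deviations for other works may stem from rounding of the inputs.

```python
def fom_fj_per_conv(power_mw_per_ch: float, resolution_ps: float) -> float:
    """Assumed figure of merit: power per channel times timing
    resolution. mW * ps = 1e-3 W * 1e-12 s = 1e-15 J = 1 fJ."""
    return power_mw_per_ch * resolution_ps


if __name__ == "__main__":
    # MATRIX16 in HP mode: 9.1 mW/ch and 9.5 ps (without calibration)
    print(f"{fom_fj_per_conv(9.1, 9.5):.1f} fJ/conv")
```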

Conclusions
A 16-channel TDC ASIC was designed, implemented, and tested. One of the key features of this chip is that it achieved an 8 ps time resolution (after calibration) with 9 mW/ch and a peak conversion rate of 50 MHz, making it suitable for ultra-fast timing applications with moderate power consumption. The RIMC overlapping flexibility allows working under different modes, thus optimizing the trade-off between power consumption and timing resolution. In fact, assuming the excellent SPTR of a state-of-the-art frontend (e.g., 60 ps sigma [8]), the timing degradation produced by MATRIX16 in LP mode (20 ps) was very small (about 3.2 ps when both contributions are added in quadrature), while the power consumption was reduced by almost 50% (5.0 mW/ch).
The major obstacle that prevents using this ASIC in ULP mode (3 mW/ch) and beyond is related to the RIMC clock jitter, which was dramatically degraded as the RIMC power supply (the largest power consumption contribution) was scaled down. Further research should aim to keep an acceptable clock jitter (<20 ps) even when the power supply is lowered to 1.2 V (the nominal V_DD is 1.8 V).
This ASIC was designed specifically to be the backend readout chip of the HR-FlexToT ASIC. Thus, both chips can be easily integrated into a system-in-package, which opens the door to building large PET systems with a high channel density while maintaining a low power consumption. Moreover, multiplexing the channel data outputs reduced the number of serializers by a factor of four.