Motion-Based Object Location on a Smart Image Sensor Using On-Pixel Memory

Object location is a crucial computer vision method often used as a previous stage to object classification. Object-location algorithms require high computational and memory resources, which poses a difficult challenge for portable and low-power devices, even when the algorithm is implemented using dedicated digital hardware. Moving part of the computation to the imager may reduce the memory requirements of the digital post-processor and exploit the parallelism available in the algorithm. This paper presents the architecture of a Smart Imaging Sensor (SIS) that performs object location using pixel-level parallelism. The SIS is based on a custom smart pixel, capable of computing frame differences in the analog domain, and a digital coprocessor that performs morphological operations and connected-components labeling to determine the bounding boxes of the detected objects. The smart-pixel array implements on-pixel temporal difference computation using analog memories to detect motion between consecutive frames. Our SIS can operate in two modes: (1) as a conventional image sensor and (2) as a smart sensor which delivers a binary image that highlights the pixels in which movement is detected between consecutive frames, together with the object bounding boxes. In this paper, we present the design of the smart pixel and evaluate its performance using post-layout simulations with parasitic extraction on a 0.35 µm mixed-signal CMOS process. With a pixel pitch of 32 µm × 32 µm, we achieved a fill factor of 28%. To evaluate the scalability of the design, we ported the layout to a 0.18 µm process, achieving a fill factor of 74%. On an array of 320×240 smart pixels, the circuit operates at a maximum frame rate of 3846 frames per second. The digital coprocessor was implemented and validated on a Xilinx Artix-7 XC7A35T field-programmable gate array that runs at 125 MHz, locates objects in a video frame in 0.614 µs, and has a power consumption of 58 mW.


Introduction
Computer vision is a discipline that has gained an important place in data analysis in scientific and industrial applications. Among the applications of computer vision are obstacle detection [1] and position and speed estimation for accident avoidance [2] in driverless cars, pedestrian detection using infrared cameras for surveillance [3,4], autonomous underwater monitoring for detecting life on the seabed [5], improvements in the food industry using real-time smart machines and predictive models [6], and real-time pupil localization for driver safety improvements [7,8], among others.
From the point of view of engineering applications, one of the fundamental tasks of computer vision is object detection [9,10]. Object detection has the goal of determining the positions of objects in an image (object location) and determining their semantic categories (object classification). Our smart pixel stores, in an analog on-pixel memory, the pixel data from the previous frame. As a result, the design computes the frame difference for all the pixels in the imager in parallel during the integration time used to acquire the pixel data. The proposed SIS consists of a heterogeneous architecture composed of an analog stage and a digital stage. In the analog stage, the SIS calculates the difference between consecutive frames, which is the first step of motion-based object location, on the bidimensional smart-pixel array. In the digital stage, the SIS includes a digital coprocessor that computes morphological transformations on the output image and uses a connected-components algorithm to label the objects in the image.
Using a 32 µm × 32 µm pixel, the extra capacitor and switches added to the CTIA to compute frame differences reduce the fill factor (the fraction of the die area dedicated to the photodetector) from 47.6% to 28.1% in a 0.35 µm TSMC process. To evaluate the scalability of our design, we implemented the smart pixel on a 0.18 µm process, achieving a fill factor of 74%, compared to 86.1% for a traditional pixel that uses a conventional CTIA. Using an array of 320 × 240 pixels, the SIS acquires and computes frame difference at 60 frames per second (fps). Running at 125 MHz, the digital coprocessor uses the frame differences to detect objects in the image in 0.614 µs and consumes 58 mW of power.
The rest of the paper is organized as follows. Section 2 introduces and discusses previous works related to our smart image sensor. We describe the object location algorithm implemented in our SIS in Section 3. In Section 4, we describe the architecture of the smart pixel and the proposed SIS and the architecture of the digital coprocessor. Section 5 describes the resulting area of the smart pixel, the resource utilization of the digital coprocessor, and the simulation results. Finally, Section 6 concludes the paper.

Related Work
Object-location and classification algorithms are usually implemented on high-performance, high-power hardware platforms such as General-Purpose Processors (CPUs) or Graphics Processors (GPUs) [44,45]. This can be acceptable in a wide variety of solutions but is normally inadequate for low-power applications on embedded or mobile devices. In these cases, special-purpose processing systems on dedicated hardware can achieve high speed and portability with low power. These designs are normally implemented on Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs).
In recent years, many FPGA-based Convolutional Neural Network (CNN) architectures have been proposed for object location and classification [46][47][48][49][50]. A common disadvantage of these solutions is the limited capacity of on-chip FPGA memory, which is insufficient to store the large number of parameters required by the CNN. Storing these parameters in external memory reduces the throughput of the implementation, therefore limiting the application of FPGAs for small form-factor object detection in real time. This issue was addressed by Long et al. [51], who implemented an FPGA-based object detection algorithm based on multi-frame information fusion. Their algorithm uses a reduced number of parameters for HOG-based object location and Support Vector Machines (SVMs) for classification, and achieves a throughput of up to 10,000 fps. Nakahara et al. [52] presented an object detection algorithm based on a multiscale sliding-window location search, which binarizes the CNN parameters to reduce their memory requirements, and enables the implementation of the complete network using only on-chip memory. Despite the throughput improvement achieved by using on-chip parameters, all the solutions described above read the image pixels as a serial stream from the image sensor. This has the effect of increasing the latency and limiting the data parallelism available to the algorithm, compared to having access to all the pixels simultaneously. Moreover, algorithms that access the image data serially require line buffers or even entire frame buffers, further increasing the memory requirements of the hardware platform.
As discussed in Section 1, SISs are special-purpose image sensors that combine conventional imagers with additional circuitry to process pixel data on the same chip. When an SIS is designed with computational circuits in every pixel (smart pixels), it can exploit the pixel-level parallelism available in the image-processing algorithm. This shortens latency, increases throughput, and reduces the memory requirements of the solution [30,31,43,53]. To further reduce power and area, SISs typically use analog circuits to store and process the data [30,40]. For example, Lee et al. [40] presented an SIS with embedded object detection that uses a reconfigurable pixel array capable of computing frame differences and spatial gradients. The SIS uses a capacitor in every pixel that acts as an analog memory to compute frame differences. Choi et al. [30] use a similar approach to implement motion-triggered object detection. To reduce circuit area and improve fill factor, they use the same capacitor for two horizontally adjacent pixels and alternate its use between odd and even frames, thus trading motion-detection horizontal resolution for fill factor. An alternative technique to improve fill factor is to perform computation during photocurrent integration using an intelligent Readout Circuit (iROIC) [54]. For example, the SIS presented by Gottardi et al. [55] computes local gradients using this technique to implement a lightweight version of Local Binary Patterns (LBP). Our own previous work [43] also uses an iROIC to perform face recognition in visible-range and IR image sensors using an array of smart pixels based on a configurable CTIA, reducing precision by only 1% compared to a fully digital implementation.
Implementing most of the computation at the pixel level using smart pixels reduces the die area available for the photodetector in each pixel, thus reducing the fill factor of the imager. To mitigate this effect, several SISs implement part of the computation at the column level, thus improving fill factor at the cost of reducing parallelism and increasing memory requirements. For example, Jin et al. [33] designed an SIS that computes edge detection using column-level circuits and static memory. Young et al. [56] presented an SIS for object detection that combines pixel- and column-level processing to compute image features based on HOG. Their SIS eliminates redundant illumination data during readout, thus compressing the HOG feature descriptors by up to 25 times compared to a conventional 8-bit readout. Kim et al. [37] detect and recognize faces by combining, on a single chip, a standard imager architecture and a mixed-signal CNN that implements its first layer in the analog domain. Computing part of the operations of the algorithm using analog circuits degrades the accuracy by 1.3%, but it also reduces the power consumption by 15.7% because Analog-to-Digital Converters (ADCs) are one of the most power-consuming elements in standard CIS [57]. The SIS presented by Zhong et al. [29] computes edge detection and omnidirectional LBP using column-level circuits and an array of capacitors capable of storing two rows of the image. Our own previous work [43] computes LBP using a combination of on-pixel and column-level processing. An array of smart pixels performs the comparisons between adjacent pixels and outputs a binary value, which is used by column-level circuits to construct the LBP features. The single-bit output of the smart pixel allows us to reduce the memory required by the line buffers and the time to read the data from the pixel array, and improves the fill factor by moving a significant part of the computation to the column-level circuit.
The discussion above shows that SISs are a viable alternative to digital processors to achieve the low power and high performance required by computer vision on mobile devices, while achieving comparable precision [37,43]. An SIS can exploit the pixel-level parallelism of the algorithm, but the area used by the processing circuits limits the fill factor of the SIS. This can be mitigated by performing part of the computation during photocurrent integration and by moving computation to column-level circuits. The design presented in this paper uses both techniques to build a two-mode imager that operates as a conventional sensor and computes object location, using a configurable CTIA suitable for the thermal IR range.

Object-Location Algorithm
Our SIS performs object location using a multi-frame approach, which considers the information of consecutive video frames to locate moving objects [58,59]. Algorithm 1 summarizes the motion-based object location algorithm implemented by the SIS. It first computes the frame difference between pixels in consecutive video frames and compares this difference to a threshold to discriminate object pixels from the background. Then, the algorithm applies morphological operations to remove motion-detection artifacts. Finally, a connected-components algorithm computes the object bounding boxes from the motion pixel data. Figure 1 illustrates the operation of the algorithm, showing an input image and the output of each stage.
Algorithm 1: Motion-based object location
Input: input frame im k,m×n, previous input frame im k−1,m×n, threshold THR
Output: output frame of highlighted object pixels im dlt, labeled objects im lb, and bounding boxes
1. Compute the thresholded frame difference im dlt between im k and im k−1
2. Apply morphological opening (erosion, Equation (1), followed by dilation, Equation (2)) to im dlt
3. Label the connected components to obtain im lb and update the bounding boxes
4. return the image with highlighted objects im dlt, the labeled objects im lb, and the bounding boxes

The frame-difference stage of the algorithm uses the approach presented by Bhanu et al. [60]. This algorithm detects pixels that belong to moving objects by comparing their values in consecutive frames. When the difference between the pixel values in two consecutive frames is higher than a predefined threshold, the algorithm assumes that the pixel belongs to a moving object. This threshold is an application-dependent sensitivity parameter that can be adjusted manually by the user. As shown by Yin et al. [61], an adequate threshold value can be determined by training the algorithm using a set of image data from the application environment.
The first stage of the algorithm computes the absolute frame-difference: it computes the absolute difference between corresponding pixels in two consecutive video frames. The next stage compares each pixel difference to an application-defined threshold: if the absolute frame difference is greater than the threshold, we identify it as a movement pixel and assign it a logic label of value 1. Otherwise, the pixel is labeled as 0. Figure 1c illustrates the output of the threshold stage. The figure shows that this method can produce isolated labels due to abrupt changes in pixel values. To compensate for this, it is common to add a morphological operation stage. We apply an image opening operation, which consists of image erosion followed by dilation. Figure 1d shows that this considerably reduces the number of isolated labels. Finally, Figure 1e shows, using different colors and bounding boxes, the output of the connected components algorithm, which labels the objects found in the image. From this point, it is possible to further extend image analysis to process shapes, single objects, and more.
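The first two stages above can be sketched in a few lines of software. This is an illustrative model only; the names (`motion_mask`, the toy frames) are ours, not part of the SIS implementation:

```python
import numpy as np

def motion_mask(frame, prev_frame, thr):
    """Label a pixel as movement (1) when the absolute frame
    difference exceeds the threshold, and as background (0) otherwise."""
    # Cast to a signed type so the subtraction of 8-bit pixels cannot wrap.
    diff = np.abs(frame.astype(np.int32) - prev_frame.astype(np.int32))
    return (diff > thr).astype(np.uint8)

# Toy 8-bit frames: one pixel changes by 60, another by only 5.
prev = np.array([[10, 10], [10, 10]], dtype=np.uint8)
curr = np.array([[70, 10], [15, 10]], dtype=np.uint8)
mask = motion_mask(curr, prev, thr=20)
# Only the pixel that changed by more than the threshold is labeled.
```

In the SIS, the subtraction happens in the analog domain inside the pixel and the comparison in the A-THR block; this sketch only mirrors the arithmetic.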
To perform erosion and dilation, we used 3 × 3-pixel kernels. Since these morphological operations use single-bit pixels, the computation is reduced to simple logical AND and OR operations. Image erosion replaces the center pixel in a 3 × 3 window with the minimum value in the window. With binary images, erosion replaces the pixel with logical 0 if there is at least one pixel equal to 0 in the window (AND), as described in Equation (1):

im er(i, j) = ⋀ (u,v) ∈ W3×3 im(i + u, j + v)    (1)

Image dilation replaces the center pixel in a 3 × 3 window with the maximum value in the window. With binary images, dilation replaces the pixel with logical 1 if there is at least one 1 pixel in the window (OR), as described in Equation (2):

im dl(i, j) = ⋁ (u,v) ∈ W3×3 im(i + u, j + v)    (2)

The third stage of the algorithm computes the bounding boxes for the objects located in the image in a single pass using a raster-scan connected-components algorithm [62]. Figure 2 illustrates the operation of the algorithm. For each movement pixel in the image, the algorithm looks at its north, northwest, and west neighbors. If none of them is also a movement pixel, the algorithm assigns a new object label to the pixel, as shown in Figure 2a,b. Otherwise, if the neighboring movement pixels are part of the same connected component, the algorithm assigns the same object label to the new pixel, thus adding it to the connected component (Figure 2c). If the neighboring movement pixels belong to different connected components, the algorithm assigns one of the labels to the new pixel and merges the connected components by adding a new entry to the equivalence table (Figure 2d). The base procedure of the algorithm is described as the priority-OR operation in Equation (3):

L = L w if L w ≠ L 0; else L nw if L nw ≠ L 0; else L n if L n ≠ L 0; else a new label    (3)

where L 0 is the label for no-object pixels, and L w, L nw, and L n are the labels of the west, northwest, and north pixels, respectively. Every time the algorithm creates a new connected component or adds a pixel to an existing component, it updates the coordinates of its bounding box in a table.
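The raster-scan labeling with an equivalence table can be sketched in software as follows. This is a behavioral model under our own naming (`label_objects` and the union-find helpers are not from the paper); unlike the hardware, which updates bounding boxes on the fly, the sketch resolves the equivalence table in a final cleanup pass for clarity:

```python
import numpy as np

def label_objects(mask):
    """Raster-scan connected-components labeling with an equivalence
    table. Returns a label image and bounding boxes (r0, c0, r1, c1)."""
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    parent = {}                       # equivalence table (union-find)

    def find(a):
        while parent[a] != a:
            a = parent[a]
        return a

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    next_label = 1
    for r in range(h):
        for c in range(w):
            if not mask[r, c]:
                continue
            # Collect labels of the west, northwest, and north neighbors.
            neigh = []
            if c > 0 and labels[r, c - 1]:
                neigh.append(labels[r, c - 1])
            if r > 0 and c > 0 and labels[r - 1, c - 1]:
                neigh.append(labels[r - 1, c - 1])
            if r > 0 and labels[r - 1, c]:
                neigh.append(labels[r - 1, c])
            if not neigh:                       # isolated pixel: new label
                labels[r, c] = next_label
                parent[next_label] = next_label
                next_label += 1
            else:                               # reuse a label, merge the rest
                labels[r, c] = neigh[0]
                for other in neigh[1:]:
                    union(neigh[0], other)

    # Cleanup pass: resolve equivalences and accumulate bounding boxes.
    boxes = {}
    for r in range(h):
        for c in range(w):
            if labels[r, c]:
                root = find(int(labels[r, c]))
                labels[r, c] = root
                r0, c0, r1, c1 = boxes.get(root, (r, c, r, c))
                boxes[root] = (min(r0, r), min(c0, c), max(r1, r), max(c1, c))
    return labels, boxes

# A U-shaped region whose two arms only meet on the second row.
mask = np.array([[1, 0, 1],
                 [1, 1, 1]], dtype=np.uint8)
labels, boxes = label_objects(mask)
```

The U-shaped example exercises the merge case of Figure 2d: the two arms first receive different labels, which the equivalence table later unifies into one component.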
When the algorithm merges two connected components, it updates the bounding box of the resulting component.

Figure 3 shows the architecture of our proposed SIS. The SIS supports two operation modes: standard imager and motion-based object location. The core blocks are an array of smart pixels with local computational resources, an analog comparator (A-THR), and a digital coprocessor. In the standard mode, the pixel array acquires image data as a conventional image sensor, row- and column-select circuits sequentially read the pixel data, and an ADC produces digital values as a pixel stream. In object-location mode, the SIS configures the smart pixels to compute the frame difference, uses the A-THR to evaluate motion on each pixel, and uses the digital coprocessor to determine the objects in the scene.

Figure 3. Architecture of the proposed SIS. An array of smart pixels outputs either the pixel value or the frame difference. The A-THR module determines whether the absolute value of the frame differences exceeds an application-defined threshold. The digital coprocessor computes image opening to improve the object location and uses a connected-components algorithm to detect objects in the image and compute their bounding boxes. The digital coprocessor can be configured to output the original image or the binary image and the bounding boxes for the objects.

Figure 4 shows a block diagram of our object-location algorithm and the hardware modules of the proposed SIS that perform each step of the algorithm. In the center, the figure shows the three core blocks of the SIS: the smart-pixel array, an analog comparator (A-THR) core, and the digital coprocessor. The first step of the algorithm is the temporal difference calculation implemented in the smart-pixel array. The second step is motion estimation, implemented on a custom analog-threshold circuit. The following steps correspond to the binary-image erosion and dilation and the connected components implemented on the digital coprocessor.
With this configuration, the SIS computes the frame difference simultaneously across the array. In object-location mode, the array acquires and stores the image data for the current frame and computes the difference between the current frame and the stored data for the previous frame. Then, the row- and column-select circuits sequentially read the frame differences and send them to the A-THR, which determines if the absolute difference of each pixel in the image is greater than the application-defined threshold. The output of the A-THR is a single bit for each pixel that serves as input to the digital coprocessor. The coprocessor computes the binary erosion and dilation and the connected components, and outputs the bounding boxes and the binary image.

Figure 5 shows a block diagram of our smart pixel, composed of a photodetector, switches for input enable and row select, and a programmable CTIA. The input of the CTIA is connected to a photodetector. The CTIA uses the signals NegInt, PosInt, and BuffSL to integrate the photodetector current. The resulting voltage represents the value of the pixel in a frame or the difference between the pixel values in the current and past frames. A schematic view of the smart pixel is shown in Figure 6. The smart pixel is based on a conventional CTIA used for photodetector current integration, where we replace the single integration capacitor with two identical capacitors that act as a double buffer and six CMOS switches that select the active buffer and control the integration direction. The switches are controlled by three configuration signals that allow our custom CTIA to compute the difference between the pixel values in two consecutive frames during integration.

Smart Pixel
Figure 6. Configurable CTIA. The output voltage of the CTIA represents either the pixel value or the difference between the pixels in the current and past frame. Our configurable CTIA includes two integration capacitors of equal size, C int1 and C int2, which are used as a double buffer to integrate and compute the frame difference.
As described in [63][64][65][66], the CTIA is a preferred approach in two scenarios: low-light environments and IR cameras. Among the various types of readout circuits available in the literature, the CTIA configuration uses more area than many others. However, as we described in our previous work [43], the CTIA has the following advantages: (1) low input impedance for good injection efficiency with weak photodiode currents, (2) less frame-to-frame latency, (3) wider linear output voltage range [67], and (4) reduced noise through better control of the photodiode bias [68].
The operation of our smart pixel in conventional mode, also referred to as direct mode, is shown in Figure 7a. In conventional mode, the operation of the smart pixel is equivalent to a conventional CTIA. For this operation, the smart pixel sets the bias voltage to 0 V, and the CTIA integrates the input current on the capacitor C int1. As shown in Figure 7b, in direct mode the smart pixel works as a conventional CTIA: switches sw1, sw4, and sw5 are closed, and sw2, sw3, and sw6 are open. Equation (4) describes the output value at the end of the integration time:

V = (I · ∆t) / C int1    (4)

where V is the voltage at the output of the smart pixel, I is the current from photodetector PD1, ∆t is the integration time, and C int1 is the capacitance.

The operation of the smart pixel when computing the frame difference in the pixel is shown in Figure 8. The smart pixel sets the global bias input at the midpoint of the operation voltage. During a single video frame, the circuit operates in two stages, store and subtract, assigning one half of the integration time to each. During the store stage, the circuit integrates the input current into one of the capacitors, which will be used in the next video frame. During the subtract stage, the CTIA subtracts the input current from the second capacitor, which stores the pixel value of the previous frame. These stages operate in a slightly different way during odd and even frames because capacitors C int1 and C int2 operate as a double buffer: C int1 is used to store pixel data (store phase) during an odd frame and to subtract the current pixel value (subtract phase) during an even frame. Conversely, C int2 stores pixel data during an even frame and subtracts the current pixel value during an odd frame. Figure 8a,b show the equivalent circuits during an odd video frame.
Here, the store and subtract phases integrate the input current in the positive direction in both capacitors: sw2 and sw3 are closed, and sw1 and sw4 are open. During the store phase, shown in Figure 8a, sw5 is closed and sw6 is open to integrate the input current on C int1. During the subtract phase, shown in Figure 8b, sw5 is open and sw6 is closed to integrate (subtract) the input current on C int2, which contains the pixel value acquired in the previous frame. At the end of the frame time, the voltage across C int1 represents the current pixel value to be used in the next frame, and the voltage across C int2 is the frame difference between the current and previous frames. Figure 8c,d show the equivalent circuits during an even video frame, which integrate the input current in the negative direction in both capacitors. The store phase integrates on C int2 and the subtract phase uses C int1; the state of switches sw1-sw6 is the complement of the odd frames. At the end of an even frame, the voltage across C int1 represents the frame difference between the current and the previous frame.
At the end of the integration time, the CTIA outputs the voltage across capacitor C int2 for odd frames and C int1 for even frames, which represents the frame-difference value. The voltage across the capacitors at the end of the integration time is shown in Equation (5):

V c = ± (I k − I k−1) · ∆t s / C int    (5)

where V c is the output voltage, k is the current frame index, I k is the input current during frame k, I k−1 is the current during the previous frame k − 1, ∆t s = ∆t/2 is the integration time of each phase, C int is the capacitance of C int1 and C int2, and the sign alternates between odd and even frames. After the frame-difference values are read by the A-THR core, the circuit resets the capacitor that holds the frame-difference value.
Note that because we store and subtract in different directions during odd and even frames, the frame difference in V c has a different sign in consecutive frames. This does not affect the results of the algorithm because the next stage uses the absolute value of the difference. Alternating the sign of the frame differences allows us to configure the CTIA using only three control signals. Moreover, sw1-sw6 switch only once per frame instead of once per phase, which reduces charge injection and power consumption.
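A small behavioral model makes the alternating sign and the double buffering concrete. This is our own idealized sketch (lossless switches, 50 fF capacitors and a 10 µs half-frame taken from the design values given later), not a circuit simulation:

```python
# Behavioral model of the double-buffered CTIA in frame-difference mode.
C = 50e-15       # integration capacitance (assumed, per the design values)
dt_s = 10e-6     # store/subtract phase: half of the 20 us integration time

def run_frames(currents):
    """Return the frame-difference voltage after each frame.
    Odd frames integrate positively, even frames negatively, so the
    sign of the difference alternates (the A-THR takes |.| anyway)."""
    v = [0.0, 0.0]                      # voltages across Cint1, Cint2
    out = []
    for k, i_k in enumerate(currents, start=1):
        sign = +1 if k % 2 == 1 else -1
        store, sub = (0, 1) if k % 2 == 1 else (1, 0)
        # Store phase: the read-out capacitor was reset, so it starts at 0.
        v[store] = sign * i_k * dt_s / C
        # Subtract phase: integrate on top of the previous frame's value.
        v[sub] += sign * i_k * dt_s / C
        out.append(v[sub])              # frame-difference output
    return out

# Constant 2 nA gives a ~zero difference; a jump to 5 nA shows up as 0.6 V.
out = run_frames([2e-9, 2e-9, 5e-9])
```

With a constant photocurrent the output settles at zero, while the 3 nA step between frames 2 and 3 produces (5 nA − 2 nA) · 10 µs / 50 fF = 0.6 V, matching Equation (5).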
Because the operation of the smart pixel divides the integration time into two stages of equal duration (store and subtract), it effectively reduces the integration time to 50% of the conventional mode. This decreases the signal-to-noise ratio of the imager but allows it to compute the frame-differences of all the pixels in the image in parallel.

A-THR
The A-THR core determines whether the absolute value of the frame difference computed by the smart pixel exceeds an application-defined threshold. Figure 9 shows the A-THR circuit. A row- and column-select circuit scans the smart-pixel array, reading the output voltage of each CTIA and connecting it to the input V pixel of the A-THR module. Because the frame-difference output of the smart pixel has a different sign for even and odd frames, the A-THR core uses two comparators, OA1 and OA2. The reference voltages V + THR and V − THR are used to compare the absolute value of the frame-difference voltage to the threshold voltage V th, such that V + THR = V bias + V th and V − THR = V bias − V th, where V bias = V dd /2 is the bias voltage of the CTIA in frame-difference mode.

Figure 9. Architecture of the A-THR. An input comparator compares the frame difference for each pixel to two reference voltages. The comparator OA1 outputs a logical 1 when V pixel > V + THR, and the comparator OA2 outputs a logical 1 when V pixel < V − THR. A logical OR outputs a logical 1 if one of the two conditions is met.
The comparator OA1 outputs a logical 1 when V pixel > V + THR and 0 otherwise, while OA2 outputs a logical 1 when V pixel < V − THR and 0 otherwise. These two comparators independently indicate when V pixel is greater than V + THR or less than V − THR. An OR gate outputs a logical 1 when either OA1 or OA2 outputs a 1, thus indicating that the frame difference is greater than the supplied threshold. These logic values are generated for all the columns of the array in parallel and stored in a shift register. While the A-THR blocks process the next row, the shift register serially outputs the values from the previous row to the digital coprocessor.
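The window-comparator logic reduces to two comparisons and an OR; a minimal sketch, assuming the 3.3 V supply of the 0.35 µm process (so V bias = 1.65 V):

```python
VDD = 3.3
V_BIAS = VDD / 2      # CTIA bias in frame-difference mode

def a_thr(v_pixel, v_th):
    """Return 1 when |v_pixel - V_BIAS| exceeds v_th.
    Mirrors the two comparators (OA1, OA2) followed by an OR gate."""
    v_plus = V_BIAS + v_th    # V+THR, checked by OA1
    v_minus = V_BIAS - v_th   # V-THR, checked by OA2
    return int(v_pixel > v_plus or v_pixel < v_minus)
```

A pixel 0.5 V above or below the bias trips a 0.3 V threshold, while a 0.1 V excursion does not, regardless of the sign of the frame difference.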

Digital Coprocessor
The coprocessor adds programmability to the SIS by processing the output in frame-difference mode using reconfigurable digital logic. In our current implementation, the coprocessor implements the morphological opening operation and a connected components algorithm that detects objects and computes their bounding boxes. Figure 10 shows the architecture of the object location coprocessor. The data flow of the digital coprocessor is as follows: the object location coprocessor receives a 1-bit pixel stream from the A-THR module. Then, the coprocessor computes morphological erosion and dilation operations in a 3 × 3-pixel window and outputs the resulting binary image. The image pixels are also processed by a connected components module, which identifies the objects in the image using connected pixels in a single pass and computes the bounding boxes of the objects.  Figure 10. Architecture of the digital coprocessor. The coprocessor receives a stream of movement pixels, applies morphological opening operation (erosion+dilation), and computes the connected components of the resulting binary image and their bounding boxes.
The digital implementation of the 1-bit image erosion, defined in Equation (1), is shown in Figure 11. We erode with a 3 × 3 window by calculating the logical AND of all pixels in the window. We implemented the sliding window using two line buffers and a 2 × 3 array of Flip-Flops (FFs). Figure 12 shows the implementation of the 1-bit image dilation, defined in Equation (2). Dilation uses the same architecture as erosion but replaces the AND gates with OR gates.

Figure 13 shows the architecture of the connected components module. The Neighborhood Context block uses a line buffer to define a 2 × 2-pixel window that contains the current pixel and its north, northwest, and west neighbors. The block output indicates whether the current pixel is an isolated movement pixel, to which of its neighbors it is connected, or whether it is not a movement pixel. The Label Selector block assigns a new or existing label to the current pixel based on its neighboring labels, using a line buffer with label information. Because new pixels can join disconnected regions, the module uses an equivalence table to merge connected components. The Label Management block updates the equivalence table using the information from the neighborhood context. As new pixels are added to the existing connected components, the coordinates of their bounding boxes are updated using the contents of the equivalence table to consolidate regions as they merge.

Figure 13. Architecture of the connected components module. First, it analyzes the current pixel and its north, northwest, and west neighbors, determining which movement pixels are connected. The module assigns a label to the current pixel and maintains an equivalence table to merge connected components in the image. The module also computes the bounding boxes for all connected components and merges them using the equivalence table.
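The line-buffer datapath for the 3 × 3 window can be modeled in software as follows. The function name is our own, and emitting 0 at border pixels (where the window is incomplete) is one possible border policy, assumed here for simplicity:

```python
def stream_window3x3(stream, width, height, op):
    """Slide a 3x3 window over a 1-bit raster-scan stream using two
    line buffers, mirroring the hardware datapath. `op` is min for
    erosion (logical AND) or max for dilation (logical OR)."""
    rows = [[0] * width, [0] * width]        # line buffers: rows r-2, r-1
    out = [[0] * width for _ in range(height)]
    it = iter(stream)
    for r in range(height):
        cur = [next(it) for _ in range(width)]
        if r >= 2:
            # The window centered on row r-1 is complete once row r arrives.
            for c in range(1, width - 1):
                window = rows[0][c-1:c+2] + rows[1][c-1:c+2] + cur[c-1:c+2]
                out[r - 1][c] = op(window)
        rows = [rows[1], cur]                # shift the line buffers
    return out

# A 3x3 block of ones erodes to its single center pixel.
img = [[1, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 0],
       [0, 0, 0, 0]]
flat = [p for row in img for p in row]
eroded = stream_window3x3(flat, 4, 4, min)
```

Chaining the helper with `min` and then `max` over the intermediate stream implements the opening used by the coprocessor; in hardware, the two stages run in a pipeline on the same pixel stream.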

Smart Pixel and A-THR Implementation
To implement the smart pixel shown in Figure 6, we used minimum-size transistors for switches sw0-sw6. Switch sw0 uses an NMOS transistor, and switches sw1-sw6 are full transmission gates. The OPAMP in the custom CTIA and the comparators in the A-THR (Figure 9) use the design that we presented in [69]. The OR gate is a standard CMOS logic circuit. Figure 14 shows the physical layout of the smart pixel as described above. This design uses a 0.35 µm mixed-signal process, 950 aF/µm 2 poly1-poly2 capacitors, and a supply voltage of 3.3 V. The post-layout simulations with parasitic extraction presented in Section 5.2 were obtained using this process, for which we have access to the necessary technology files. The integration time of our smart pixel is 20 µs, and the maximum current that the photodetector delivers is 8 nA. With this, the two integration capacitors have an equal capacitance of 50 fF and a size of 7.7 µm × 7.7 µm. The area of the entire smart-pixel circuit is 32 µm × 23 µm, which achieves a fill factor of 28% in a standard 32 µm × 32 µm pixel [40]. In comparison, a conventional CTIA circuit designed on the same process has a fill factor of 47.6%.

Figure 14. Diagram of the smart-pixel layout. We used the design shown in Figure 6, implemented on the TSMC 0.35 µm mixed-signal process. The Opamp and integration capacitors are implemented using two poly layers.
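As a sanity check on these design values, plugging them into Equation (4) shows that the maximum photocurrent integrated over the full integration time nearly fills the supply range:

```python
# Worked check of Equation (4) with the design values stated above.
I_max = 8e-9    # maximum photodetector current: 8 nA
dt = 20e-6      # integration time: 20 us
C_int = 50e-15  # integration capacitance: 50 fF

v_out = I_max * dt / C_int
# v_out = 3.2 V, just inside the 3.3 V supply of the 0.35 um process
```

This is consistent with sizing the 50 fF capacitors to use the available output swing at full-scale illumination.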
In order to assess the impact of technology scaling on the fill factor of the smart pixel, we redesigned the pixel using a 0.18 µm TSMC process, a technology commonly used in the literature [29,31,40]. For this technology, we used a supply voltage of 1.8 V and metal capacitors with a capacitance of 2 fF/µm 2. The size of the circuit is 14 µm × 19 µm, which achieves a fill factor of 74% in the same 32 µm × 32 µm pixel, compared to 86.3% with the conventional CTIA.

Simulation Results
To validate our smart-pixel circuit, we simulated it using a post-layout simulation of the circuit in Figure 14 on the 0.35 µm mixed-signal process. The simulation plot in Figure 15 depicts the main control signals NegInt, PosInt, and BuffSL (shown in Figure 5) and the voltage across the capacitors Cint1 and Cint2 during two consecutive video frames, while the SIS operates in frame-difference mode. During the odd frame, NegInt and PosInt are set to 1 and 0, respectively, to configure the CTIA to operate as in Figure 8a,b. During the even frame, the circuit operates as in Figure 8c,d by setting NegInt and PosInt to 0 and 1. Within a frame, BuffSL switches the operation of the CTIA between the store and subtract modes. During the store phase of an odd frame, the capacitor voltage V_Cint1 starts at zero and increases linearly with the photodetector current, while the voltage V_Cint2 stores the pixel value of the previous frame. During the subtract phase, V_Cint1 stays constant, and V_Cint2 decreases linearly with the photodetector current. At the end of the phase, the output voltage of the CTIA is V_CTIA = V_Cint2 + V_bias, which represents the difference between the pixel value in the odd frame and that in the previous even frame. During the next even frame, the roles of the capacitors are reversed, and the circuit output is V_CTIA = V_Cint1 + V_bias, which represents the difference between the pixel in the current frame and that in the previous odd frame.
Figure 15 shows that when BuffSL switches the capacitors Cint1 and Cint2 in the CTIA, the capacitor voltages exhibit the effects of charge injection. This effect can be compensated in the A-THR block of each column by adjusting the threshold voltages V_THR+ and V_THR−. The plot in Figure 16 shows a post-layout simulation of the A-THR comparator shown in Figure 9.
As described above in Section 4.1, after computing the frame difference during the subtract phase, all pixels contain their respective frame-difference values. Figure 16 shows the input voltage of the A-THR in one column and its output voltage while pixel values are read in 10 consecutive rows. During the readout and comparison phase, the controller sequentially reads the CTIA outputs of each row in the column. The column voltage is labeled V_column in Figure 9. In this experiment, we sampled each row for 1 µs and circled in red the value of the pixels in each row. If V_column is outside the threshold window, i.e., V_column > V_THR+ or V_column < V_THR−, the A-THR outputs a logic 1. Otherwise, it outputs a logic 0. The shift register that captures the outputs of the A-THRs in each column operates at 320 MHz. The maximum frame rate of the smart-pixel array is determined by the time to compute the frame difference in the array (20 µs) plus the time to sequentially read the 240 rows, whose columns are read in parallel (240 µs). Therefore, the array can achieve a maximum frame rate of 3846 fps. At this frame rate, the smart pixel has a power consumption of 8.15 µW.
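The timing budget behind the 3846 fps figure can be reproduced directly from the two phases described above:

```python
# Arithmetic behind the maximum frame rate quoted above: one frame-difference
# computation (20 us) plus a row-sequential readout of 240 rows at 1 us each.

t_frame_diff = 20e-6      # s, time to compute the frame difference in the array
t_readout = 240 * 1e-6    # s, 240 rows sampled for 1 us each (columns in parallel)

fps = 1.0 / (t_frame_diff + t_readout)
print(round(fps))         # -> 3846
```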

FPGA Implementation of the Digital Coprocessor
We used the SystemVerilog Hardware Description Language (HDL) at the Register-Transfer Level (RTL) to implement and validate the architecture of the digital coprocessor using the Xilinx Vivado 2020.1 development platform. In order to showcase the reduction in digital hardware resources enabled by the SIS, we targeted the low-cost, entry-level Xilinx Artix-7 XC7A35T FPGA. We compare our results to an FPGA-based Fully Digital Implementation (FDI) of the algorithm that uses a conventional image sensor. The FDI operates on 8-bit gray-scale pixels. All implementations use 5-bit labels and a 32-entry equivalence table in the connected components module. We consider two test scenarios with different input image resolutions: 320 × 240 and 640 × 480 pixels. Table 1 shows the resource utilization of both implementations for both image resolutions. Our proposed coprocessor architecture using the SIS requires 5930 and 3929 Lookup Tables (LUTs) for the 640 × 480- and 320 × 240-pixel implementations, respectively, which represents 28.5% and 18.8% of the LUTs available on the XC7A35T FPGA. Our implementations also utilize 12% and 7.7% of the available FFs. No on-chip memory blocks (BRAMs) are required in our SIS-based architecture. In contrast, the FDI needs a frame buffer to compute the temporal differences between pixels in consecutive frames. We implemented this buffer with BRAM because an external memory chip would limit the performance of the algorithm and increase the overall cost of the system. Indeed, the 320 × 240-pixel FDI requires only a small increase in the utilization of LUTs and FFs, but uses 38% of the available BRAM. Moreover, the 640 × 480-pixel FDI requires more BRAM resources than those available on the FPGA and thus could not be implemented on the selected device. The small hardware utilization of our SIS-based coprocessor leaves ample resources available, even on an entry-level device such as the XC7A35T FPGA.
These resources could be used to implement additional image-processing algorithms on the output produced by the SIS.
Table 2 shows the power consumption of the coprocessor estimated by Xilinx Vivado. Operating at the 20 MHz clock frequency imposed by the sampling rate of the SIS, the power consumption of our coprocessor is 27 mW and 34 mW for the 320 × 240- and 640 × 480-pixel inputs, respectively. The coprocessor can operate at up to 125 MHz, which enables it to process up to 1627 fps on 320 × 240-pixel images while consuming 58 mW and up to 406 fps on 640 × 480-pixel images while consuming 61 mW. In comparison, the FDI with a 320 × 240-pixel input consumes 39 mW at 20 MHz and 97 mW at its maximum clock frequency of 104 MHz, where the power consumption of the frame buffer, implemented as on-chip memory, is nearly 50% of the total dynamic power. At this frequency, the FDI can operate at up to 1354 fps.
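The frame-buffer requirement that separates the two designs can be estimated with simple arithmetic. This sketch assumes one 8-bit entry per pixel (our assumption) and uses the 50 36-Kb block RAMs of the XC7A35T as the on-chip capacity:

```python
# Rough sizing of the frame buffer the FDI needs for temporal differences,
# assuming one 8-bit pixel per buffer entry (our assumption). The XC7A35T
# provides 50 36-Kb block RAMs, i.e., 1800 Kb of on-chip memory in total.

BRAM_KB = 50 * 36                       # Kb, total BRAM capacity of the XC7A35T

def frame_buffer_kb(width, height, bits_per_pixel=8):
    """Size of a single frame buffer in kilobits."""
    return width * height * bits_per_pixel / 1024

qvga = frame_buffer_kb(320, 240)        # -> 600.0 Kb, fits in on-chip BRAM
vga = frame_buffer_kb(640, 480)         # -> 2400.0 Kb, exceeds the 1800 Kb available
```

This is consistent with the result above: the 640 × 480 frame buffer alone exceeds the on-chip memory of the device, before counting the line buffers and label storage of the rest of the pipeline.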

SIS Object Location Performance
To test the performance of the motion-based object-location algorithm on our SIS, we used the OSU Thermal Pedestrian [70] and the Terravic Motion IR [71] datasets. Both contain video sequences in the thermal IR range. Table 3 summarizes the image size in pixels, the number of video sequences, and the total number of images in each dataset. To evaluate the object-location performance of the SIS on each dataset, we used a simulation of the complete SIS circuit with post-parasitic extraction and the FPGA-based coprocessor described in Section 5.1. We developed a software implementation of the algorithm using floating-point arithmetic and used it as a baseline to evaluate the performance of the algorithm on the SIS. Figure 17 shows a visual comparison of the intermediate stages of the algorithm in the software and in the analog section of the SIS. Figure 17a shows the input image, taken from IR security footage in the OSU dataset, which shows two pedestrians crossing the street. Figure 17b,c show the absolute frame difference and thresholding computed by the software, and Figure 17d,e show the same stages of the algorithm output by the smart-pixel array and the A-THR module in the SIS. The figure shows that both implementations produce visually similar results, although the SIS output loses resolution, mainly due to the reduction in integration time.
We quantified the performance of the object-location algorithm in the SIS implementation using the software implementation as a baseline. We used the Intersection over Union (IoU) index to estimate the accuracy of each bounding box output [72] and the average precision (AP), which measures the fraction of the objects in the image that are correctly located by the algorithm [72].
The IoU is defined in Equation (6) as

IoU = area(SW ∩ HW) / area(SW ∪ HW), (6)

where SW is the ground truth given by the bounding box computed by the software implementation, and HW is the same bounding box computed by our SIS hardware implementation.
Table 4 compares the smart-pixel array proposed in this work to other designs reported in the literature, discussed in Section 2, that implement object detection on an SIS [30,40,56]. We also include our own previous SIS designed for face recognition [43], which also uses an iROIC to implement pixel-level operations. The SIS presented in [40] detects objects using pixel-level processing to compute HOG features in an 8 × 8-pixel window. The processing circuits reduce the fill factor to 19%. The rest of the object detection algorithm is performed in a digital coprocessor, and the design achieves an AP of 0.84. To improve the fill factor, the SISs in [30,56] move most or all of the computation to the column level or to a coprocessor external to the imager. The SIS presented in [30] implements motion detection only to activate the digital coprocessor that performs object detection. It combines pixel- and column-level processing to implement motion detection and achieves a fill factor of 30%, despite sharing capacitors between horizontally adjacent pixels. The coprocessor achieves an AP of 0.94. The SIS presented in [56] uses a digital coprocessor that operates at the column level, using an ADC for each column. Although it adds no additional circuitry at the pixel level, the die area used by the ADCs and coprocessors limits the fill factor to 60%. The digital coprocessor achieves an AP between 0.7 and 0.87, depending on the type of object detected.
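A minimal sketch of the IoU metric in Equation (6), assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max) (the coordinate convention is our assumption, not taken from the paper):

```python
# Minimal IoU sketch for axis-aligned bounding boxes (x_min, y_min, x_max, y_max);
# the coordinate convention is our assumption.

def iou(sw, hw):
    """Intersection over Union of the software (SW) and hardware (HW) boxes."""
    ix = max(0.0, min(sw[2], hw[2]) - max(sw[0], hw[0]))   # overlap width
    iy = max(0.0, min(sw[3], hw[3]) - max(sw[1], hw[1]))   # overlap height
    inter = ix * iy
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(sw) + area(hw) - inter
    return inter / union if union else 0.0

# Identical boxes give an IoU of 1.0; a box shifted by half its width gives 1/3.
```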

Comparison to Related Work
Compared to the works discussed above, our SIS achieves a frame rate that is significantly higher than those reported in the literature. This is mainly due to the parallelism exploited by our design at the pixel level and the fact that our column-level circuits have a single-bit digital output, which shortens the readout time. Table 4 also shows that our fill factor is higher than those reported in the related work when comparable CMOS processes are used. The main reason for this is that our SIS uses iROICs at the pixel level to compute the frame differences, which only add a capacitor and six extra switches to the conventional integration circuit. Finally, it is important to note that our design uses a CTIA to perform integration, which allows us to operate in the IR spectrum and in low-light environments. The works reported in [30,40,56] only operate in the visible spectrum, but this allows them to use simpler pixel architectures with smaller die area.
The final column of Table 4 reports our own previous SIS [43] designed for face recognition, which uses an iROIC approach similar to that of this work. Consequently, the design achieves a similar fill factor, with slightly less area overhead because it uses only four switches and one capacitor per pixel. However, its maximum frame rate is significantly lower because it requires multiple reads per pixel to compute the image features at the column level.
Finally, we estimated a power consumption of 7.5 µW per pixel at 3846 fps for our design, which is higher than the power per pixel reported for the other works in Table 4, although at a higher frame rate. The static power of the OPAMP in the CTIA is the main source of power dissipation and could be reduced by temporarily powering down the CTIA when the array operates at a lower frame rate.

Conclusions
In this paper, we have presented the architecture and hardware implementation of an SIS for motion-based object location. The SIS uses a smart-pixel array with local memory to compute frame differences in the analog domain during pixel-current integration with high parallelism. It also uses an analog comparator and a digital coprocessor that computes image opening and connected components to detect objects from the frame-difference output of the smart-pixel array. We designed the smart-pixel array and comparator at the layout level using the TSMC 0.35 µm and 0.18 µm mixed-signal CMOS processes, and the digital coprocessor at the RTL level using SystemVerilog. We validated the design using post-layout simulations of the analog section and an FPGA-based implementation of the coprocessor on a Xilinx XC7A35T FPGA.
Our results show that, using a 32 µm × 32 µm pixel, our design reduces the fill factor from 47.6% to 28% on the 0.35 µm process and from 86.3% to 74% on the 0.18 µm process, compared to a traditional imager. Because the integration time is reduced by 50% in frame-difference mode, the pixel resolution is decreased. However, the circuit can still detect objects with a mean IoU of 0.92 and an AP of 0.9 averaged over two thermal IR datasets, using a software implementation as a baseline.
Computing the frame differences on the smart-pixel array eliminates the need for a frame buffer in the digital coprocessor. Indeed, our results show that the FPGA coprocessor in our SIS does not use on-chip memory blocks, while a fully digital implementation of the algorithm requires 19 memory blocks for 320 × 240-pixel images and 75 blocks for a 640 × 480-pixel input. The latter cannot be implemented on the entry-level XC7A35T FPGA, which features only 50 memory blocks. The digital coprocessor attached to the SIS also achieves a higher maximum clock frequency, and therefore a higher frame rate, than the digital implementation of the algorithm.
When we use integration capacitors as double-buffer memory to compute frame differences, we reduce the penalty on the fill factor compared to circuits that operate with readout-circuit output. Furthermore, although our smart pixel effectively uses half of the integration time, which could reduce the signal-to-noise ratio, our results are comparable to a software implementation of the motion-based object location algorithm.
The on-imager computation of our SIS is convenient in contexts where privacy is required, because it eliminates the need to continuously transmit video information over a communication channel. Instead, the SIS can deliver an alarm only when objects in motion are detected. Another example is pairing our SIS with a high-resolution camera: the SIS could detect objects based on motion and send the bounding boxes to an external controller, which could use them to activate the capture of that portion of the high-resolution image.

Data Availability Statement: This study uses the following publicly available datasets: OSU Thermal Pedestrian Database http://vcipl-okstate.org/pbvs/bench/Data/01/download.html, and Terravic Motion Infrared Database http://vcipl-okstate.org/pbvs/bench/Data/05/download.html. All datasets were last accessed in May 2022.

Conflicts of Interest:
The authors declare no conflict of interest.