A Quantized CNN-Based Microfluidic Lensless-Sensing Mobile Blood-Acquisition and Analysis System

This paper proposes a microfluidic lensless-sensing mobile blood-acquisition and analysis system. For a better tradeoff between accuracy and hardware cost, an integer-only quantization algorithm is proposed. Compared with floating-point inference, the proposed quantization algorithm makes a tradeoff that enables miniaturization while maintaining high accuracy. The quantization algorithm allows the convolutional neural network (CNN) inference to be carried out using integer arithmetic and facilitates hardware implementation with area and power savings. A dual configuration register group structure is also proposed to reduce the interval idle time between every neural network layer in order to improve the CNN processing efficiency. We designed a CNN accelerator architecture for the integer-only quantization algorithm and the dual configuration register group and implemented them in field-programmable gate arrays (FPGA). A microfluidic chip and mobile lensless sensing cell image acquisition device were also developed, then combined with the CNN accelerator to build the mobile lensless microfluidic blood image-acquisition and analysis prototype system. We applied the cell segmentation and cell classification CNN in the system and the classification accuracy reached 98.44%. Compared with the floating-point method, the accuracy dropped by only 0.56%, but the area decreased by 45%. When the system is implemented with the maximum frequency of 100 MHz in the FPGA, a classification speed of 17.9 frames per second (fps) can be obtained. The results show that the quantized CNN microfluidic lensless-sensing blood-acquisition and analysis system fully meets the needs of current portable medical devices, and is conducive to promoting the transformation of artificial intelligence (AI)-based blood cell acquisition and analysis work from large servers to portable cell analysis devices, facilitating rapid early analysis of diseases.


Introduction
Currently, cell analysis plays an important role in the diagnosis and efficient evaluation of diseases, and the demand for mobile real-time detection of cells for personalized biomedical diagnosis will likely increase in the future [1]. However, with the traditional method, test samples must be prepared on glass slides, and manual analysis by counting cells with a microscope is required. There are many problems with this, such as the large equipment volume, dependence on the operator's professional knowledge, and the large differences in inspection results between different people [2]. The large equipment volume makes the instruments inconvenient to move and expensive, preventing analysis anytime and anywhere. Furthermore, all analysis depends on professional operators, which is also costly and can easily lead to subjectivity among human assessors. Therefore, miniaturization and automation of blood analysis equipment is the expected development trend. A more meaningful attempt is to quantize model architectures, an approach that has already been proven to deliver good performance in both latency and accuracy.
In this paper, a mobile CNN lensless-sensing blood cell analysis hardware system was built to address the above issues. Our specific contributions are:
• A quantization algorithm for mobile hardware implementation, which supports different kernels with different quantization parameters and offers an optimal tradeoff between classification accuracy and hardware cost. (Section 2)
• A quantization circuit architecture for the quantization scheme. (Section 3)
• A dual register group structure that allows pipelining of a quantized CNN architecture, thereby increasing its throughput. (Section 4)
• A microfluidic chip and mobile lensless blood cell image acquisition device, combined to build an entire mobile lensless microfluidic blood image acquisition and analysis system. (Section 5)
• Implementation of the quantization architecture in an FPGA, and application of the cell segmentation and cell classification CNNs in the system to demonstrate a blood cell segmentation and classification analysis task. (Section 6)
• The first miniaturization of a quantized CNN-based microfluidic lensless-sensing white blood cell (WBC) analysis system. This system achieves a significant tradeoff that enables miniaturization while retaining accuracy, which promotes research on mobile artificial intelligence (AI) diagnosis equipment.

Quantization Algorithm
This section describes our quantization algorithm, which is optimized for hardware implementation. In order to demonstrate the mapping relationship between the algorithm and the circuit more clearly, the whole algorithm description process is consistent with the data flow of the hardware in the next section.

Quantization Scheme
We used floating-point arithmetic in the CNN training, and integer arithmetic in the inference process, and they maintained a high degree of correspondence with each other.
The correspondence between the bit representation of a value (denoted q below) and its interpretation as a mathematical real value (denoted r below) is defined as:

q_d = r_d · 2^(n_d) · s_d + o_d (1)

q_w = r_w · 2^(n_w) · s_w (2)

Equation (1) is the feature data quantization formula, and Equation (2) is the weight quantization formula, where the suffix d denotes feature data and the suffix w denotes the weight.
For X-bit quantization, s, q and o are quantized as X-bit integers. For example, in 8 bit quantization, s, q and o are quantized as 8 bit integers.
The constant s (for "scale") is a positive integer, and the combined factor s·2^(−n) can typically be represented as a floating-point quantity.
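A minimal Python sketch of the quantization formulas above, as reconstructed from the derivations in the following subsections (the function names are ours, and round-to-nearest is an assumption; the hardware uses pure integer arithmetic):

```python
def quantize_feature(r, s, n, o):
    """Feature-data quantization, Eq. (1): q_d = r_d * 2^n_d * s_d + o_d (rounded)."""
    return round(r * (2 ** n) * s) + o

def quantize_weight(r, s, n):
    """Weight quantization, Eq. (2): q_w = r_w * 2^n_w * s_w (no offset, rounded)."""
    return round(r * (2 ** n) * s)

def dequantize_feature(q, s, n, o):
    """Recover the approximate real value from its integer code."""
    return (q - o) / ((2 ** n) * s)
```

For instance, with s = 3, n = 5, o = 10, the real value 0.3 maps to the integer 39 and dequantizes back to roughly 0.302, illustrating the small rounding error the scheme trades for integer-only inference.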

Feature Quantization Parameters o_d, s_d, n_d Calculation
We now describe the calculation of the parameters s, n and o. The quantization methods are different for feature data quantization and weight quantization.
First, we obtained the maximum value (denoted r_dmax) and minimum value (denoted r_dmin) of the feature data range. If we take 8 bit quantization as an example (int8), then the maximum and minimum quantized values become 127 and −127, respectively. The corresponding quantized values, q_dmax and q_dmin, represent the maximum and minimum of the quantized feature data.
Subtracting Equation (3) from (4) results in:

(r_dmax − r_dmin) · 2^(n_d) · s_d = 254 (5)

Rearranging Equation (5) then yields:

2^(n_d) · s_d = 254 / (r_dmax − r_dmin) (6)

From Equation (6), the parameters s_d and n_d for the feature data can be calculated. Thereafter, the resulting parameters n_d and s_d can be inserted into Equations (3) and (4) to compute the parameter o_d.
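The paper computes these parameters offline; the following sketch shows one plausible way to solve Equation (6) for n_d and an 8-bit integer s_d, and then Equation (3) for o_d. The search strategy (keeping the rounded s in [64, 127] for precision) is our assumption, not the paper's stated procedure:

```python
def feature_quant_params(r_min, r_max):
    """Solve Eq. (6): 2^n_d * s_d = 254 / (r_dmax - r_dmin), with s_d an
    8-bit integer.  The choice of n below is an assumption: we pick the n
    that keeps the rounded s in [64, 127] to maximize precision."""
    target = 254.0 / (r_max - r_min)        # required combined scale 2^n * s
    n = 0
    while target / (2 ** n) > 127:          # shrink s into the 8-bit range
        n += 1
    while target / (2 ** n) < 64 and n > -31:
        n -= 1                              # grow s for precision
    s = round(target / (2 ** n))
    # Eq. (3): q_dmax = r_dmax * 2^n * s + o = 127  =>  o = 127 - r_dmax * 2^n * s
    o = round(127 - r_max * (2 ** n) * s)
    return s, n, o
```

For a symmetric range [−1, 1] this yields s = 127, n = 0, o = 0; for the one-sided range [0, 2] it yields the same scale with o = −127, so that r = 0 maps to q = −127.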

Weight Quantization Parameters s_w, n_w Calculation
Our quantization algorithm uses different quantization parameters for different weight cube arrays. Next, we describe the calculation of the parameters s_w and n_w. First, we found the maximum value (denoted r_wmax) and minimum value (denoted r_wmin) in every weight cube array. If we take 8 bit quantization as an example (int8), then the maximum and minimum quantized values become 127 and −127, respectively. The corresponding quantized values, q_wmax and q_wmin, represent the maximum and minimum of the quantized weights.
Subtracting Equation (7) from (8) results in:

(r_wmax − r_wmin) · 2^(n_w) · s_w = 254 (9)

Rearranging Equation (9) then yields:

2^(n_w) · s_w = 254 / (r_wmax − r_wmin) (10)

From Equation (10), the parameters s_w and n_w for the weight cube array can be calculated. Every weight cube array has its own quantization parameters.

Convolution Calculation
From the definitions of the feature data quantization (1) and weight quantization (2) formulas, we have:

q_d = r_d · 2^(n_d) · s_d + o_d (11)

q_w = r_w · 2^(n_w) · s_w (12)

Then, the convolution MAC (multiply-accumulate) calculation becomes:

q_w · q_d = q_w · (r_d · 2^(n_d) · s_d + o_d) = r_w · r_d · 2^(n_w + n_d) · s_w · s_d + q_w · o_d (13)

In Equation (13), we denote the convolution result q_w · q_d as q_conv, q_w · o_d as o_conv, r_w · r_d as r_conv, s_w · s_d as s_conv, and n_w + n_d as n_conv. Then, Equation (13) can be rewritten as:

q_conv = r_conv · 2^(n_conv) · s_conv + o_conv (14)
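The MAC stage of Equation (13) can be sketched as an integer dot product; for a multi-element accumulation the offset contribution generalizes to sum(q_w)·o_d, since o_d is shared by all feature values (the function name and vector form are ours):

```python
def quantized_dot(q_w_vec, q_d_vec, o_d):
    """Integer-only MAC per Eq. (13): accumulate sum(q_w * q_d) in wide
    precision, and report the offset term sum(q_w) * o_d separately so a
    later stage can remove it."""
    acc = sum(w * d for w, d in zip(q_w_vec, q_d_vec))   # wide (e.g. 32-bit) accumulator
    o_conv = sum(q_w_vec) * o_d                          # offset contribution
    return acc, o_conv
```

For example, weights [1, 2] against features [3, 4] with o_d = 5 give an accumulator of 11 and an offset term of 15; only integer multiplies and adds are involved.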

Unify Weight Cubes Convolution Result
Because our quantization algorithm uses different quantization parameters for different weight cube arrays, after the convolution for each weight cube array, the convolution results of all weight cube arrays must be unified to the same quantization parameters.
As described above, the result of the convolution is Equation (14). Let the target quantization parameters be s_tgt, o_tgt, n_tgt. The quantization equation is implemented here as:

q_tgt = r_conv · 2^(n_tgt) · s_tgt + o_tgt (15)

Then we have:

q_tgt = (q_conv − o_conv) · 2^(n_tgt − n_conv) · s_tgt / s_conv + o_tgt (16)

We use s_chg, o_chg, n_chg to denote the quantization parameters for changing the current weight cube array convolution result to the target quantization parameters. From Equation (16), the values of s_chg, o_chg, n_chg can be calculated.
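In hardware, the rescaling of Equation (16) reduces to one integer multiply, one shift and one add per kernel cube. A behavioural sketch, assuming s_chg and n_chg have been pre-computed offline and n_chg is a non-negative right shift (names and the calling convention are ours):

```python
def unify(q_conv, o_conv, s_chg, n_chg, o_tgt):
    """Move one kernel cube's convolution result to the shared target scale:
    remove its own offset, apply the integer rescale factor s_chg with a
    right shift by n_chg, then add the target offset.  s_chg and n_chg are
    the offline-derived parameters from Eq. (16)."""
    return (((q_conv - o_conv) * s_chg) >> n_chg) + o_tgt
```

For example, a raw accumulator of 100 with o_conv = 4, s_chg = 3, n_chg = 2 and o_tgt = 1 unifies to 73, using only integer operations.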

Bias Operation Quantization
For models that use biases, there is an addition operation. The addition operation in batch normalization is also implemented here.
To remove the parameter o when the bias operation finishes, we add a factor 2^(sft) into the bias quantization formula.
After unifying the weight cube arrays to the target quantization parameters, the bias addition equation is given by Equation (20). The bias value is quantized as int8, following the quantization method of the feature data. From Equations (20)-(25), the parameters s_b, n_b, o_b and sft for the bias can be calculated. Equation (20) then becomes Equation (26), and the result of the bias addition can be written as Equation (27). The addition of the bias and the batch normalization eliminates the offset parameter o.
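In circuit terms, the quantized bias add is a single shifted addition; because the bias was pre-quantized with the extra 2^(sft) factor (and the negated offset folded in offline), this one add both applies the bias and cancels o. A sketch under the assumption that sft is a non-negative left shift:

```python
def add_bias(q_unified, q_bias, sft):
    """Bias / batch-norm addition after unification: the int8 bias code
    q_bias carries an extra 2^sft factor from its offline quantization, so
    one shifted integer add applies it at the accumulator's scale."""
    return q_unified + (q_bias << sft)
```

For example, an accumulator value of 100 with q_bias = 3 and sft = 4 becomes 148.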

Batch-Normalization Multiplication Quantization
For models that use batch normalization [27], there is a multiplication operation. Let q_m denote the quantized multiplication value; the multiplication operator quantization formula is:

q_m = r_m · 2^(n_m) · s_m (28)

Therefore, the multiplication equation is quantized as:

q_bias × q_m = r_bias · s_bias · 2^(n_bias) × r_m · 2^(n_m) · s_m = r_bias · r_m · 2^(n_bias + n_m) · s_bias · s_m (29)

The multiplication value is also quantized as int8. From Equations (21), (22), (24) and (29), the parameters s_m and n_m for the multiplication can be calculated.
Then, the result of batch normalization multiplication can be rewritten as:

Layer Output Quantization
After each layer's calculation, the bit width of the resulting value exceeds 8 bit, and each layer's output becomes the next layer's input data. Therefore, the calculation result needs to be quantized into int8 before being output.
Let q_out denote the quantized output value; the output value quantization formula is:

q_out = r_out · 2^(n_out) · s_out + o_out (31)

The q after the bias and batch normalization is given by Equation (32); inserting Equation (32) into (31) results in Equation (33). As with the quantization of the feature data in Section 2.2, let Equations (34) and (35) define the maximum and minimum cases; subtracting Equation (34) from (35) gives Equation (36). The output value is quantized as int8. Then, following the quantization method of the feature data, the parameters s_out, n_out, o_out for the output quantization can be calculated from Equations (33)-(36).
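The requantization back to int8 again reduces to a multiply, a shift and an add, followed by saturation to the int8 range used throughout (±127). A sketch, with our own names for the offline-derived rescaling factors:

```python
def quantize_output(q, s_out, n_out, o_out):
    """Requantize the wide post-bias/BN accumulator to int8 for the next
    layer: scale by the integer factor s_out, right-shift by n_out, add the
    output offset, then saturate to [-127, 127]."""
    v = ((q * s_out) >> n_out) + o_out
    return max(-127, min(127, v))
```

For example, an accumulator of 1000 with s_out = 3, n_out = 5, o_out = 10 requantizes to 103, while out-of-range values saturate at ±127.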

Quantization CNN Circuit Architecture
We propose a quantized CNN accelerator circuit architecture for the above quantization scheme. The architecture is shown in Figure 1. All the quantization parameters are calculated offline. The quantization parameters include o_x, n_x, s_x in each phase, and the sft values for the shifter module. The calculation method of each quantization parameter is described in detail in the previous algorithm section.
According to our proposed quantization Equations (1) and (2), the quantization operation is easy to implement in hardware. The structure of the quantization operation unit is shown in Figure 2.
The workflow of the quantized CNN acceleration circuit is as follows. The quantization parameters of all modules were stored in the configuration register unit. After the Feature_data and weight were input, the feature_quantizatier and weight_quantizatier quantized the feature data and weight into 8 bit.
The MAC_array and accumulation unit (ACCU) completed the convolution operation and the result became 32 bit. Since the quantization parameters of each kernel cube array were different, after each kernel convolution operation was completed, all the different quantization parameters of all kernel cube arrays needed to be unified by Unify_quantizatier.
The bias and bn operations were completed by the quantize adder and multiplier, and the value became 41 bit. Then, the activation operation was completed through the ReLU unit.
The value from the ReLU unit was quantized to 8 bit by the output quantization unit, and then the pooling operation was completed by the pooling_unit.
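The stages above can be chained into a small behavioural model of one accelerator layer. This is an illustrative sketch only (the parameter dictionary layout and names are ours, offset handling between stages is simplified, and pooling is omitted), not the paper's RTL:

```python
def layer_forward(features, weights, p):
    """Behavioural sketch of one layer in Figure 1's order:
    int8 quantization -> MAC -> unify -> bias/BN add -> ReLU -> output
    quantization with saturation to int8."""
    q_d = [round(r * (2 ** p["n_d"]) * p["s_d"]) + p["o_d"] for r in features]
    q_w = [round(r * (2 ** p["n_w"]) * p["s_w"]) for r in weights]
    acc = sum(w * d for w, d in zip(q_w, q_d))          # wide MAC accumulator
    acc = (acc * p["s_chg"]) >> p["n_chg"]              # unify kernel cubes
    acc = acc + (p["q_bias"] << p["sft"])               # quantized bias / BN add
    acc = max(acc, 0)                                   # ReLU
    out = ((acc * p["s_out"]) >> p["n_out"]) + p["o_out"]
    return max(-127, min(127, out))                     # saturate to int8
```

With trivial unit-scale parameters, a single 0.5 feature against a 1.0 weight flows through all stages using integer arithmetic only after the initial quantization step.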

Dual Register Group Structure
The classical neural network acceleration operation flow is that a layer requires the computational results of its preceding layer. This dependency is illustrated in Figure 3, indicating how substantial time is lost when waiting for the computation of a layer to finish.

The use of a circuit without a pipeline structure has a low throughput rate. For this case, we propose a dual-register pipeline circuit structure, which can further enhance the calculation efficiency of the acceleration circuit. Every module has two register groups (Reg_Grp_A and Reg_Grp_B) and a register group selection unit (RSU), which is primarily used to store the network parameter configuration of the current layer and the next layer. The structure is shown in Figure 4.
The dual register group structure allows each module to start the next layer immediately after the current layer calculation. As shown in Figure 5, the waiting time of each module is greatly reduced, and the calculation efficiency is improved.
A three-layer network is taken as an example to illustrate the working flow of the circuit:
(1) Configure the parameters of the first layer and the second layer into Reg_Grp_A and Reg_Grp_B, respectively.
(2) The parameters in Reg_Grp_A are applied to perform the current layer operation, and a completion signal is sent to the RSU when the operation is accomplished.
(3) After the RSU receives the first layer completion signal, the parameters in Reg_Grp_B are loaded for the second layer calculation, and the parameters of the third layer are configured into Reg_Grp_A.
(4) When the operation of the second layer is completed, the completion signal is sent to the RSU again.
(5) After the RSU receives the second layer completion signal, the parameters in Reg_Grp_A are loaded for the third layer calculation.
(6) After the calculation of the third layer is completed, the three-layer network calculation ends.
In this structure, each module is pipelined, which improves the computational efficiency of the CNN accelerator.
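The ping-pong behaviour of the RSU can be modelled in a few lines. This is a behavioural sketch of the mechanism described above (class and method names are ours), not the actual RTL:

```python
class DualRegGroup:
    """Ping-pong configuration registers: while the active group drives the
    current layer, the idle group can be loaded with the next layer's
    parameters, so no module waits for configuration between layers."""
    def __init__(self):
        self.groups = [None, None]   # Reg_Grp_A, Reg_Grp_B
        self.active = 0              # index of the group driving the current layer

    def load_idle(self, cfg):
        """Configure the idle register group with the next layer's parameters."""
        self.groups[1 - self.active] = cfg

    def layer_done(self):
        """RSU action on the layer-completion signal: swap groups and return
        the parameters that now drive the next layer."""
        self.active = 1 - self.active
        return self.groups[self.active]
```

Running the three-layer example from the text against this model: load layer 1 into the active group and layer 2 into the idle one; each completion signal swaps groups, freeing the idle side for the following layer's configuration.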

Cell Segmentation and Classification CNN
In medical image analysis, cell segmentation is one of the most basic and important research tasks. It is also the basic premise of cell image recognition and counting. In our previous studies, we proposed two algorithms: blood cell image-segmentation based on a convolutional neural network [28], and the classification of white blood cells by CNN [29]. We applied both to the accelerator, and introduce them in this section.
Pixel size is one of the key factors affecting image quality in lensless imaging systems. The resolution of the cell image captured by the lensless system is lower than that captured by an optical microscope. Moreover, the noise is larger and the boundary information is more ambiguous. Noise-sensitive image thresholding and cell segmentation methods are not suitable for this scenario.
In order to address these issues, we optimized the Unet [30] model. The optimized model is called CSnet (Cell Division Network), as shown in Figure 6. According to the imaging characteristics of a lensless imaging system, a series of convolution output features are compensated in time. The CSnet network structure also solves the problems of the long running time and high storage space requirement of the training system, which are not conducive to porting applications.
We compared CSnet, Unet and the adaptive threshold segmentation algorithm, and the results are shown in Figure 7. In Table 1, four evaluation criteria (Jaccard, confirm-index, precision and recall) are applied to quantify the segmentation results. In the field of image segmentation, a higher score means a better segmentation, and a score of 1 means perfect consistency. Unet and CSnet are compared on these four criteria in Table 1, and the better result was obtained by CSnet.
White blood cells (WBCs) are the nucleated cells in peripheral blood. Their quantity is only 0.1~0.2% that of red blood cells (RBCs). At present, the most advanced approach to image classification is the CNN, and this method fully meets the application requirements for detecting WBCs. Therefore, we also employed our proposed CNN architecture to classify WBCs. The specific network structure is shown in Figure 8.
We used the LeNet-5 structure to design the algorithm. The input image of the input layer was processed by the proposed cell segmentation algorithm. In the convolution layer, the size of the convolution kernel was 3 × 3 × 8 and the step size was 1. The network structure had only one layer of convolution.
Then, the ReLU activation function was used to improve the classification accuracy. The maximum pooling method was adopted, with a pooling window of 2 × 2 and a step size of 2. After the first pooling, the feature map size was 84 × 84 × 8, so the fully connected layer size was 56,448. The output size was 3, representing three kinds of WBCs: lymphocytes, monocytes, and neutrophils. The probability of each of the three kinds of WBCs was calculated using the softmax function.
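The stated dimensions can be sanity-checked with a short shape walk-through. The input resolution (168 × 168) and 'same' convolution padding are our assumptions, inferred from the stated 84 × 84 × 8 post-pooling feature map:

```python
def fc_input_size(h, w, channels, pool=2):
    """Feature-map elements feeding the fully connected layer after one
    'same'-padded 3x3 convolution (spatial size unchanged, assumption) and
    one 2x2 max pool with stride 2."""
    return (h // pool) * (w // pool) * channels
```

With a 168 × 168 input and 8 convolution kernels, the pooled map is 84 × 84 × 8, giving the 56,448 fully connected inputs quoted in the text.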

Mobile Microfluidic Acquisition Device
This section introduces the proposed microfluidic acquisition device, which consists of two parts: the microfluidic chip and the lensless image-sensing module. The system operates as follows:
(1) Place a microfluidic chip in the lensless image sensing device.
(2) A group of micropumps is used to control the flow rate to ensure that the lensless imaging device can acquire appropriate images.
(3) The detected samples are injected into the microfluidic chip, and the data are collected by the lensless image sensing module.

Process Flow of the Microfluidic Chip
The fabrication of the microfluidic chip can be divided into two parts: the fabrication of the silicon positive die and the fabrication of the microfluidic chip itself. The process for the silicon positive die is shown in Figure 9. First, we designed the required microchannel pattern and made the mask. Then the silicon wafers were cleaned by the standard cleaning process and the photoresist was coated. Finally, the masks and photoresist-coated silicon wafers were used for photolithography and development, completing the preparation of the positive die. We used SU-8 photoresist, which ensures that a highly suitable microfluidic chip channel can be fabricated.
After the fabrication of the silicon positive die, the microfluidic chip was fabricated by the process shown in Figure 10. The whole process can be divided into the following four steps.
(1) Using trimethylchlorosilane to perform surface modification on the positive die.
The fabricated microfluidic chip is shown in Figure 11 (compared with a renminbi (RMB) coin for scale).

Design of the Lensless Image-Sensing Module
The lensless image-sensing module mainly consists of a light source and a lensless imaging sensor. Light sources for lensless imaging can be divided into point light sources and surface light sources. A point light source has strong diffraction and is suitable for scenes requiring diffraction recovery, whereas a surface light source produces small diffraction and clear imaging. Considering the computational cost and imaging quality, the surface light source was chosen as the system light source. The commercial OV5640 sensor was selected as the image sensor. The sensor has a pixel size of 1.4 µm and a photosensitive area of 3673.6 µm × 2738.4 µm.
We used a 0.1 mm precision 3D printer to complete the production of the system. The final microfluidic acquisition device is shown in Figure 12.


Experimental Results and Discussion
To verify the effect of the proposed integer quantization algorithm, we used different quantization bit widths, from int2 to int8, in the WBC classification network inference. The classification results for the three kinds of white blood cells are shown in Figure 13. The more quantization bits, the higher the accuracy, and 8 bit quantization is close to saturation.

To verify the effect of the proposed integer quantization algorithm, we used different quantization bits in WBC classification network inference, and the quantization bit was from int2 to int8. The classification results of three kinds of white blood cells are shown in Figure 13. The more quantized the bits, the higher the accuracy, and 8 bit quantization is close to saturation. Then, we compared the accuracy of different quantization bits with the floating point in the WBC classification network. The result is shown in Table 2. As expected, 8 bit integer quantization can achieve 98.44% accuracy, which is similar to 32 bit floating-point; the reduction in accuracy is only 0.56%. To verify the benefits of using the integer quantization structure for the CNN accelerator hardware circuit, we also designed fp16 and int16 precision circuits with the same parallelism as our int8 quantization circuit. We used Synopsys design compiler (DC) to synthesize three kinds of quantization circuit, and the results of circuit area comparison are shown in Table 3. It shows that quantization greatly saves the area of hardware circuits, and the improvement is significant. We defined the figure of merit (FoM) as below: For example, int16 FoM is 1/(0.49 × 0.6702) = 3.045. The comparison result is shown in Table 3. Then, we compared the accuracy of different quantization bits with the floating point in the WBC classification network. The result is shown in Table 2. As expected, 8 bit integer quantization can achieve 98.44% accuracy, which is similar to 32 bit floating-point; the reduction in accuracy is only 0.56%. To verify the benefits of using the integer quantization structure for the CNN accelerator hardware circuit, we also designed fp16 and int16 precision circuits with the same parallelism as our int8 quantization circuit. We used Synopsys design compiler (DC) to synthesize three kinds of quantization circuit, and the results of circuit area comparison are shown in Table 3. 
It shows that quantization greatly reduces the hardware circuit area. We defined the figure of merit (FoM) as FoM = 1/(accuracy loss (%) × normalized circuit area), so that a smaller accuracy loss and a smaller area both yield a higher FoM. For example, the int16 FoM is 1/(0.49 × 0.6702) = 3.045; the full comparison is shown in Table 3. As a result, 8 bit integer quantization has the best FoM of 3.239, with an accuracy reduction of only 0.56% and an area reduction of 44.86%.
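The FoM arithmetic can be checked directly from the figures quoted above (0.49% accuracy loss and 0.6702 normalized area for int16; 0.56% accuracy loss and a 44.86% area reduction for int8):

```python
def fom(acc_loss_pct, norm_area):
    """Figure of merit: higher is better (less accuracy loss and less area)."""
    return 1.0 / (acc_loss_pct * norm_area)

print(round(fom(0.49, 0.6702), 3))       # int16
print(round(fom(0.56, 1 - 0.4486), 3))   # int8: area reduced by 44.86%
```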
To verify the effect of the dual configuration register group, we carried out experiments on the WBC classification network and compared the operation time of the dual-reg mode with that of the common mode at a 100 MHz operating clock. The simulated operation times are shown in Table 4. As expected, the dual-reg mode greatly increases the overlapped operation time between CNN layers, reducing the total operation time by 28.25%. We then implemented the CNN accelerator in an FPGA (Xilinx Zynq UltraScale+ ZCU102) at a maximum frequency of 100 MHz. Table 5 shows the experimental results of our int8 quantization alongside previous quantization works [31,32]. Compared with the int8 quantization method of [32], our method improves classification accuracy by 0.67%, while the hardware resource consumptions of look-up tables (LUT), flip-flops (FF), digital signal processors (DSP), and block random access memory (BRAM) are reduced by 18.54%, 24.87%, 71.07%, and 29.25%, respectively. Our method therefore achieves a smaller hardware cost and a better tradeoff between accuracy and hardware cost. We combined the FPGA CNN accelerator and the cell acquisition system to establish a complete CNN-based mobile lensless microfluidic blood acquisition and analysis prototype system, shown in Figure 14.
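The ping-pong principle behind the dual configuration register group can be sketched behaviorally: while the active register bank drives the current layer's computation, the next layer's configuration is preloaded into the shadow bank, so layers start back-to-back with no configuration stall. The class and field names below are illustrative only; the actual design is RTL, not software.

```python
class DualConfigRegs:
    """Behavioral sketch of a dual (ping-pong) configuration register group."""

    def __init__(self):
        self.banks = [None, None]
        self.active = 0                      # index of the bank currently in use

    def preload(self, cfg):
        self.banks[1 - self.active] = cfg    # write the shadow bank during compute

    def swap(self):
        self.active = 1 - self.active        # layer boundary: banks exchange roles
        return self.banks[self.active]

regs = DualConfigRegs()
regs.preload({"layer": 1, "kernel": 3})
assert regs.swap() == {"layer": 1, "kernel": 3}   # layer 1 starts immediately
regs.preload({"layer": 2, "kernel": 5})           # loaded while layer 1 runs
assert regs.swap() == {"layer": 2, "kernel": 5}   # no idle time at the boundary
```

In the common (single-bank) mode, the configuration write for each layer sits on the critical path between layers; the dual-bank structure hides it behind the previous layer's computation, which is the source of the 28.25% total-time reduction reported above.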
We then used our segmentation network to segment the blood images collected by the lensless microfluidic chip acquisition module. After segmentation, the WBC classification network classified the WBCs one by one, and statistics of the classification results were displayed; the touch panel is illustrated in Figure 15. The experiments showed that the system can acquire, segment, and classify cell images correctly. At an FPGA frequency of 100 MHz, a classification speed of 17.9 fps is obtained with an accuracy of 98.44%. Compared with the floating-point method, the accuracy dropped by only 0.56%, while the circuit area decreased by 45%.
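The reported figures imply a per-frame cycle budget that can be derived directly from the clock frequency and the classification speed:

```python
clock_hz = 100e6                       # FPGA operating frequency (100 MHz)
fps = 17.9                             # measured classification speed
cycles_per_frame = clock_hz / fps      # cycles available per classified frame
print(f"cycle budget per frame: {cycles_per_frame:,.0f}")  # about 5.59 million
```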

Conclusions
This paper proposed a CNN-based mobile microfluidic lensless-sensing blood acquisition and analysis system, together with an integer-only quantization scheme applied to the system. Compared to floating-point-based detection, a better tradeoff between accuracy and hardware cost was achieved. The CNN microfluidic lensless-sensing blood-acquisition and analysis system fully meets the needs of current portable medical devices and is conducive to promoting the transformation of AI-based cell acquisition and analysis from large servers to portable cell analysis devices, facilitating rapid early analysis of diseases. With the proposed quantization and its hardware implementation, miniaturization of the system was shown to be viable: an accuracy of 98.44% at 17.9 fps rivals that of bulky in-clinic systems and thus paves the way for miniaturizing such systems.

Conflicts of Interest:
The authors declare no conflict of interest.