CMOS Perceptron for Vesicle Fusion Classification

Edge computing (processing data close to its source) is one of the fastest-developing areas of modern electronics and hardware information technology. This paper presents the implementation of an analog CMOS preprocessor for use in a distributed environment that processes medical data close to the source. The task of the circuit is to analyze vesicle fusion signals; vesicle fusion underlies life processes in multicellular organisms. The functionality of the preprocessor is based on a classifier of full and partial fusions. The preprocessor is designed to operate in amperometric systems, and the analyzed signals are data from carbon nanotube electrodes. The accuracy of the classifier is 93.67%. The implementation was performed in a 65 nm CMOS technology with a 0.3 V power supply. The circuit operates in the weak-inversion mode and is intended to be powered by thermal cells of the human energy harvesting class. The maximum power consumption of the circuit is 416 nW, which makes it possible to use it as an implantable chip. The results can be used, among others, in the diagnosis of precancerous conditions.


Introduction
According to estimates by Gartner Research, in the next three years approximately 75% of data will be processed outside the cloud, i.e., with the use of edge devices [1]. The use of edge devices is the basis of the concept of edge computing, i.e., a distributed data computing environment that works close to the data source (at the edge of the network). The main goal of this concept is to preprocess data before it is sent to the cloud, reducing latency and saving bandwidth [2]. It has been shown many times that edge computing particularly increases the efficiency of the computing environment [3,4]. Edge-computing techniques are among the most frequently used approaches in the analysis of medical data [5]. The integration of artificial intelligence with mobile devices is used, among others, in hospital intensive care units [6]. On the other hand, the development of edge-computing techniques also makes it possible to monitor the health of patients without the need to hospitalize them [7]. Examples of edge-computing applications in healthcare include analysis and therapy for voice disorders [8], coronary heart disease diagnosis [9], detection of behaviors associated with dementia [10], and even management of medical infrastructure [11]. Regardless of the application, all edge processing systems are characterized by computing efficiency suitable for processing a specific type of data close to its source. In this article, we focus on the implementation of an exocytosis signal processing system (exocytosis being a vital process of the cell) using an analog CMOS (Complementary Metal-Oxide-Semiconductor) preprocessor. This is an example of analog data processing as close to the sensors as possible.
We took up this topic for two independent reasons. Primarily, the processing of exocytosis signals is an extremely important medical issue, as it allows for early diagnosis of precancerous conditions, before they are visible in histopathological images [12]. It also allows the diagnosis and monitoring of many other diseases. Exocytosis, its monitoring methods, its medical applications and the related vesicle fusion process are described in the next section. The second reason why the authors decided to take up this topic is the engineering aspect of the problem. In this case, the source of medical data is individual patient cells, which requires implementing the preprocessor as an implantable-chip-class device [13]. Admittedly, hardware implementations of neural networks for signal processing from sensors or for edge computing are known: semiconductor circuits [14], DAC-based (Digital-to-Analog Converter) circuits [15] and memristor-based circuits [16]. The implementations mentioned operate in a frequency band suitable for analyzing biomedical signals such as vesicle fusion signals. However, they are characterized by power consumption of several dozen to several hundred µW and are not suitable for processing current signals with an amplitude below 1 nA. In the case of vesicle fusion signals, the narrow range of currents observed by means of dedicated carbon nanotube (CNT) sensors necessitates the use of reduced supply voltages in the computing circuits. This is a big challenge, especially since the monitoring of precancerous conditions is a long process and should take place without the need to hospitalize the patient. This requires a system that, despite high computing efficiency, can be powered by low-output cells. A strong reduction of power consumption while maintaining high computing efficiency was the main goal of the presented approach.
For all these reasons, the paper presents the methodology of designing an edge processing system that implements the functionality of a neural network for the classification of exocytosis signals. Particular emphasis was placed on ensuring low power consumption, with the possibility of supplying the system from thermal cells used in human energy harvesting techniques [17]. Efforts were also made in the network training phase to ensure minimal use of hardware resources, which goes hand in hand with designing small, non-invasive computing systems.
The article is organized as follows. Section 2 describes the vesicle fusion process, which is the basic mechanism of cell exocytosis; various methods of monitoring fusion are described, along with the sensor techniques used for this purpose. Section 3 describes the weak-inversion mode of CMOS semiconductor circuits and justifies its use in the implementation of an analog preprocessor for processing fusion signals near the cells. Section 4 is devoted to the problem of training a neural network model for the fusion signal classification task; the limitations of the learning process resulting from the hardware implementation of the classifier are discussed. Section 5 presents the parameters of the fusion signal classifier implemented as an analog CMOS circuit in weak inversion. A summary of the advantages of implementing analog CMOS circuits as edge devices is provided in Section 6.

Vesicle Fusion
In multicellular organisms, two basic processes of substance transport between cells take place: exocytosis and endocytosis [18]. For many years, research has been carried out to understand these mechanisms, as they constitute the basis of many life processes, making it possible, among others, to monitor and diagnose cancer [19], in particular cancer metastases [20]. There is also evidence linking these processes to precancerous conditions in cells [12]. Monitoring of exocytosis and endocytosis is also the basis for diagnosing Alzheimer's disease [21] or thrombosis [22]. The mechanism of exocytosis and endocytosis is based on the transport of so-called vesicles [23], which carry chemical messages [24]. In exocytosis, vesicles move outside cells; in endocytosis, inside cells. Communication between cells is based on both of these processes: a vesicle formed in the process of exocytosis in one cell is absorbed in the process of endocytosis in another cell. A particular contribution to understanding vesicle-based transport was made by James E. Rothman, Randy W. Schekman and Thomas C. Südhof, who received the Nobel Prize in Physiology or Medicine in 2013 for this research. There are basically two types of vesicle fusion: full fusion and partial (kiss-and-run) fusion [25]. In the case of full fusion, the entire content of the vesicle is released in the process of exocytosis.
In the kiss-and-run fusion, only part of the vesicle content is released in the process of exocytosis, while the remainder is absorbed by the same cell in the process of endocytosis.
There are two methods of detecting vesicle fusions. The first approach is based on the analysis of image sequences obtained using Total Internal Reflection Fluorescence (TIRF) microscopy [26]. This approach uses machine learning methods, mainly based on Convolutional Neural Networks [27] or Hierarchical Convolutional Neural Networks (HCNN) [28]. The precision of both types of networks in the task of classifying full and partial fusions is higher than 95%. Due to the complexity of deep networks and the need to analyze a sequence of images, this approach is, however, too expensive for application in edge computing systems. Alternative and less computationally expensive methods of detecting vesicle fusions are based on the analysis of amperometric signals [29]. An example of an amperometric AFE (Analog Front-End) system for monitoring vesicle fusion signals during exocytosis is described in paper [30]. The basis of the system is a sensor in the form of a carbon nanotube (CNT) electrode array [31], whose electrodes are attached directly to the tissue. This technology is covered by a NASA patent [32] and is being applied, among others, in medicine, early disease diagnosis, implantable sensors, and analytical instruments [33]. A single AFE for the task of detecting fusions consists of three electrodes: a reference electrode (representing the reference potential), an anode and a cathode, which oxidize charged vesicles [34]. An application with a voltammetric AFE [35] is also possible; in such a situation, the use of a voltage-to-current CMOS converter is additionally required [36]. Examples of the current signals observed at the electrodes are presented in Section 4, which describes the dataset used to train the classifier of full and partial fusion signals. That section also presents the architecture of the perceptron network on which the classifier is based.
It is worth noting that the computational complexity of the perceptron-based classifier is in this case much lower than that of systems used for detecting fusions in an image sequence. This is because the calculations were moved as close as possible to the data source: near the cells, next to the sensor matrix.

Weak Inversion Mode
This paper presents the implementation of the classifier as a CMOS circuit. Processing the biological signals of vesicle fusion requires an unusual mode of operation of the semiconductor structure. CMOS circuits can work in three different modes depending on the supply voltage level [37]. Most applications use the strong inversion mode, in which the gate voltage of the MOS transistor with respect to the source voltage (V_GS) satisfies the condition V_GS > V_T + 100 mV, where V_T is the technology threshold voltage. Lowering V_GS to the range V_T + 100 mV > V_GS > V_T − 100 mV enters the moderate inversion mode [38]. Lowering the supply voltage further, to V_GS < V_T − 100 mV, enters the weak inversion mode. This mode is often used to implement analog computing circuits: multiplier/divider circuits [39], amplifiers [40] or comparators [41]. In this work, the weak inversion mode was used, which results from the need to process currents below 1 nA (typical for amperometric applications) and the need to ensure low power consumption (typical for implantable-chip-class applications). In weak inversion, the drain currents of the MOS transistors are described by Equations (1) and (2) [37], which in their standard subthreshold form read

I_D = I_0 · e^(κ·V_G/U_T) · (e^(−V_S/U_T) − e^(−V_D/U_T))   (1)

for the NMOS transistor and

I_D = I_0 · e^(κ·(V_W−V_G)/U_T) · (e^(−(V_W−V_S)/U_T) − e^(−(V_W−V_D)/U_T))   (2)

for the PMOS transistor, where I_0 is a process-dependent constant, κ is the gate coupling coefficient and U_T is the thermal voltage. For most CMOS technologies these parameters take the following values: κ = 0.7, U_T = 26 mV, and I_0 is below 1 pA for the NMOS transistor and below 10 fA for the PMOS transistor [42]. Contrary to the strong inversion mode, in the weak inversion mode the drain current I_D of the MOS transistor does not depend on the gate-source voltage (V_GS) and the drain-source voltage (V_DS), but directly on the node potentials: source (V_S), drain (V_D), gate (V_G) and, in the case of the PMOS transistor, the well potential (V_W).
The exponential dependence of the drain current on the D, S, G, W potentials of the transistor results in an increase in the steepness of the drain characteristic in the triode region and a complete flattening of the drain characteristic in the saturation region. This makes it easier to drive current-mode circuits whose operating principle is based on maintaining the polarization of the transistors in the saturation region. The implementation presented in this paper is largely based on current mirrors operating in saturation. Additionally, it is worth emphasizing that in the weak inversion mode the figure of merit described by Equation (3), FoM = f_MAX / P_MAX, understood as the ratio of the maximum frequency f_MAX to the maximum consumed power P_MAX, is more favorable than in the strong inversion mode.
This means that as the supply voltage is lowered, the power consumed decreases faster than the maximum processing frequency. This is another argument for using weak inversion mode in a data processing task close to its source. The FoM parameter for the structures used in this work is listed at the end of the current section.
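As a numerical illustration (not taken from the paper), the standard subthreshold drain-current model that Equations (1) and (2) refer to, and the figure of merit of Equation (3), can be sketched in a few lines. The bias point below is an arbitrary example, and I_0 is set to the 1 pA upper bound quoted in the text:

```python
import math

KAPPA = 0.7    # gate coupling coefficient (value from the text)
U_T = 0.026    # thermal voltage [V]
I_0 = 1e-12    # NMOS process-dependent constant [A], upper bound from the text

def nmos_subthreshold_id(v_g, v_s, v_d):
    """Standard weak-inversion NMOS drain current; node potentials in volts."""
    return I_0 * math.exp(KAPPA * v_g / U_T) * (
        math.exp(-v_s / U_T) - math.exp(-v_d / U_T))

# In saturation (V_DS >> U_T) the second exponential becomes negligible,
# so I_D flattens and depends exponentially on V_G and V_S only.
i_triode = nmos_subthreshold_id(v_g=0.2, v_s=0.0, v_d=0.05)
i_sat = nmos_subthreshold_id(v_g=0.2, v_s=0.0, v_d=0.3)

# Figure of merit from Equation (3): FoM = f_MAX / P_MAX,
# evaluated here with the classifier's reported 875 samples/s and 416 nW.
fom = 875 / 416e-9   # in 1/J; about 2.1 per nJ
```

Note how both currents stay far below 1 nA, which is the regime the amperometric application requires.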
To implement the neural network as a VLSI circuit, the following three circuits were used: a programmable multiplier described in paper [43], which implements the neuron weight and is shown in Figure 1; a circuit described in paper [42], which implements the nonlinear neuron activation function and is shown in Figure 2; and a circuit for removing the common-mode component, described in paper [43] and shown in Figure 3. The multiplier shown in Figure 1 consists of two series-connected six-output current mirrors controlled by twelve switches. The output current depends on the switch settings, and its value is in accordance with Equation (4).
The scaling factors w1i, w2j are selected during the design stage of the multiplier IPcore. Their values depend on the ratio of the device transconductance of the transistors at the input stage and at the i-th or j-th output stage, according to the principle described in more detail in paper [43]. At the implementation stage of the neural network using IPcores, the bit words B1, B2 are selected so that the multiplier implements the neuron weight learned using the framework described in detail in Section 4.3.
As for the hardware implementation of the nonlinear neuron activation function, the most commonly used method is to approximate the function with a piecewise-linear function. While this implementation is generally satisfactory with respect to the fidelity of reconstructing the hyperbolic activation function, it is relatively expensive. This implementation uses the circuit shown in Figure 2, designed and described by the authors in paper [42]. The advantage of the circuit is its simple architecture, based on only six transistors. The disadvantage, however, is the need to use a dedicated activation function in the learning process. The form of the activation function is shown in Section 4.2, where the limitations resulting from its use are discussed in greater detail.
The last circuit used in the implementation is the one shown in Figure 3, which removes the common-mode component; it is described in paper [43]. Its structure is based on three two-output current mirrors with scaling coefficients of 1 or 0.5. The principle of operation is to determine the value of the common-mode component and then subtract it from the useful signal. This is done simultaneously for both the non-negated and the negated signal.
There are some limitations in using these IPcores. The structure of the final preprocessor must contain classic current mirrors with scaling factors of 1, whose task is to duplicate current signals in the perceptron. The circuit implementing the weights introduces limitations concerning the permissible spread of weights in the network, while the circuit implementing the activation function imposes the need to use a dedicated activation function. The procedure of training the network under these limitations is described in more detail in the next section.
At the end of the description of the structures used, let us analyze their FoM parameter from Equation (3). Figure 4b shows the dependence of the parameter on the supply voltage VDD for the circuit from Figure 4a, i.e., a circuit implementing a single neuron weight in a balanced structure with removal of the common-mode component. The analysis uses a mirror programmed to implement the highest weight, which means that all output stages in the mirrors are active. The FoM parameter achieves its best, i.e., highest, value for a 0.3 V supply voltage. This is the supply voltage that was used to implement the perceptron.

Dataset Description
The data used to train the neural network consisted of artificially generated time series of currents resembling CNT sensor readings from vesicles. The dataset was generated using a script that was previously used in [30] for data generation purposes and is described there in more detail. Overall, our dataset consists of 3600 examples. The generated time series are 180 ms long and represent three classes: full fusion, kiss-and-run fusion and no fusion. Class examples are presented in Figure 5. The dataset was designed so that vesicle fusions are detected only at specific moments in time, as mentioned in [30]. Fusions shifted by more than the −10 to +10 ms window were treated as the no-fusion class, in addition to the no-activity time series depicted in the last plot in Figure 5.
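Purely as a hedged sketch (the actual generator script from [30] uses different, more realistic waveform models, and the amplitudes and pulse shapes below are invented for illustration), the three classes could be mimicked like this, with 18 samples covering the 180 ms window at the 10 ms sampling period used later in the text:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 18   # 180 ms window at a 10 ms sampling period

def synthetic_example(kind):
    """Hypothetical class sketch: 'full', 'kiss_and_run' or 'none'."""
    x = 0.02 * rng.standard_normal(N)   # baseline sensor noise [nA]
    t = np.arange(N)
    if kind == "full":
        # large, wide current transient centered mid-window
        x += 0.9 * np.exp(-0.5 * ((t - 9) / 2.0) ** 2)
    elif kind == "kiss_and_run":
        # smaller, narrower transient (partial content release)
        x += 0.4 * np.exp(-0.5 * ((t - 9) / 1.0) ** 2)
    return x   # 'none': noise only

example = synthetic_example("full")
```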

Architecture
A 10 ms sampling period of the given time series was employed, which resulted in network input data in the form of vectors with 18 elements. The network structure was designed with simplicity and minimization of resources in mind. Eventually, a two-layer 12-2 architecture was settled on. As mentioned before, the network's task was three-class classification, but due to the minimization of resources, two neurons were employed in the output layer: one representing the full fusion class and the other the kiss-and-run fusion class. These two output neurons technically perform multilabel classification, with labels in the form of binary vectors (00, 01, 10, 11). In multilabel classification, one input can have multiple classes. We prevented that from happening by using labels that have at most one class per example (00, 01, 10). Overall, this technique allows us to use two output neurons instead of the three that would be required in a network performing multiclass classification with softmax activation. It is also worth noting that the third class, no fusion, was considered detected when both output neurons answered with values below the chosen decision thresholds (examples of no fusion were given the 00 label).
Based on [42], all of the neurons have a uniform activation function described by Equation (6) and shown in Figure 6. This is not a classic sigmoidal function. It describes the relation between the output and input currents of the circuit presented in Figure 2; this relation is also known as the static characteristic of a circuit. Its parameters were matched based on IPcore simulation results. The main advantage of using this circuit is its size: it is made of only six transistors. The learning process with a dedicated activation function is more difficult, but the complexity of the system is ultimately much lower. For the same reason, none of the neurons had biases. This made the learning process somewhat more difficult, but it reduced the number of required multipliers in the hardware implementation. Overall, the network contained 240 trainable parameters.
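A minimal forward-pass sketch of the described 18-12-2 structure, assuming a hypothetical tanh-shaped stand-in for the Equation (6) static characteristic (the true IPcore curve differs) and random weights in place of the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the Equation (6) activation: a saturating,
# zero-centered current characteristic; the real IPcore curve is different.
def activation(i_in_nA):
    return 1.03 * np.tanh(i_in_nA)   # output saturates near +/-1.03 nA

# Two-layer 12-2 perceptron on an 18-sample input, with no biases,
# as described in the text. 12*18 + 2*12 = 240 trainable parameters.
W1 = rng.uniform(-1.136377, 1.136377, size=(12, 18))
W2 = rng.uniform(-1.136377, 1.136377, size=(2, 12))

def forward(x):
    h = activation(W1 @ x)       # hidden layer, 12 neurons
    return activation(W2 @ h)    # outputs: full fusion, kiss-and-run

x = rng.uniform(-1.0, 1.0, size=18)   # input currents in <-1 nA, 1 nA>
y = forward(x)
# A class is detected when its output exceeds the learned decision threshold.
```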

Learning
The network was trained using the TensorFlow Keras framework. The approach to training was unorthodox. Training was performed with the sigmoid activation function in the last layer and the custom activation from Figure 6 in the first layer. Unlike the custom activation, the sigmoid activation returns values from the finite range <0, 1>. The <0, 1> range is needed for cross-entropy loss functions and for metrics such as AUC (area under the curve) to work correctly, as they require the input values (the output of the network) to lie in the finite range <0, 1>. This approach, however, required translating the class decision thresholds back to the custom activation function after training. The method was only possible because both activation functions are centered around zero and have similar monotonicity.
Before training, a hard constraint on the weights of the network had to be taken into account: the weights must reside in two intervals, <−1.136377, −0.02> and <0.02, 1.136377>. This limitation results from the range of the multiplier coefficient realized by the circuit from Figure 1. It is possible to increase this range, but that would require adding additional output stages in the circuit. This is disadvantageous, as the subsequent stages would have to have extremely long transistor channels. The authors decided to present an implementation with a small area, for applications in thermoelectric energy harvesting techniques. For training purposes, the weights were limited to the <−1.136377, 1.136377> range. After training, however, weights from the (−0.02, 0.02) range were manually set to zero. This approach resulted in a small loss on the testing metrics. As for negative weight values, guaranteeing them in a balanced structure is not a problem and amounts only to modifying the routing of the negated and non-negated signals.
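The constraint and the post-training zeroing step can be sketched as follows; the function names are illustrative, not taken from the authors' training code:

```python
import numpy as np

W_MAX = 1.136377   # largest coefficient realizable by the multiplier IPcore
DEAD = 0.02        # magnitudes below this cannot be realized in hardware

def constrain(w):
    """Clip weights to the range used during training."""
    return np.clip(w, -W_MAX, W_MAX)

def zero_small(w):
    """Post-training step: zero the weights the hardware cannot implement."""
    return np.where(np.abs(w) < DEAD, 0.0, w)

w = np.array([1.5, -0.01, 0.5, -2.0, 0.015, -0.3])
w_hw = zero_small(constrain(w))
# -> [ 1.136377, 0.0, 0.5, -1.136377, 0.0, -0.3 ]
```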
Binary cross-entropy was chosen as the loss function and Adam as the optimizer, with a learning rate of 0.0001 and a batch size of 8. In both layers, L2 regularization was employed during training to penalize large weights and promote small non-zero ones. The main metric we used for measuring the network's performance was ROC AUC (area under the ROC curve). It has one advantage over standard accuracy: it does not assume the decision threshold to be static. This means that manual selection of the best threshold after training was needed. ROC AUC is the Riemann sum of the ROC curve (receiver operating characteristic curve). For each class there is a single ROC curve, and the final AUC score is averaged over all binary classes (in our case, two classes: full fusion and partial fusion). The ROC curve is a plot of TPR (true positive rate) against FPR (false positive rate), calculated for uniformly distributed class decision thresholds from the <0, 1> range on a given dataset. In our case, the step between subsequent thresholds equals 0.005, resulting in 200 data points on the ROC curve for each class. The TPR and FPR formulas are given in Equations (7) and (8): TPR = TP/(TP + FN) and FPR = FP/(FP + TN), where TP (true positive), FN (false negative), TN (true negative) and FP (false positive) stand for the number of classifications of a particular type on a given dataset at the current decision threshold.
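The TPR/FPR computation behind the ROC sweep can be sketched as follows (a minimal illustration on toy data, not the authors' evaluation code):

```python
import numpy as np

def tpr_fpr(scores, labels, threshold):
    """Equations (7) and (8): TPR = TP/(TP+FN), FPR = FP/(FP+TN)."""
    pred = scores >= threshold
    tp = np.sum(pred & (labels == 1))
    fn = np.sum(~pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    tn = np.sum(~pred & (labels == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Sweeping thresholds over <0, 1> with step 0.005 gives 200 ROC points
# per class, matching the text.
scores = np.array([0.9, 0.8, 0.3, 0.1])   # toy network outputs for one class
labels = np.array([1, 1, 0, 0])           # toy ground truth
points = [tpr_fpr(scores, labels, t) for t in np.arange(0.0, 1.0, 0.005)]
```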
The dataset was split so that 2400 examples were used for training and 600 examples each for validation and testing. The network was trained for 272 epochs. It scored 99.31% ROC AUC on the test set. ROC AUC and loss over epochs are presented in Figures 7 and 8, while histograms of the weights over epochs are presented in Figures 9 and 10.
Zeroing weights from the (−0.02, 0.02) range caused ROC AUC to drop by approximately 0.2% on the test set. The final ROC curves for the full fusion and partial fusion classes are shown in Figures 11 and 12.
The procedure of zeroing weights from the (−0.02, 0.02) interval also caused some neurons to have all input connections with weights equal to 0, so they were removed. This shrank the network from the 12-2 architecture to 8-2. Lastly, the sigmoid activation functions in the last layer were changed back to the custom activation from Figure 6, and optimal decision thresholds for the output layer neurons were calculated. They equalled 1.05 nA for full fusion and 0.9975 nA for kiss-and-run fusion. When a neuron assigned to a class answers with a current equal to or above the threshold, the class is considered detected. The final accuracy of the network with both class thresholds equalled 96.17%.
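The pruning of dead hidden neurons described above amounts to dropping all-zero weight rows together with the matching output-layer columns; a minimal sketch (illustrative weight shapes, not the trained values):

```python
import numpy as np

def prune_dead_neurons(W1, W2):
    """Remove hidden neurons whose input weights are all zero,
    along with the corresponding columns of the output layer."""
    alive = np.any(W1 != 0.0, axis=1)   # hidden neurons with any nonzero input
    return W1[alive], W2[:, alive]

W1 = np.zeros((12, 18))
W1[:8, :] = 0.5                 # 8 of the 12 hidden neurons stay nonzero
W2 = np.ones((2, 12))
W1p, W2p = prune_dead_neurons(W1, W2)
# the 12-2 architecture shrinks to 8-2, as in the text
```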

CMOS Classifier Parameters
This section describes the parameters of the CMOS circuit implementing the functionality of the aforementioned perceptron, which constitutes the fusion signal classifier. The circuit is made of the IPcores described in Section 3. To implement the circuit we used the TSMC 65 nm CMOS technology. The process of generating the network structure was partly performed using proprietary EDA (Electronic Design Automation) tools [44]. The final circuit works in a balanced structure and consists of 20 blocks implementing the activation function and 320 reconfigurable multiplier blocks. Additionally, the structure embeds 320 blocks for removing the common-mode component. Due to the use of the current mode, 36 eight-output mirrors and 16 two-output mirrors are also required to duplicate the signals connecting the perceptron layers. The simplified connection diagram of the circuits in the preprocessor is shown in Figure 13. The diagram corresponds to the two-layer structure of the classifier and contains, in each of the two layers, a cascade of four blocks: duplicators, multipliers, common-mode removal blocks and circuits that perform the activation function. The number of connections in individual preprocessor stages is shown in the diagram in rectangular frames drawn with a dashed line. Each of these connections carries current signals. The entire preprocessor is made up of 52,904 transistors in total. The analyses were performed using the Eldo simulator, which is part of the Mentor Graphics software. Figure 14 shows an example response of the implemented CMOS circuit to signals from the electrodes for two fusion cases: full and partial. According to the adopted perceptron learning method, fusion is classified by exceeding the classification threshold. A new common threshold for both types of fusion has been established at 595 pA, which is marked on the output waveforms in Figure 14.
In the presented example, for both cases of fusion, only one of the outputs exceeds the classification threshold.
The classifier is supplied with a voltage of 0.3 V. The range of the analyzed input currents is limited to <−1 nA, 1 nA> and the range of output currents to <−1.03 nA, 1.03 nA>. The circuit samples fusion signals at 875 samples/s. The average power consumption of the circuit is 410 nW, and the maximum power consumption is 416 nW. The FoM parameter from Equation (3) is 2.098 1/nJ based on the average power consumption. The area of the active part of the circuit equals 1.429 mm². The estimated area of the thermoelectric cell for human energy harvesting techniques needed to power the preprocessor equals 1.39 mm², at a typical efficiency of 30 µW/cm², which is 97% of the active-part area. This means that the cell required for full operation of the preprocessor is of a size comparable to its layout area. Therefore, the above implementation can be classified as a circuit of the implantable-chips class operating without an external power source. The applications presented in the literature report the consumed power per number of computation channels (NC). For example, an analog multilayer perceptron for a portable electronic-nose application was presented with a power consumption per channel of 27.65 µW [45]. In another biologically inspired approach for pattern recognition, the power consumption per channel was 17 nW [46]. As a third example, we can give a semiconductor implementation of a network with the WTA (Winner Takes All) mechanism, for which the power consumption per channel equals 18.3 µW [47]. As for the implementation of the preprocessor described in the current article, the number of channels is 320, which is the number of multipliers. The preprocessor power consumption per channel is 1.28 nW, which is at least an order of magnitude lower than in similar implementations.
The accuracy of the classifier was verified on a set of 150 full fusion patterns and 150 partial fusion patterns. Corresponding sets of negative patterns, 150 patterns each, were also used. The average accuracy of the classification was 93.67%. This means a decrease in the accuracy of the CMOS classifier of only 2.5% compared to the learned model. The average precision determined on the same set is 96.33%. This is comparable to advanced methods based on deep networks analyzing image sequences, which achieve, based on [27], a precision of 95.0% for full fusion and 96.7% for partial fusion, and based on [28], a precision of 95.2% for full fusion and 96.1% for partial fusion. These parameters were obtained using an analog implementation of a very simple two-layer network consisting of 10 neurons.

Conclusions
The paper presents a hardware CMOS implementation of a perceptron network for the task of processing medical data in the form of time waveforms. The approach described in the paper is an example of edge processing, as the data is processed very close to its source, i.e., immediately after it is obtained from the cells. The work uses techniques to obtain the best possible ratio of data processing speed to consumed power. In particular, the following can be enumerated: the use of a simple perceptron network as a classifier, the use of simple CMOS structures to implement the perceptron, limiting the spread of weights and omitting biases in the network training process, and finally the use of the weak inversion mode. Despite the use of a very simple perceptron architecture, the accuracy of the classifier is comparable to that of much more complex machine learning methods. The preprocessor consumes so little power that no additional power source is required apart from typical thermal cells of the human energy harvesting class. Both the precision parameters of the classifier itself and the electrical parameters of its hardware implementation place it in the class of TinyML solutions [48], which are the next step in the development of the Edge-AI concept. Medical data processing applications using implantable chips, due to their low production costs and low power requirements, can improve the level of medical care in many countries facing enormous challenges in terms of connectivity, energy and cost [49]. The results of the work can be used in the early diagnosis of precancerous conditions thanks to the possibility of tracking life processes in cells. The approach described in the paper is a compromise between computational efficiency and consumed power. It increases the patient's comfort thanks to the significant reduction of the system dimensions (both the size of the preprocessor and the dimensions of the power source).
This approach will certainly work well in various wearable devices applications, especially in combination with the sensor fusion concept. This is a further area of the authors' research.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: