Implementation of Pruned Backpropagation Neural Network Based on Photonic Integrated Circuits

Abstract: We demonstrate a pruned high-speed and energy-efficient optical backpropagation (BP) neural network. Micro-ring resonator (MRR) banks, as the core of the weight matrix operation, are used for large-scale weighted summation. We find that training a pruned MRR weight-bank model gives performance equivalent to training a randomly initialized model. Results show that the overall accuracy of the optical neural network on the MNIST dataset is 93.49% after pruning six-layer MRR weight banks under the condition of low insertion loss. This work is scalable to much more complex networks, such as convolutional and recurrent neural networks, and provides a potential guide for truly large-scale optical neural networks.


Introduction
In the past twenty years, deep neural networks have become a milestone in artificial intelligence owing to their superior performance in various fields, such as autonomous driving [1], stock forecasting [2], intelligent translation [3], and image recognition [4]. However, because of the enormous computation required for matrix multiplication, traditional central processing units are gradually becoming suboptimal for implementing deep learning algorithms. Silicon photonics [5] offers superior energy consumption [6,7] and computational rate [8] compared with electronics, and has therefore become an attractive platform. Photonic integrated circuits can easily realize matrix multiplication through the coherence and superposition of linear optics [9]. These advantages have promoted extensive demonstrations of high-performance optical functionalities, especially in photonic computing [10][11][12][13][14][15]. Programmable Mach-Zehnder interferometers realize optical interference units for optical neural networks [11,16], and MRRs are used as optical matrix-vector multipliers to implement neuromorphic photonic networks [17][18][19]. In addition, diffractive optics [20][21][22], spatial light modulators [23,24], semiconductor optical amplifiers [25], optical modulators [26,27], and other related optical devices have been used to build powerful deep learning accelerators, which can perform machine learning tasks with ultra-low energy consumption [28].
The realization of large-scale optical neural networks is still a challenge, though matrix operations can be implemented with special optical components. Excessive integrated optical elements not only increase the manufacturing cost and loss of photonic integrated circuits but also make the adjustment of optical elements more complex [29]. Pruning has been widely used to simplify the model of neural networks [30]. As a typical method of model compression, it can solve the problem of over-parameterization effectively by removing redundant parameters.
In this paper, we demonstrate an optical neural network based on a pruned BP model. The MRR banks used as the matrix computing core can be pruned to the optimum size.
Results indicate that the accuracy of the proposed algorithm is 93.49% on the MNIST dataset with six-layer MRR weight banks pruned. To the best of our knowledge, network pruning has not previously been used to compress optical components. This work can be extended to more complex networks and provides a feasible path toward truly large-scale all-optical neural networks.

Backpropagation Neural Network
The BP [31] neural network is widely used in classification for its strong nonlinear mapping ability. A typical BP neural network includes an input layer, hidden layers, and an output layer. The numbers of nodes in the input and output layers are fixed, while the number of nodes in the hidden layers largely affects the performance of the network.
The training of a BP neural network includes two processes: forward propagation and backpropagation. In forward propagation, input data are processed to produce an output, as in a typical neural network. The difference arises when the expected outputs differ from the actual outputs. In backpropagation, the deviation is transmitted backward layer by layer and distributed to all nodes. The parameters of the network are corrected according to this deviation information so that the deviation decreases along the steepest gradient direction. The weights and biases of a BP neural network are updated as follows.
The output of layer $L$ is defined by:

$$y^L = f\left(w^L y^{L-1} + b^L\right),$$

where $f$ is the activation function of the neural network, $y^{L-1}$ represents the output of layer $L-1$, and $w^L$ and $b^L$ are the weight and bias of layer $L$. The evaluation function is defined by:

$$Err = \frac{1}{2N}\sum_{j=1}^{N}\left(t_j - y_j\right)^2,$$

where $N$ represents the total number of samples, and $t_j$ and $y_j$ represent the actual and predicted categories, respectively. The iterative formulas for the weights and biases based on the gradient descent method in the BP model are given by:

$$w^L \leftarrow w^L - \eta \frac{\partial Err}{\partial w^L}, \qquad b^L \leftarrow b^L - \eta \frac{\partial Err}{\partial b^L},$$

where $\eta$ is the learning rate, a parameter used to control the convergence rate. The value of $\eta$ is larger at the early stage of training, so the weights and biases are updated at a faster speed; $\eta$ is then gradually reduced, which helps the training converge.
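As a concrete reference, the update rules above can be sketched in NumPy for a network with one hidden layer. The sigmoid activation, layer sizes, and learning rate below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    h = sigmoid(w1 @ x + b1)   # hidden-layer output y^1 = f(w^1 x + b^1)
    y = sigmoid(w2 @ h + b2)   # network output y^2 = f(w^2 y^1 + b^2)
    return h, y

def bp_step(x, t, w1, b1, w2, b2, eta=0.5):
    """One gradient-descent step on the squared-error loss for one sample."""
    h, y = forward(x, w1, b1, w2, b2)
    # Backpropagate the deviation (y - t) through the sigmoid derivatives.
    delta2 = (y - t) * y * (1 - y)           # output-layer error term
    delta1 = (w2.T @ delta2) * h * (1 - h)   # hidden-layer error term
    # Updates w <- w - eta * dErr/dw, b <- b - eta * dErr/db.
    w2 -= eta * np.outer(delta2, h)
    b2 -= eta * delta2
    w1 -= eta * np.outer(delta1, x)
    b1 -= eta * delta1
    return w1, b1, w2, b2
```

Repeated calls to `bp_step` drive the output toward the target along the steepest gradient direction.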
Despite its huge success, a BP neural network requires substantial computing capacity because of its excess parameters. Therefore, network pruning and optical neural networks are introduced to remove redundant parameters and speed up inference.

Network Pruning
Over-parameterization is widely recognized as one of the basic characteristics of deep neural networks. It gives deep neural networks strong nonlinear mapping ability at the expense of high computational cost and memory occupation. Pruning has been identified as an effective technique to address this problem [32]. Weight-elimination [33] is a typical pruning method: it introduces a regularization term representing structural complexity into the network objective function, which makes the weights sparser.
Reference [34] introduces the minimum description length (MDL) to describe the complexity of machine learning. According to MDL, the error function is defined by:

$$Err = Err_1 + \lambda \, Err_2,$$

where $Err_1$ is the evaluation function defined above and $\lambda$ is a dynamical parameter that can be adjusted. $Err_2$ is given by:

$$Err_2 = \sum_i \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2},$$

where $w_0$ is called the base weight and has a fixed value. Unlike the training of the BP model, pruning introduces regularization into the weight adjustment, which is defined by:

$$w_i \leftarrow w_i - \eta \left( \frac{\partial Err_1}{\partial w_i} + \lambda \frac{\partial Err_2}{\partial w_i} \right), \qquad \frac{\partial Err_2}{\partial w_i} = \frac{2 w_i / w_0^2}{\left(1 + w_i^2 / w_0^2\right)^2}.$$

Redundant weights decrease continuously during training until they are small enough to be deleted. A hidden node is deleted when all of its output weight values are close to zero, and is incorporated into the offset node when all of its input weights are close to zero. We then obtain a simplified model, as shown in Figure 1: the red node is pruned because all of its input weight values are close to zero, with its bias transmitted to the next layer.
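The weight-elimination penalty and a simple threshold prune can be sketched as follows; the base weight `w0`, the pruning tolerance `tol`, and the function names are illustrative assumptions:

```python
import numpy as np

def weight_elimination_penalty(w, w0=1.0):
    """Err2 = sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2): saturates near 1 for
    large weights, so it penalizes the *number* of significant weights."""
    r = (w / w0) ** 2
    return float(np.sum(r / (1.0 + r)))

def weight_elimination_grad(w, w0=1.0):
    """dErr2/dw_i, added (scaled by lambda) to the BP gradient during pruning."""
    r = (w / w0) ** 2
    return (2.0 * w / w0 ** 2) / (1.0 + r) ** 2

def prune_small_weights(w, tol=1e-2):
    """Delete weights driven close to zero by the regularization."""
    w = w.copy()
    w[np.abs(w) < tol] = 0.0
    return w
```

A hidden node whose whole row or column of weights ends up at zero can then be removed from the model, shrinking the matrix that must be realized optically.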

Optical Components and Silicon Photonic Architecture
The MRR banks are the core of the matrix computing. An add-drop MRR consists of two straight waveguides and a circular waveguide; the straight waveguide at the drop end is sometimes set in a curved shape to reduce crosstalk [35]. It has four ports, respectively called the input, through, upload, and drop ports, as shown in Figure 2a, and $z_1$ and $z_2$ serve as the coupling regions. Such a silicon waveguide can be fabricated by nanophotonic processing technology, which is compatible with standard complementary metal oxide semiconductor (CMOS) fabrication [36,37]. The add-drop MRR resonance condition is described by:

$$\theta = \beta L = \frac{2\pi n_{eff}}{\lambda} L, \qquad L = 2\pi r,$$

with resonance occurring when $\theta = 2m\pi$ for an integer $m$, i.e., $m\lambda = n_{eff} L$. The parameter $\theta$ represents the phase bias accumulated through the circular waveguide, and $\beta$ is the propagation constant of light, which is mainly determined by the wavelength $\lambda$ and the effective refractive index $n_{eff}$ between the ring and the waveguide. The parameters $L$ and $r$ represent the circumference and radius of the MRR, respectively. Applying a current to an MRR changes the value of $n_{eff}$, yielding a shift of the resonance peak [38].
The transfer function of the optical intensity going from the input to the drop port is represented by:

$$T_d = \frac{k_1^2 k_2^2 \alpha}{1 - 2 t_1 t_2 \alpha \cos\theta + (t_1 t_2 \alpha)^2},$$

and the transfer function of the through-port light intensity with respect to the input light is:

$$T_p = \frac{t_2^2 \alpha^2 - 2 t_1 t_2 \alpha \cos\theta + t_1^2}{1 - 2 t_1 t_2 \alpha \cos\theta + (t_1 t_2 \alpha)^2},$$

where the parameters $t_1$ and $t_2$ represent the transmission coefficients of $z_1$ and $z_2$, and $k_1$ and $k_2$ represent the mutual coupling factors, respectively. The parameter $\alpha$ defines the loss coefficient in the ring waveguide. Figure 2c shows an MRR without resonance, where all of the light is output to the through port. Figure 2d shows an MRR in the resonant state, where light from the straight waveguide is transmitted into the ring; the effective refractive index between the waveguide and the MRR ring causes a phase shift of the light, which interferes with the original light and is finally output to the drop port. Assuming that the input light has an amplitude of $E_0$ and the coupling losses are negligible, the light intensities at the drop and through ports are:

$$I_d = E_0^2 T_d, \qquad I_p = E_0^2 T_p.$$

Figure 2b gives the transfer function of an MRR with a radius of 5 µm, where the parameters of the coupling regions are identical ($k_1 = k_2 = 0.18$). Different wavelengths of light have different resonance conditions; the red curve represents the through port and the blue one the drop port.
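As a numerical check, the add-drop transfer functions can be evaluated as below. The functional form is the standard textbook add-drop result, stated here as an assumption, with $k_1 = k_2 = 0.18$ taken from the text and an illustrative loss coefficient:

```python
import numpy as np

def mrr_transfer(theta, k1=0.18, k2=0.18, alpha=0.99):
    """Drop- and through-port intensity transmission of an add-drop MRR
    as a function of the round-trip phase theta."""
    t1 = np.sqrt(1 - k1 ** 2)   # self-coupling (transmission) coefficients
    t2 = np.sqrt(1 - k2 ** 2)
    denom = 1 - 2 * t1 * t2 * alpha * np.cos(theta) + (t1 * t2 * alpha) ** 2
    T_d = (k1 ** 2) * (k2 ** 2) * alpha / denom                       # drop port
    T_p = (t2 ** 2 * alpha ** 2
           - 2 * t1 * t2 * alpha * np.cos(theta)
           + t1 ** 2) / denom                                         # through port
    return T_d, T_p
```

A quick sanity check of this form: for a lossless ring ($\alpha = 1$) the two ports conserve power, $T_d + T_p = 1$, and the drop transmission peaks on resonance ($\theta = 2m\pi$).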
In this work, we demonstrate the matrix operation of BP with MRR banks. Figure 3 illustrates the optical backpropagation network architecture. Suppose we handle a two-dimensional $D \times R$ matrix; we need $R$ lasers generating $M$ wavelengths, where $M$ is the number of pixel types. We use $R$ modulators to represent the value of each pixel, each of which keeps the light intensity of the corresponding carrier proportional to the serialized input pixel value [27]. A WDM then multiplexes the $R$ carriers and splits them into $D$ separate lines. There are $D$ MRR arrays, each with $R$ MRRs on its line, so we obtain a $D \times R$ weight matrix. Thus, the standard multiply-accumulate operation in BP is represented by:

$$y_d = \sum_{i=1}^{R} A_i F_i,$$

where $y_d$ is the output of line $d$ and $A_i$ is the light intensity of carrier $i$ after modulation and multiplexing. $F_i$, a particular weight implemented by the MRR weight banks and balanced photodiodes, can be described by:

$$F_i = g\left(T_{d,i} - T_{p,i}\right),$$

where $g$ is the gain of a transimpedance amplifier (TIA), which ensures that $T_d - T_p$ is not limited to the range $-1$ to $+1$ [19]. The sum of $E_0 F_i$ is a predictable bias.
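A minimal sketch of one weight-bank line, assuming the balanced-photodiode weight $F_i = g(T_{d,i} - T_{p,i})$, identical couplers, and illustrative values for the gain `g`, loss, and per-MRR phases:

```python
import numpy as np

def mrr_dp(theta, k=0.18, alpha=0.99):
    """Drop/through transmissions of one add-drop MRR (k1 = k2 = k assumed)."""
    t = np.sqrt(1 - k ** 2)
    denom = 1 - 2 * t * t * alpha * np.cos(theta) + (t * t * alpha) ** 2
    T_d = k ** 4 * alpha / denom
    T_p = (t ** 2 * alpha ** 2 - 2 * t ** 2 * alpha * np.cos(theta) + t ** 2) / denom
    return T_d, T_p

def weight(theta, g=2.0):
    """F_i = g (T_d - T_p): balanced detection yields signed weights."""
    T_d, T_p = mrr_dp(theta)
    return g * (T_d - T_p)

def line_output(A, thetas, g=2.0):
    """y_d = sum_i A_i F_i: photocurrent summation along one line."""
    F = np.array([weight(th, g) for th in thetas])
    return float(np.asarray(A) @ F)
```

Tuning each ring's phase `theta` (via its resonance shift) sets the sign and magnitude of the corresponding weight, so one line of $R$ rings performs one row of the matrix-vector product.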

Photonic Pruning Neural Network
In this work, we trained a neural network based on a pruned BP model to perform image recognition on the MNIST dataset; Figure 4 depicts this model in detail. The pruned BP model parameters are pretrained on a digital computer (PC), and an optical neural network performs the matrix multiplication in inference. First, we prune the model during training based on weight-elimination, and the weights and biases of the pruned model are uploaded to the optical neural network to calculate the optical device parameters. Each 8 × 8 two-dimensional image in the MNIST dataset is converted into a one-dimensional vector of 64 × 1, fed into the photoelectric modulators, and modulated onto 64 optical carrier signals. The 64 multiplexed optical signals are transmitted through 50 waveguide lines, each carrying an array of 64 MRRs, and the signal from the weight banks is output to the digital computer for nonlinear activation. Finally, the transformed vector is fed to the fully connected layer with 10 nodes, where the result of the MNIST classification appears.
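The hybrid pipeline described above (64-pixel input, a 50 × 64 weight bank, electronic activation, and a 10-node output layer) can be sketched end to end, with the MRR weight bank idealized as a matrix product. The sigmoid activation and random placeholder weights are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer(image_8x8, W1, b1, W2, b2):
    """Classify one 8x8 image with the pruned-BP hybrid pipeline."""
    x = np.asarray(image_8x8).reshape(64)   # serialize the image to 64 x 1
    h = sigmoid(W1 @ x + b1)                # 50x64 MRR weight bank + PC activation
    logits = W2 @ h + b2                    # 10-node fully connected layer
    return int(np.argmax(logits))           # predicted digit 0..9
```

In the demonstrated system only the `W1 @ x` step runs optically; the activation and the output layer execute on the digital computer.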

Results and Discussion
In the pruning experiments, we recorded the error and the regularization coefficient of each training run during pruning, as shown in Figure 5a,b. The error curve in the training process is relatively smooth, while the regularization coefficient is not, which means that the process of weight differentiation does not lead to sharp fluctuations in training and has little effect on the pruning results. However, we find that the pruning of small-scale neural networks is not always stable. The experimental data show that the accuracy of the pruned model can still reach 85.25% with half of the original information retained. A small-scale neural network model easily falls into a local extremum, which means that its global convergence is difficult to guarantee in training. Although model compression is feasible for small-scale neural networks, the compression benefit is less pronounced than for large-scale deep neural networks because of the lack of redundant parameters. Figure 6 depicts the prediction accuracy of the BP network when different numbers of nodes are pruned; the prediction accuracy is best, reaching 95.39%, when 6 of the 50 hidden-layer neurons are pruned.

We then demonstrate a three-layer optical backpropagation network architecture by pruning six layers of MRR banks. In general, the bit precision of the simulator affects the network's ability to recognize new input data, and the performance of an optical neural network designed with 4 bits or fewer is significantly diminished [19]. Hence, we use a 5-bit architecture, set the modulation rate to 5 GHz, and obtain a running time of 320 ns for the input data without considering other power consumption. Figure 7a shows the serialized input images (numbers 5, 3, and 1, respectively).
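Mapping trained weights onto the 5-bit architecture can be sketched with a uniform quantizer; the uniform level spacing over the weight range is an assumption about the hardware mapping, not the paper's stated scheme:

```python
import numpy as np

def quantize(w, bits=5):
    """Round each weight to one of 2**bits uniformly spaced levels spanning
    the weight range, mimicking the limited resolution of MRR tuning."""
    w = np.asarray(w, dtype=float)
    lo, hi = w.min(), w.max()
    if hi == lo:                      # degenerate range: nothing to quantize
        return w.copy()
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    return lo + np.round((w - lo) / step) * step
```

Under this scheme the worst-case per-weight error is half a quantization step, which shrinks exponentially with the bit depth; this is consistent with the observation that accuracy degrades sharply at 4 bits or fewer.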
Through the multiplication and accumulation of the optical architecture, the MNIST recognition results of the optical neural network based on the pruned BP model are shown in Figure 7b. For a test set of 1797 images, the overall recognition accuracy of the optical neural network is 93.49%.
This demonstration does not realize the training of parameters or the nonlinear activation operation on the optical structure, which remain two challenges for all-optical neural networks. Consequently, we will consider using nonlinear units or other optical materials to realize on-chip training and nonlinear activation, for example, using the transpose matrix operation of MRR crossbar arrays [39] to implement on-chip training of optical neural networks, and building an electro-optic hardware platform [40] to implement nonlinear activation functions. Notably, the optical-to-optical nonlinearity is realized by converting a small portion of the input optical signal into an analog electric signal, which modulates the intensity of the original optical signal with no reduction in processing speed. This activation function is reconfigurable via an electrical bias, which allows it to be programmed or trained to synthesize a variety of nonlinear responses. Furthermore, passive optical elements [41] achieve the backpropagation gradients required in training by using saturable absorption for the nonlinear units. We therefore expect to implement a fully pruned optical BP neural network in future work.
Future work is likely to extend to optical quantum neural networks, as many features of quantum optics can be directly mapped to neural networks [42], and technological advances driven by photonic quantum computing and the optoelectronic industry provide possible avenues for large-scale, high-bandwidth realizations of quantum optical neural networks. Programmable silicon photonic devices can simulate the quantum walk dynamics of relevant particles, with full control over all important parameters, including the Hamiltonian structure, evolution time, particle resolution, and exchange symmetry [43]. Removing redundant photonic devices in the universal unitary process by weight-elimination can facilitate the construction of large-scale, low-cost optical quantum neural networks.

Conclusions
In this paper, we demonstrate a three-layer optical neural network based on a pruned BP model. The model is pruned during training based on weight-elimination, and an optical neural network performs the matrix multiplication in inference. Comparing different pruned MRR banks, the prediction accuracy is best when six nodes are pruned in the hidden layer. Furthermore, results show that the prediction accuracy of the pruned model reaches 93.49% on the MNIST dataset. In terms of energy efficiency, pruning multiple MRR weight banks makes the photonic integrated circuits more streamlined and energy-efficient. Although the training of parameters and the nonlinear activation operation are currently not implemented on the optical structure, these problems are expected to be solved with the development of nonlinear units and further research on optical materials, such as MRR crossbar arrays [39], an electro-optic architecture for synthesizing optical-to-optical nonlinearities [40], and passive optical elements [41].
In summary, pruning different numbers of nodes has important guiding significance for the optical matrix components of all-optical neural networks. Significantly, this work is scalable to much more complex networks, is suitable for different optical devices, and offers a feasible path toward truly large-scale optical neural networks.
Author Contributions: Q.Z., Z.X. and D.H. designed the study, Q.Z. and Z.X. performed the research, analysed data, and were involved in writing the manuscript. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement:
Publicly available datasets were analyzed in this study. The MNIST database is the Modified National Institute of Standards and Technology database. These data can be accessed here: https://archive.ics.uci.edu/ml/machine-learning-databases/optdigits/.

Conflicts of Interest:
The authors declare no conflict of interest.