The mr-MDA: An Invariant to Shifting, Scaling, and Rotating Variance for 3D Object Recognition Using Diffractive Deep Neural Network

The diffractive deep neural network (D2NN) can efficiently accomplish 2D object recognition based on rapid optical manipulation. Moreover, the multiple-view D2NN array (MDA) possesses the obvious advantage of being able to effectively achieve 3D object classification. At present, 3D target recognition should be performed in a high-speed and dynamic way. It should be invariant to the typical shifting, scaling, and rotating variance of targets in relatively complicated circumstances, which remains a shortcoming of optical neural network architectures. In order to efficiently recognize 3D targets based on the developed D2NN, a more robust MDA (mr-MDA) is proposed in this paper. Through utilizing a new training strategy to tackle several random disturbances introduced into the optical neural network system, a trained mr-MDA model constructed by us was numerically verified, demonstrating that the training strategy is able to dynamically recognize 3D objects in a relatively stable way.


Introduction
Deep learning has become a prevalent machine learning method [1][2][3] in recent years. As demonstrated, a multi-layer artificial neural network imitates the human learning process and then uncovers the hidden target patterns from a large amount of data, which can be comparable or even superior to the human recognition ability. With the rapid progress in deep learning, various electronic neural networks (ENNs) currently play a crucial role in solving target classification and recognition problems. The application of ENNs has extended into many fields, including computer vision [4][5][6][7][8][9] and neural language processing [10][11][12][13]. To address increasingly complicated problems, the fundamental architecture of ENNs is becoming accumulatively complex. However, there are still some flaws in conventional ENNs, such as their dramatic energy consumption and limited computational speed, which lead to the poor coupling of micro-sensors. At present, the scale of integrated chips is mesoscopic; thus, their development is seriously constrained by Moore's law [14] and quantum tunneling. Therefore, it is crucial to find a way to address the problems above.
More recently, the diffractive deep neural network (D 2 NN), as a novel physical neural network with an all-optical deep learning framework, has been utilized to realize target classification and is practically constructed using several common methods, such as 3Dprinting [15], femtosecond photon direct writing [16], and the wet etching technique. Due to the isomorphism between the forward propagation of dense neural networks and the Huygens-Fresnel principle, the D 2 NN can be implemented by sequential network layers connected by light diffraction to achieve the complete or local connection between neurons of each diffractive layer. This has the benefits of optical computing, such as theoretically operating at light speed with an extremely low consumption, and a parallel signal processing ability for optical computing [17][18][19][20]. The D 2 NN is evolving fast and is applied in many fields, such as Fourier-space D 2 NN [21], class-specific differential detection [22], multi-directional beam steering [23], and other applications [24][25][26][27][28][29][30]. Obviously, the D 2 NN has the potential to adaptively control the phase distribution of diffraction layers according to the complex amplitude distribution of diffractive beams to uncover the potential features of data and then learn the mapping relationship between the input light field and the actual label of the input object. Specifically, the D 2 NN has demonstrated excellent performance on many datasets, such as the Mixed National Institute of Standards and Technology database (MINIST) [15], the Fashion-MNIST dataset [21], and the CIFAR-10 dataset [22]. These studies confirm the possibility of using the optical neural network to achieve pattern recognition and conduct 2D object classification based on the D 2 NNs.
In the latest research, the optical neural network employing different coding methods can be combined with ensemble learning, and it enhances statistical inference and generalization capabilities [31]. Furthermore, the D 2 NN can take advantage of ensemble learning to learn light field patterns or images from multiple viewing simultaneously to recognize a 3D object. As shown, a kind of multiple-viewing D 2 NNs array (MDA) proposed by our research group can be efficiently used to achieve 3D object recognition under turbulence disturbance based on the Model-net dataset [32]. Therefore, the 3D targets can still be successfully recognized by the MDA even when the light field patterns or images based on multiple viewing are lost.
However, the performance of the MDA is poor when the object is translated, scaled, or shifted. The D 2 NN was demonstrated to be invariant to the shifting, scaling, or rotating variance of 2D objects [33]. However, unfortunately, it does not pay attention to factors such as shifting, scaling, or rotating in actual 3D objects in the real world. Therefore, it may be vulnerable to perturbation in actual 3D object recognition due to its sensitivity to the spatial information of targets. To improve the D 2 NN's applicability to the real 3D world in terms of conducting complex object recognition, these object transformations should be considered as they frequently occur in various classification and recognition problems.

Method
To address the aforementioned problem, two measures were adopted in this paper: (1) A more robust MDA (mr-MDA) is proposed based on the plain MDA to put the base learner (BL) array on a surface instead of in a single line. The mr-MDA is composed of BLs from multiple viewing, and the bagging algorithm is used to make the final prediction, which comprehensively considers decisions obtained by all BLs. Therefore, the prediction error rate of the mr-MDA is theoretically decreased. In other words, the mr-MDA can receive more detailed information about the detected object so that it can still uncover the hidden abstracted clues to achieve classification under random disturbance; (2) a new training strategy is utilized to formulate the target transformations through random variables and then they are introduced as a disturbance to train the phases of diffractive layers, which thus enhances its capability to adapt to the uncertain transformation of input objects. This method was frequently employed in early ENNs to improve the robustness of the model, e.g., data enhancement technology [34], and to improve model generalization [35,36].

Diffraction Theory
Analysis of information forward-propagation is critical in conventional ENNs. The massive neurons of each layer are connected by forward-propagation based on electron flow. In the D 2 NN, the forward-propagation is based on Huygens' principle, which is shown in Figure 1. There are many ways to achieve information forward-propagation corresponding to a fully connective neural network, a convolutional neural network, and a recurrent neural network, etc. In the D 2 NN, the neurons of each layers are fully connected by light diffraction, as shown in Figure 2.
Analysis of information forward-propagation is critical in conventional ENNs. The massive neurons of each layer are connected by forward-propagation based on electron flow. In the D 2 NN, the forward-propagation is based on Huygens' principle, which is shown in Figure 1. There are many ways to achieve information forward-propagation corresponding to a fully connective neural network, a convolutional neural network, and a recurrent neural network, etc. In the D 2 NN, the neurons of each layers are fully connected by light diffraction, as shown in Figure 2.  According to Rayleigh-Sommerfeld diffraction theory, each neuron in the diffractive layer is considered as the source of a secondary wave, and it follows the following optical model [15]: where L represents the L-th layer of the network, i denotes the neuron located at (xi, yi, zi) in the diffractive layer, r 2 = (x − xi) 2 + (y − yi) 2 + (z − zi) 2 and j 2 = −1, and λ is the light wavelength. The amplitude and relative phase of this secondary wave are determined by the input wave of neurons and its transmission coefficient t. Specifically, the transmission coefficient can be expressed as where the α and ϕ represent the amplitude modulation and phase modulation of the input light wave by one neuron, respectively. In the design process of the phase-modulated diffractive layer, we can use isotropic materials to make it. By setting the height of each point Analysis of information forward-propagation is critical in conventional ENNs. The massive neurons of each layer are connected by forward-propagation based on electron flow. In the D 2 NN, the forward-propagation is based on Huygens' principle, which is shown in Figure 1. There are many ways to achieve information forward-propagation corresponding to a fully connective neural network, a convolutional neural network, and a recurrent neural network, etc. In the D 2 NN, the neurons of each layers are fully connected by light diffraction, as shown in Figure 2.  According to Rayleigh-Sommerfeld diffraction theory, each neuron in the diffractive layer is considered as the source of a secondary wave, and it follows the following optical model [15]: where L represents the L-th layer of the network, i denotes the neuron located at (xi, yi, zi) in the diffractive layer, r 2 = (x − xi) 2 + (y − yi) 2 + (z − zi) 2 and j 2 = −1, and λ is the light wavelength. The amplitude and relative phase of this secondary wave are determined by the input wave of neurons and its transmission coefficient t. Specifically, the transmission coefficient can be expressed as where the α and ϕ represent the amplitude modulation and phase modulation of the input light wave by one neuron, respectively. In the design process of the phase-modulated diffractive layer, we can use isotropic materials to make it. By setting the height of each point According to Rayleigh-Sommerfeld diffraction theory, each neuron in the diffractive layer is considered as the source of a secondary wave, and it follows the following optical model [15]: where L represents the L-th layer of the network, i denotes the neuron located at (x i , y i , z i ) in the diffractive layer, r 2 = (x − x i ) 2 + (y − y i ) 2 + (z − z i ) 2 and j 2 = −1, and λ is the light wavelength. The amplitude and relative phase of this secondary wave are determined by the input wave of neurons and its transmission coefficient t. Specifically, the transmission coefficient can be expressed as where the α and φ represent the amplitude modulation and phase modulation of the input light wave by one neuron, respectively. In the design process of the phase-modulated diffractive layer, we can use isotropic materials to make it. By setting the height of each point on the surface, the phase modulation of every neuron can be controlled. Therefore, one can write the output function of the i-th neuron in L-th layer as follows: Considering the information propagation of monochromatic light from a 2D plane, the Raleigh-Sommerfeld diffraction solution can be expressed as a convolution integral [37]: h(x, y) = e jkz jλz exp jk 2z where U is the light field. The form of the Raleigh-Sommerfeld impulse response can be written as Equation (5) and k = 2π/λ. To facilitate the calculation, the Fresnel approximation is used; in this situation, the impulse response can be written as Equation (6). According to Equation (4) and the Fourier convolution theorem, it is equivalently written as where represents the fast Fourier transform, U L is the plane wave in the L-th diffractive layer, −1 represents the reverse fast Fourier transform, and the H(f x ,f y ) is defined as the Raleigh-Sommerfeld transfer function given by Equation (8). The f X and f Y are independent frequency variables associated with x and y. Two input light fields are converted by the above transformation to obtain the corresponding diffractive light field in Figure 3.
n i L (x, y, z) = w i L (x, y, z) ⋅ t i L (x, y, z) ⋅ ∑ n k L−1 (x, y, z) k Considering the information propagation of monochromatic light from a 2D plane, the Raleigh-Sommerfeld diffraction solution can be expressed as a convolution integral [37]: h(x, y) = z jλ ⋅ exp(jkr) h(x, y) = e jkz jλz exp ( jk 2z (x 2 + y 2 )), where U is the light field. The form of the Raleigh-Sommerfeld impulse response can be written as Equation (5) and k = 2π/λ. To facilitate the calculation, the Fresnel approximation is used; in this situation, the impulse response can be written as Equation (6). According to Equation (4) and the Fourier convolution theorem, it is equivalently written as where  represents the fast Fourier transform, UL is the plane wave in the L-th diffractive layer, 1 −  represents the reverse fast Fourier transform, and the H(fx,fy) is defined as the Raleigh-Sommerfeld transfer function given by Equation (8). The fX and fY are independent frequency variables associated with x and y. Two input light fields are converted by the above transformation to obtain the corresponding diffractive light field in Figure 3.

Basic Architecture
The basic structure of the mr-MDA was remodified based on the original MDA and it can better adapt to the random shifting, scaling, or rotating transformations. Specifically, Figure 4 shows the configuration of the BLs in this paper, which was designed as a grid with two rows and three columns on the surface of a sphere. The red point represents a BL. The angle of adjacent BLs in the same row is demonstrated by θ v , and the angle difference of adjacent BLs in the same column is marked as θ h , which were designed to be at 30 • and 15 • , respectively. Compared with the original MDA, the additional BLs were placed in a distinct row, instead of all learners being in the same row, enabling them to receive more spatial information about the target in different viewings and then increase the independence between BLs. Taking advantage of this basic architecture, the hypothesis space of the mr-MDA is significantly extended, thus making it possible to adapt to targets with random disturbance and increasing the probability of finding an optimal answer. grid with two rows and three columns on the surface of a sphere. The red point represents a BL. The angle of adjacent BLs in the same row is demonstrated by θv, and the angle difference of adjacent BLs in the same column is marked as θh, which were designed to be at 30° and 15°, respectively. Compared with the original MDA, the additional BLs were placed in a distinct row, instead of all learners being in the same row, enabling them to receive more spatial information about the target in different viewings and then increase the independence between BLs. Taking advantage of this basic architecture, the hypothesis space of the mr-MDA is significantly extended, thus making it possible to adapt to targets with random disturbance and increasing the probability of finding an optimal answer. The detailed inference process for one sample of the mr-MDA is demonstrated by the relationships in For a more convenient explanation, the subscripts of i and j demonstrate the BL located in the ith row and the jth column in the mr-MDA. In Equation (9), wi,j R denotes the weight of the BLs, NS denotes the number of samples in the validation dataset, and I[•] represents the indicator function, which is defined as 1 or 0, respectively, when the condition in the brackets is true or false. In addition, xk is the kth sample in the validation, and its true label is marked as yk. Fi,j(•), demonstrating a mapping function of the corresponding BLs, which propagates the light field information of the input image through several diffractive layers to consequently shape a specific spot image, leading to the prediction of xk. The wi,j is obtained based on the performance of the BLs during the validation stage and then represents the reliability of one BL; its inference can be described as shown in Equation (9). Consequently, the prediction of the mr-MDA is mainly considered by the most dependable BL, and thus is affected by the rest of the BLs. The M indicates the number of categories in a dataset, which is set as 10. The αi, j R 1M denotes a vector of the prediction result of the BL for one sample, and its ith column value represents the correct rate at which the target is classified as the ith category. The s = (s (0) , s (1) , … s (M−1) )R 1M is an integrated The detailed inference process for one sample of the mr-MDA is demonstrated by the relationships in For a more convenient explanation, the subscripts of i and j demonstrate the BL located in the ith row and the jth column in the mr-MDA. In Equation (9), w i,j ∈R denotes the weight of the BLs, NS denotes the number of samples in the validation dataset, and I[·] represents the indicator function, which is defined as 1 or 0, respectively, when the condition in the brackets is true or false. In addition, x k is the kth sample in the validation, and its true label is marked as y k . F i,j (·), demonstrating a mapping function of the corresponding BLs, which propagates the light field information of the input image through several diffractive layers to consequently shape a specific spot image, leading to the prediction of x k . The w i,j is obtained based on the performance of the BLs during the validation stage and then represents the reliability of one BL; its inference can be described as shown in Equation (9). Consequently, the prediction of the mr-MDA is mainly considered by the most dependable BL, and thus is affected by the rest of the BLs. The M indicates the number of categories in a dataset, which is set as 10. The α i, j ∈R 1×M denotes a vector of the prediction result of the BL for one sample, and its ith column value represents the correct rate at which the target is classified as the ith category. The s = (s (0) , s (1) , . . . s (M−1) )∈R 1×M is an integrated prediction output vector (IPOV) of the mr-MDA, and the significance of each column value is the same as each column of α i, j . In Equation (11), the pr denotes the prediction result of the mr-MDA, which is the column number of the highest value in the IPOV.
In summary, Equation (10) guides the concrete realization of the weighted voting algorithm, and the mr-MDA draws a conclusion based on the collective wisdom, which helps the mr-MDA to accurately achieve classification despite the random disturbance already introduced into a 3D target. Additionally, Equation (10) conveys a simple linear operation, and its time complexity is only determined by the scale of the BL array, such as approximately O(m × n), which can be negligible during an entire calculation. Hence, it will exhibit an inappreciable impact on the costing time of the entire optical network system, which enables the entire system to perform an ultra-high-speed operation despite the participation of electronic devices.
As shown in Figure 5, a multiple BL of the mr-MDA will accept light fields from the different viewing angles of a 3D target. These light fields present more information, such as the depth of the target, than that of a single light field, which is the principal reason as to why the mr-MDA achieves higher accuracy even when the 3D target randomly occurs based on a shifting, rotating or scaling operation. The predicted light spot distributions are obtained by all BLs and then are transmitted to the computer or electronic circuit. Furthermore, all predicted light fields are integrated as an ensemble light field to obtain the IPOV according to the weighted voting method: an implementation of the bagging algorithm. To facilitate the description, each light spot of the ensemble light field is numbered as shown in this picture, which represents a specific category of the target. Finally, every column value of the IPOV represents the confidence level to which the target belongs in the corresponding category. The category with the highest value is the final prediction result of the mr-MDA, which is demonstrated by Equation (11). The multiple BLs are placed in different positions on the sphere with a customized distribution in order to receive the required light field images from different viewing angles. Determinations will be drawn by all BLs, which are integrated into an ensemble light field, and the final prediction is inferred by the mr-MDA by choosing the brightest light spot.
algorithm, and the mr-MDA draws a conclusion based on the collective wisdom, which helps the mr-MDA to accurately achieve classification despite the random disturbance already introduced into a 3D target. Additionally, Equation (10) conveys a simple linear operation, and its time complexity is only determined by the scale of the BL array, such as approximately O(mn), which can be negligible during an entire calculation. Hence, it will exhibit an inappreciable impact on the costing time of the entire optical network system, which enables the entire system to perform an ultra-high-speed operation despite the participation of electronic devices.
As shown in Figure 5, a multiple BL of the mr-MDA will accept light fields from the different viewing angles of a 3D target. These light fields present more information, such as the depth of the target, than that of a single light field, which is the principal reason as to why the mr-MDA achieves higher accuracy even when the 3D target randomly occurs based on a shifting, rotating or scaling operation. The predicted light spot distributions are obtained by all BLs and then are transmitted to the computer or electronic circuit. Furthermore, all predicted light fields are integrated as an ensemble light field to obtain the IPOV according to the weighted voting method: an implementation of the bagging algorithm. To facilitate the description, each light spot of the ensemble light field is numbered as shown in this picture, which represents a specific category of the target. Finally, every column value of the IPOV represents the confidence level to which the target belongs in the corresponding category. The category with the highest value is the final prediction result of the mr-MDA, which is demonstrated by Equation (11). The multiple BLs are placed in different positions on the sphere with a customized distribution in order to receive the required light field images from different viewing angles. Determinations will be drawn by all BLs, which are integrated into an ensemble light field, and the final prediction is inferred by the mr-MDA by choosing the brightest light spot.  As shown in Figure 6, an essential BL of the mr-MDA is designed as a series of diffractive layers and some existing diffractive residual blocks. Figure 6a demonstrates the key structure of a BL. In our work, the distance between each diffractive layer and between the last diffractive layer and the output layer was set to 22.72λ and 68.18λ, respectively, where λ denotes the light wavelength. Every network layer of the BL was set in a 72.72λ × 72.72λ format, with 200 × 200 layered neurons in a light field scenario and a 2.2 µm wavelength so it can be integrated into micro-detectors. There are three diffractive layers in the front of residual blocks, two diffractive layers in the back, and a photodetector as an output layer. For enhancing the capacity of the mr-MDA, there are two residual blocks and each of them is made of three diffractive layers. The BLs of the mr-MDA were trained by calculating the cross-entropy loss of the forward propagation, which was simulated. Each layer can be fabricated as a physical entity. The output of each BL is transmitted to the computer or electronic circuit to participate in weighted voting to form a final output as a prediction.
into micro-detectors. There are three diffractive layers in the front of residual blocks, two diffractive layers in the back, and a photodetector as an output layer. For enhancing the capacity of the mr-MDA, there are two residual blocks and each of them is made of three diffractive layers. The BLs of the mr-MDA were trained by calculating the cross-entropy loss of the forward propagation, which was simulated. Each layer can be fabricated as a physical entity. The output of each BL is transmitted to the computer or electronic circuit to participate in weighted voting to form a final output as a prediction.
where I = (I (0) , I (1) , … I (M−1) )R 1M denotes a light intensity vector of the spot in the output layer of the BL, and I (i) is the light intensity of the ith light spot. In Equation (13), L is the cross entropy, which statistically indicates the difference between the prediction of the mr-MDA and the true target label. The α, which is one of αi,j and a normalized vector of the light intensity vector, can be calculated using a soft-max function, as indicated in Equation (12). The β (i) denotes the i th column of the true one-hot label, and its value can be selected as 1 or 0. According to previous studies, the original deep neural network has several defects, such as a large number of model parameters, a long training time, and the risk of overfitting. As demonstrated, the residual deep neural network [38] can be utilized to help the network to overcome the above shortcomings in the case of the same layer number. Therefore, it is possible to build a deeper ENN that can better uncover potential patterns or images from more complex samples. It was verified that the scheme is realizable in an optical neural network and has excellent performance compared with the architecture of the plain D 2 NN in a recent study [39]. Consequently, to enhance the expression capability For a given 3D object, every trained BL can be used to obtain a customized optical spot pattern based on a specific single viewing light field. All BLs adopt cross entropy as a loss function for calculating the output loss and further employ the Adam algorithm and error backpropagation to update the phase of the diffractive layer. The loss function for one training sample is demonstrated by the expressions in where I = (I (0) , I (1) , . . . I (M−1) )∈R 1×M denotes a light intensity vector of the spot in the output layer of the BL, and I (i) is the light intensity of the ith light spot. In Equation (13), L is the cross entropy, which statistically indicates the difference between the prediction of the mr-MDA and the true target label. The α, which is one of α i,j and a normalized vector of the light intensity vector, can be calculated using a soft-max function, as indicated in Equation (12). The β (i) denotes the i th column of the true one-hot label, and its value can be selected as 1 or 0. According to previous studies, the original deep neural network has several defects, such as a large number of model parameters, a long training time, and the risk of overfitting. As demonstrated, the residual deep neural network [38] can be utilized to help the network to overcome the above shortcomings in the case of the same layer number. Therefore, it is possible to build a deeper ENN that can better uncover potential patterns or images from more complex samples. It was verified that the scheme is realizable in an optical neural network and has excellent performance compared with the architecture of the plain D 2 NN in a recent study [39]. Consequently, to enhance the expression capability of the mr-MDA to cope with shifting, scaling, or rotating targets and to further avoid the problem of vanishing gradients and overfitting in model training, each unit of the BLs should contain diffractive residual blocks. Figure 6b shows the fundamental construction of the diffractive residual block for base learning. Light splitting and focusing can be achieved in the residual block by employing a beam splitter and reflecting mirror.

Experimental Preparation
To better explain the transformations mentioned, some imaging results received by the BLs located in distinct perspectives after introducing random disturbance into one of samples from the dataset are demonstrated in Figure 7. The proposed training strategy introduces random shifting, scaling, and rotating variance into the original targets, and the degree of random object transformation is respectively represented by independently and uniformly distributed random variables. The parameters of ∆, K, and Θ are all three-dimensional vectors used to describe spatial information after undergoing some uncertain beam transformations. ∆ = (∆ x , ∆ y , ∆ z ) indicates that the input object presents some shifts of ∆ x , ∆ y , and ∆ z in the x, y, and z directions, respectively. Each of their column values is defined as a uniformly distributed variable, for instance, ∆ x~U (−δ, δ), ∆ y~U (−δ, δ), ∆ z~U (−δ, δ). The hyperparameter δ represents the displacement level of the target, and δ ∈ [0, 0.7] in this letter. Therefore, the increase in the value of δ will result in an increased possibility and a variance range of the target displacement, which will also introduce a greater disturbance into the model. To appropriately quantify the influence of ∆ on the image formed by a single viewing, the significance of ∆ x , ∆ y , and ∆ z is carefully described as shown in Figure 8. Both the scaling and rotating parameters are individually defined as K= (k x , k y , k z ) and Θ= (Θ x , Θ y , Θ z ), which means that the original targets are scaled and rotated along the x and y and z directions, respectively. In order to maintain the proper object proportion, the K is further designed into k x = k y = k z~U (1 − ε, 1 + ε) and ε ∈ [0, 0.7] in this letter. To this end, each column of Θ individually conforms to a uniform distribution of U (−θ, θ) and θ ∈ [0 • , 60 • ] in this manuscript. For the sake of convenience, a parameter of view i,j (v i,j ) was used to represent the BL located in the ith row and the jth column in our BLs' array, and they will receive the input light field image of the target in different views.

Experimental Preparation
To better explain the transformations mentioned, some imaging results received by the BLs located in distinct perspectives after introducing random disturbance into one of samples from the dataset are demonstrated in Figure 7. The proposed training strategy introduces random shifting, scaling, and rotating variance into the original targets, and the degree of random object transformation is respectively represented by independently and uniformly distributed random variables. The parameters of Δ, K, and Θ are all threedimensional vectors used to describe spatial information after undergoing some uncertain beam transformations. Δ = (Δx, Δy, Δz) indicates that the input object presents some shifts of Δx, Δy, and Δz in the x, y, and z directions, respectively. Each of their column values is defined as a uniformly distributed variable, for instance, Δx ~ U (−δ, δ), Δy ~ U (−δ, δ), Δz ~ U (−δ, δ). The hyperparameter δ represents the displacement level of the target, and δ  [0, 0.7] in this letter. Therefore, the increase in the value of δ will result in an increased possibility and a variance range of the target displacement, which will also introduce a greater disturbance into the model. To appropriately quantify the influence of Δ on the image formed by a single viewing, the significance of Δx, Δy, and Δz is carefully described as shown in Figure 8. Both the scaling and rotating parameters are individually defined as K= (kx, ky, kz) and Θ= (Θx, Θy, Θz), which means that the original targets are scaled and rotated along the x and y and z directions, respectively. In order to maintain the proper object proportion, the K is further designed into kx = ky = kz~U (1 − ε, 1+ε) and ε  [0, 0.7] in this letter. To this end, each column of Θ individually conforms to a uniform distribution of U (−θ, θ ) and θ  [0°, 60°] in this manuscript. For the sake of convenience, a parameter of viewi,j(vi,j) was used to represent the BL located in the i th row and the j th column in our BLs' array, and they will receive the input light field image of the target in different views.  For the given hyperparameters of δ, ε, and θ, the range of a random disturbance introduced into the original target is controllable. Therefore, we can adjust their values to illustrate the impact of the object's shifting, scaling, and rotating variance on the prediction results of the mr-MDA mode. Furthermore, by assigning different values to the variables on both the training dataset and the blind test data, we demonstrate the prospective performance of the mr-MDA using a new training strategy.

Results and Discussion
The performance comparison of the mr-MDA, the BLs, and the MDA is demonstrated in Figure 9a-c. It was verified that the architecture of the mr-MDA is more capable of adapting the 3D target, which already has random disturbance introduced, than the MDA. This is significant progress compared with a single viewing BL. Specifically, the classification accuracy of the mr-MDA at all sampling points is superior as compared to the other two models mentioned. According to the experiments, it can be seen that the test accuracy of the mr-MDA is higher than that of the BL by ~30% and the MDA by ~15% under the situation of target shifting, by ~30% and ~10% under target scaling, and by ~15% and ~10% under target rotating. The results mentioned above were obtained based on strong interference (more details are shown in Figure 9). In addition, as the values of the variables increase, the recognition capabilities of the single viewing BL, MDA, and mr-MDA decline significantly. However, the performance of the mr-MDA declines more slowly owing to the mr-MDA receiving more target spatial information. It should be noted that the mr-MDA demonstrates a larger hypothesis space in statistics, which is sufficient to uncover more complex patterns. Moreover, concrete data as regards specific interference are shown in Table 1. The decision making combined with multiple perspectives is more accurate than only relying on one single perspective, and this phenomenon becomes more obvious with the increase in the number of perspectives when interference is introduced. However, this difference is not apparent with no interference. For the given hyperparameters of δ, ε, and θ, the range of a random disturbance introduced into the original target is controllable. Therefore, we can adjust their values to illustrate the impact of the object's shifting, scaling, and rotating variance on the prediction results of the mr-MDA mode. Furthermore, by assigning different values to the variables on both the training dataset and the blind test data, we demonstrate the prospective performance of the mr-MDA using a new training strategy.

Results and Discussion
The performance comparison of the mr-MDA, the BLs, and the MDA is demonstrated in Figure 9a-c. It was verified that the architecture of the mr-MDA is more capable of adapting the 3D target, which already has random disturbance introduced, than the MDA. This is significant progress compared with a single viewing BL. Specifically, the classification accuracy of the mr-MDA at all sampling points is superior as compared to the other two models mentioned. According to the experiments, it can be seen that the test accuracy of the mr-MDA is higher than that of the BL by~30% and the MDA by~15% under the situation of target shifting, by~30% and~10% under target scaling, and by~15% and 10% under target rotating. The results mentioned above were obtained based on strong interference (more details are shown in Figure 9). In addition, as the values of the variables increase, the recognition capabilities of the single viewing BL, MDA, and mr-MDA decline significantly. However, the performance of the mr-MDA declines more slowly owing to the mr-MDA receiving more target spatial information. It should be noted that the mr-MDA demonstrates a larger hypothesis space in statistics, which is sufficient to uncover more complex patterns. Moreover, concrete data as regards specific interference are shown in Table 1. The decision making combined with multiple perspectives is more accurate than only relying on one single perspective, and this phenomenon becomes more obvious with the increase in the number of perspectives when interference is introduced. However, this difference is not apparent with no interference.  The phase diagram in Figure 9d-g demonstrates that after introducing disturbance into the training dataset, the phase of the corresponding diffractive layer also changed significantly compared with that of the standard phase. Specifically, the spiral phase shown in Figure 9g is more evident and adjacent to the center of the figure. In contrast, the other spiral phases in Figure 9d-f are distinct, owing to the diffraction layer of the mr-MDA taking random disturbance into account in the error propagation of the training phase and adaptively resisting it. Furthermore, the mr-MDA is driven based on ensemble learning in the D 2 NN, so its performance is closely related to the independence between the BLs, which leads to a superior performance as compared to the plain D 2 NN and MDA as random disturbance increases.
The detailed inference of the mr-MDA is the same as that of the MDA; we present two of the predictions in Figure 10. Every trained BL individually outputs a specific spot pattern, and some are in line with the expectations, but others are wrong. According to the BL weight, the mr-MDA draws a conclusion based on collective wisdom using a weighted voting algorithm. The brightest light spot exists at the desired position (as shown by the white arrow) in the subfigures with a red frame, which indicates a correct prediction result. In the sample on the left in Figure 10, the mr-MDA achieves a correct prediction despite two BLs already making the wrong decision, because each BL is given the appropriate weight during training; thus, it makes the correct decision even when more than half of the BLs make wrong predictions, as shown on the right in Figure 10. In addition, Figure 10 also shows that the light distribution of the BL is significantly different from the light distribution of the mr-MDA, and the latter is only prominent at one certain point, which does not interfere with the final decision.  The phase diagram in Figure 9d-g demonstrates that after introducing disturbance into the training dataset, the phase of the corresponding diffractive layer also changed significantly compared with that of the standard phase. Specifically, the spiral phase shown in Figure 9g is more evident and adjacent to the center of the figure. In contrast, the other spiral phases in Figure 9d-f are distinct, owing to the diffraction layer of the mr-MDA taking random disturbance into account in the error propagation of the training phase and adaptively resisting it. Furthermore, the mr-MDA is driven based on ensemble learning in the D 2 NN, so its performance is closely related to the independence between the BLs, which leads to a superior performance as compared to the plain D 2 NN and MDA as random disturbance increases.
The detailed inference of the mr-MDA is the same as that of the MDA; we present two of the predictions in Figure 10. Every trained BL individually outputs a specific spot pattern, and some are in line with the expectations, but others are wrong. According to the BL weight, the mr-MDA draws a conclusion based on collective wisdom using a weighted voting algorithm. The brightest light spot exists at the desired position (as shown by the white arrow) in the subfigures with a red frame, which indicates a correct prediction result. In the sample on the left in Figure 10, the mr-MDA achieves a correct prediction despite two BLs already making the wrong decision, because each BL is given the appropriate weight during training; thus, it makes the correct decision even when more than half of the BLs make wrong predictions, as shown on the right in Figure 10. In addition, Figure 10 also shows that the light distribution of the BL is significantly different from the light distribution of the mr-MDA, and the latter is only prominent at one certain point, which does not interfere with the final decision. In the end, the final spot pattern is obtained by weighted voting, which has the highest light intensity at the correct spot position. The weighted voting algorithm allows the mr-MDA to consider all results of BLs and then adaptively favor the more reliable ones, which explains why the BL making a wrong decision has a smaller weight but the final prediction is correct, as shown in Figure  10. This pattern of decision making not only improves the fault tolerance rate of the whole micro-system, but also enables the whole optical process to work in parallel.
Apparently, the modified architecture of the mr-MDA is more robust than that of the MDA. To further improve the robustness of the mr-MDA, we adopted a training strategy that defines both parameters of δtrain and δtest as the shifting variable in the training dataset and test dataset, both εtrain and εtest as the scaling variable in the training dataset and test dataset, and both θtrain and θtest as the rotating variable in the training dataset and test dataset. The mr-MDA has prior knowledge and is not sensitive to spatial information concerning shifting, scaling, or rotating while achieving 3D target classification. Figure 11 shows the excellent results obtained by employing the new training strategy. In the end, the final spot pattern is obtained by weighted voting, which has the highest light intensity at the correct spot position. The weighted voting algorithm allows the mr-MDA to consider all results of BLs and then adaptively favor the more reliable ones, which explains why the BL making a wrong decision has a smaller weight but the final prediction is correct, as shown in Figure 10. This pattern of decision making not only improves the fault tolerance rate of the whole micro-system, but also enables the whole optical process to work in parallel.
Apparently, the modified architecture of the mr-MDA is more robust than that of the MDA. To further improve the robustness of the mr-MDA, we adopted a training strategy that defines both parameters of δ train and δ test as the shifting variable in the training dataset and test dataset, both ε train and ε test as the scaling variable in the training dataset and test dataset, and both θ train and θ test as the rotating variable in the training dataset and test dataset. The mr-MDA has prior knowledge and is not sensitive to spatial information concerning shifting, scaling, or rotating while achieving 3D target classification. Figure 11 shows the excellent results obtained by employing the new training strategy.
The mr-MDA trained by the shifting, scaling, and rotating datasets presents several obvious advantages compared with that trained only by the standard dataset. As the disturbance on the test dataset increases, the performance of the mr-MDA using the normal training strategy declines sharply, while the performances after employing the new training strategy drop slowly or remain almost stable. Specifically, Figure 11a shows that a standard mr-MDA can achieve excellent blind test accuracy in a weak disturbance. However, it also obtains a small mean and high variance behavior over the entire experiment (e.g., the standard mr-MDA achieved a~85.9% test accuracy when δ test = 0, ε test = 0, and θ test = 0, but its final average test accuracy was only~43.1%,~71.2%, and~58.0%, and the variance was~5.7%,~1.3%, and~3.9%), which is not enough to conduct satisfactory 3D target classification after random disturbance has already been introduced. The mr-MDA trained with the standard dataset lacks prior knowledge of the 3D spatial information of targets and only learns the standard spatial information, such as the position, size, and direction of targets. In statistics, it is impossible to find an optimal solution to adapt to the uncertain object. Therefore, when interference is introduced into the training data, the mean and variance of the mr-MDA over the entire experiment are greatly improved (e.g., when δ train = 0.4, the average accuracy reached~57% and its variance was only~0.7%. When ε train = 0.5, the average accuracy reached~78.5% and its variance was~0.2%. When θ train = 30, the average accuracy reached~65.9% and its variance was 1.1%). Remarkably, excessive interference in the training dataset can reduce the prediction accuracy, but the variance still decreases (e.g., as shown in Figure 11, when δ train = 0.7, ε train = 0.7, and θ train = 60, the variance was approximately zero but their averages were~54.5% and~77.2% and~56.5%). The mr-MDA trained by the shifting, scaling, and rotating datasets presents several obvious advantages compared with that trained only by the standard dataset. As the disturbance on the test dataset increases, the performance of the mr-MDA using the normal training strategy declines sharply, while the performances after employing the new training strategy drop slowly or remain almost stable. Specifically, Figure 11a shows that a standard mr-MDA can achieve excellent blind test accuracy in a weak disturbance. However, it also obtains a small mean and high variance behavior over the entire experiment (e.g., the standard mr-MDA achieved a ~85.9% test accuracy when δtest = 0, εtest = 0, and θtest = 0, but its final average test accuracy was only ~43.1%, ~71.2%, and ~58.0%, and the variance was ~5.7%, ~1.3%, and ~3.9%), which is not enough to conduct satisfactory 3D target classification after random disturbance has already been introduced. The mr-MDA trained with the standard dataset lacks prior knowledge of the 3D spatial information of targets and only learns the standard spatial information, such as the position, size, and direction of targets. In statistics, it is impossible to find an optimal solution to adapt to the uncertain object. Therefore, when interference is introduced into the training data, the mean and variance of the mr-MDA over the entire experiment are greatly improved (e.g., when δtrain = 0.4, the average accuracy reached ~57% and its variance was only ~0.7%. When εtrain = 0.5, the average accuracy reached ~78.5% and its variance was ~ 0.2%. When θtrain = 30, the average accuracy reached ~65.9% and its variance was 1.1%). Remarkably, excessive interference in the training dataset can reduce the prediction accuracy, but the variance still decreases (e.g., as shown in Figure 11, when δtrain = 0.7, εtrain = 0.7, and θtrain = 60, the variance Figure 11. Comparison of the performances using different training strategies. The comparison between the inference accuracies of the mr-MDA resisting target shifting (left) (a), resisting target scaling (left) (b), and resisting target rotation (left) (c). The variance of each curve is in the left picture (right). The value corresponding to the red point in the right subgraphs represents the mean, and the distance from the red point to the upper and lower ends denotes the variance.
In summary, the proposed mr-MDA can effectively recognize 3D objects that are shifted, scaled, and rotated as a result of its basic architecture, which already enhances the independence between the BLs. Furthermore, it was verified that the proposed training strategy can appropriately increase the mean value of the final decision accuracy and decrease its variance, which plays an essential role in using the D 2 NN to stably recognize 3D objects. In this work, the mr-MDA was trained based on the Model-net dataset using Python 3.8.9 and the framework of Tensorflow 2.0, and we chose blender 2.79b to obtain a multi-viewing imaging light field of 3D targets. The training and testing of the mr-MDA were run on a PC with Intel Core i7-10700 CPU (2.90GHZ) and the GeForce RTX 2070 (NVIDIA). It generally took~20 min to train one BL and thus cost~375.12KB in memory consumption. Since the BLs were trained independently, each BL was trained using different distributed server nodes.

Conclusions
According to this experiment, it was concluded that the D 2 NN can indeed theoretically accomplish various complicated tasks. The original MDA model was remarkably improved and a new training strategy utilized to obviously expand the applications of the mr-MDA