Design of Cascaded Diffractive Optical Elements for Optical Beam Shaping and Image Classiﬁcation Using a Gradient Method

: We present a gradient method for designing cascaded diffractive optical elements (DOEs) consisting of several sequentially located phase DOEs. Using the unitarity property of the operator of light propagation through the cascaded DOE, we obtain explicit expressions for the derivatives of the error functional with respect to the phase functions of the cascaded DOE. We consider the application of the gradient method to the problem of focusing different incident beams to regions with different intensity distributions and to the problem of optical image classiﬁcation. The presented description of the gradient method treats the problems of designing cascaded DOEs for focusing laser radiation and for image classiﬁcation within a uniﬁed approach reducing the calculation of the derivatives of the error functionals to the same formula. We present examples of the calculation of single and cascaded DOEs for focusing different incident beams to different regions and for classifying handwritten digits, which demonstrate the high performance of the proposed method. The presented results may ﬁnd application in the design of diffractive neural networks and systems for focusing laser radiation.


Introduction
Nowadays, the design and investigation of diffractive optical elements (DOEs) are the subjects of active research [1][2][3][4][5][6][7].The main reasons for the interest in this research field are the compactness, manufacturability, and efficiency of using DOEs for solving a wide range of problems of transforming and focusing optical radiation.As a rule, the DOE design is carried out within the framework of the scalar diffraction theory.The problem of calculating a DOE belongs to the class of ill-posed inverse problems and consists in determining the shape of the "phase" diffractive microrelief, which ensures the formation of a light field with specified parameters (usually, with a required intensity distribution) in a given region of space.Since the height of the DOE microrelief is proportional to the phase function of the light field formed by the DOE, the problem of DOE design is usually considered as the problem of calculating a phase function, which ensures the generation of the required intensity distribution.For the calculation of the phase function, various iterative algorithms are traditionally used including the "classical" Gerchberg-Saxton algorithm, the error-reduction algorithm, and a wide range of their modifications [8][9][10][11][12][13][14].
In addition to single DOEs, the so-called cascaded DOEs are widely used, consisting of several sequentially located phase DOEs.Such DOEs possess a wider functionality and make it possible to solve more complex problems, e.g., the problem of focusing different incident beams (in particular, with different propagation directions or different wavelengths) into different regions [2,15,16].For the calculation of cascaded DOEs, iterative algorithms

Problem Statement
Let the complex amplitude of the "input" field w 0 (u 0 ) be defined in the input plane z = f 0 = 0, where u 0 = (u 0 , v 0 ) are the Cartesian coordinates in this plane.We will assume that the light field with the wavelength λ propagates from the plane z = 0 through a set of n phase DOEs located in the planes z = f 1 , . . ., z = f n (0 < f 1 < . . .< f n ), and finally comes to the output plane z = f n+1 > f n (Figure 1).
Let us denote by ϕ 1 (u 1 ), . . . ,ϕ n (u n ) the phase functions of the DOEs, where u j = u j , v j are the Cartesian coordinates in the planes z = f 1 , . . ., z = f n .We will assume that the propagation of light between the planes z = f i , i = 1, . . ., n + 1 is described by the Fresnel-Kirchhoff integral of the scalar diffraction theory.We will describe the propagation of light through a DOE in the thin optical element approximation as the multiplication of the complex amplitude of the incident beam by the complex transmission function of the DOE exp{iϕ m (u m )}, m = 1, . . .n.In this case, the propagation of light through a cascaded DOE is described by the following expressions:  In this case, the propagation of light through a cascaded DOE is described by the following expressions: where w m (u m ), m = 1, . . ., n are the complex amplitudes of the fields incident on DOEs located in the planes z = f m , and d m = f m − f m−1 are the distances between these planes.According to Equation (1), the calculation of the complex amplitude of the output field w n+1 (u n+1 ) is carried out recursively.For the following analysis, it will be convenient to consider Equation (1) as a representation of linear operators describing the propagation of light from the input plane z = f 0 to the planes z = f m , m = 1, . . ., n + 1.
Under the inverse problem, we will understand the problem of calculating the phase functions ϕ 1 (u 1 ), . . . ,ϕ n (u n ) from the condition of generating a light field with the required intensity distribution I(u n+1 ) in the output plane.Let us describe the error of generating the required distribution using an integral criterion where I n+1 (u n+1 ) = |w n+1 (u n+1 )| 2 is the intensity distribution generated by a cascaded DOE with phase functions ϕ 1 (u 1 ), . . . ,ϕ n (u n ), and D is a certain function representing the difference between the generated and required distributions at the current point.
In what follows, we will consider the inverse problem of designing a cascaded DOE as the problem of minimizing the functional of Equation ( 2), which we will refer to as the error functional:

Gradient Method for Calculating a Cascaded DOE
For solving the problem of Equation ( 3), we will utilize a gradient method.Let us consider the calculation of the derivative of the functional of Equation ( 3) with respect to the function ϕ m .Let us denote by the increment of the error functional caused by the increment ∆ϕ m of the function ϕ m .According to Equation (2), this increment has the form where ∆ m I n+1 (u n+1 ) and ∆ m w n+1 (u n+1 ) are the increments of the intensity and complex amplitude caused by the phase increment ∆ϕ m , the angle brackets denote the scalar product of functions, and the function F n+1 (u n+1 ) has the form For the following derivations, let us introduce the operator Pr f n+1 → f + m of the "backpropagation" of light from the output plane z = f n+1 to the plane z = f + m located immediately after the plane of the m-th DOE z = f m .In this plane, the complex amplitude of the field in the case of "direct" propagation reads as w m e iϕ m .Let us present formulas for calculating this operator by starting with the field F n+1 (u n+1 ) (see Figure 1).At m = n, the backpropagation operator corresponds to the Fresnel-Kirchhoff integral, in which the propagation distance d n+1 is taken with a negative sign: At m < n, the operator Pr ) is calculated recursively using the following expression: One can easily show that the operators of direct propagation and backpropagation of light through a set of phase DOEs are unitary and conserve the scalar product [18].Due to the scalar product conservation, the increment of the criterion of Equation ( 5) can be represented as Since Pr , where w m e iϕ m is the complex amplitude of the field immediately after the plane of the m-th DOE at direct propagation and Pr Let us transform Equation ( 9) to the form By substituting the increment ∆e iϕ m as its Taylor series expansion up to the linear term ∆e iϕ m = e iϕ m +i∆ϕ m − e iϕ m ≈ i∆ϕ m e iϕ m , let us write the main (linear) part of the increment of the functional (10) as According to Equation (11), the derivative of the functional has the form When solving the problem of minimizing the functional (3) using the gradient method, the calculation of the phase functions of the DOEs is carried out iteratively.Let us describe the calculations performed at each iteration of the method.Let ϕ k 1 (u 1 ), . . . ,ϕ k n (u n ) be the phase functions of the cascaded DOE obtained at the k-th iteration.Then, for the calculation of the next approximations of the phase functions, the following steps are performed: (1) Using Equation (1) describing the direct propagation of the field, complex amplitudes of the fields w m (u m )e iϕ m (u m ) in the planes z = f m , m = 1, . . ., n and in the output plane z = f n+1 are calculated.
(4) New approximations of the phases are found as where t k is the step of the gradient method.

Application of the Gradient Method to the Case of Several Incident Beams
The presented gradient method can be easily generalized to the problem, in which there are defined K > 1 different input distributions w 0,j (u 0 ), j = 1, . . ., K (different incident beams), and, for each input distribution, the cascaded DOE has to generate a corresponding output intensity distribution I j (u n+1 ).In this case, the following sum of functionals can be used as the error functional: where the functionals ε j (ϕ 1 , . . . ,ϕ n ) describe the difference of the intensity distributions I n+1,j (u n+1 ) generated for the input distributions w 0,j (u 0 ) from the required distributions I j (u n+1 ).Without loss of generality, we can assume that these functionals are defined by Equation (2).Since the derivatives of the sum of functionals (14) simply equal the sum of the derivatives of these functionals the calculation of the derivatives of the functional of Equation ( 14) is also reduced to Equation (12).The calculation of new approximations of the phase functions at each iteration is carried out using a formula similar to Equation (13).The investigation of the performance of the presented method in the problem of the design of cascaded DOEs generating different intensity distributions at different incident beams is presented below in Section 6.1.Let us note that the considered gradient method can also be easily generalized to the problem, in which the required intensity distributions I j (u n+1 ) are defined in different output planes located at different distances from the output DOE.

Application of the Gradient Method to the Image Classification Problem
Let us now consider the application of the developed gradient method to the design of cascaded DOEs performing optical image classification.Let the amplitude images of objects belonging to L different classes (for example, the images of handwritten digits) be generated in the input plane z = 0.The generated light field then propagates through the cascaded DOE and comes to the output plane z = f n+1 .Let L spatially separate regions G i corresponding to the images of different classes be defined in the output plane.At each input distribution (image), certain energy distribution E i , i = 1, . . ., L is generated in these regions, where E i corresponds to the integral of the generated intensity distribution over the region G i .In this case, the problem of designing a cascaded DOE for classifying images can be formulated as a problem of calculating such phase functions ϕ 1 (u 1 ), . . . ,ϕ n (u n ) of the cascaded DOE, so that for an "input signal" corresponding to an image of a certain j-th class, the maximum of the generated energies E i , i = 1, . . ., L is achieved in the corresponding region G j [3,5].
In the problems of calculating cascaded DOEs for image classification, approaches typical for the design of artificial neural networks are used [3,5,[19][20][21].In this case, for the design (training) of the cascaded DOE, a training data set is used, which contains a number of input distributions corresponding to the images of objects belonging to different classes.Due to the large size of the training set, usually, for performing a training step, a smaller set (batch) of distributions is randomly chosen from the whole set, for which the derivatives of the error functional are calculated.One can show that the expectations of the derivatives calculated over a batch are proportional to the derivatives calculated using the whole training set, which enables considering this approach as a stochastic gradient method.
The training of a cascaded DOE on a particular batch corresponds to the gradient method in the case of several incident beams.Indeed, the error functional in the case of training on a batch can be defined in the form of Equation ( 14), where K is the batch size, and the functionals ε j (ϕ 1 , . . . ,ϕ n ) represent the classification errors of different classes included in the batch.The difference of the image classification problem from the problem of generating different intensity distributions for different incident beams consists mainly in the form of the functionals ε j (ϕ 1 , . . . ,ϕ n ).In the following subsections, we will consider two error functionals used for the solution of the classification problem and will show that the calculation of the derivatives of these functionals is also reduced to the general Formula (12).

Quadratic Error Functional
Let w 0,j (u 0 ) be an input distribution corresponding to some image of the j-th class.The energy values in the regions G k of the output plane in this case have the form where χ k (u n+1 ) is the indicator function of the region G k .For recognizing (i.e., correctly classifying) the input image w 0,j (u 0 ), it is necessary for the energy E j in the corresponding region G j to be "large", with the energies in the rest of the regions being close to zero.Accordingly, the following quadratic functional can be utilized as the error functional for recognizing an input image of the j-th class: where δ k,j is the Kronecker delta and E max is the maximum possible energy value.As for the E max value, one can, for example, use the total energy of the input distribution Let us demonstrate that the calculation of the derivatives of the functional ε j (ϕ 1 , . . . ,ϕ n ) is very similar to the calculation of the derivatives of the "general" functional of Equation ( 2) considered in Section 3. Indeed, let ∆ m ε j (ϕ 1 , . . . ,ϕ n ) be the increment of the functional of Equation ( 17) caused by an increment ∆ϕ m of the function ϕ m .According to Equation (17), this increment has the form where Similarly to Equation ( 5), we obtained the increment of the functional ∆ m ε j (ϕ 1 , . . . ,ϕ n ) in the form of a scalar product.Thus, the derivatives of the functional ε j (ϕ 1 , . . . ,ϕ n ) are also defined by Equation (12), where the functions F m (u m ) are calculated through the backpropagation of the field of Equation ( 19).

Cross Entropy Functional
In the classification problems, the so-called softmax cross entropy is used as a criterion [19,20].In this case, the following error functional is used for recognizing an input distribution belonging to the j-th class w 0,j (u 0 ): where E k are the energies in the regions G k defined by Equation ( 16).Let us note that Equation ( 20) is close to zero when the energy in the required region G j is much greater than the energies in the other regions.
Let us consider the increment of the functional (20) caused by an increment ∆ϕ m of the function ϕ m .By carrying out transformations similar to those presented above, it is easy to obtain the increment of the functional as where As in the previous case, we obtained the increment of the functional ∆ m ε j (ϕ 1 , . . . ,ϕ n ) as a scalar product.Accordingly, the derivatives of the functional (20) are also defined by Equation (12), where the functions F m (u m ) are calculated through the backpropagation of the field of Equation (22).Thus, the calculation of the phase functions of the cascaded DOE in the problem of image classification consists of the following.For the current batch, the gradient of the functional ( 14) is calculated, where the calculation of the derivatives of the terms is carried out using Equations ( 12) and (19) or Equations ( 12) and ( 22) depending on the chosen criterion.After calculating the derivatives of the functional ( 14), the phase functions are corrected using a formula similar to Equation (13).Then, the next batch is considered and the process is repeated.The investigation of the performance of the proposed method in the problem of classifying handwritten digits is presented below in Section 6.2.

Numerical Examples of Cascaded DOE Design
In the previous two sections, we considered the application of the proposed gradient method to the problems of calculating cascaded DOEs for the generation of required intensity distributions (for several incident beams) and for optical image classification.In the present section, numerical examples illustrating the performance of the method in the indicated problems are presented.In Section 6.1, we discuss the design of cascaded DOEs for generating different intensity distributions at different angles of incidence of the input beam, and Section 6.2 is dedicated to the design of DOEs for classifying handwritten digits.

Design of Cascaded DOEs for Focusing Different Incident Beams to Different Regions
Let in the input plane of the cascaded DOE, four input distributions w 0,j (u 0 ), j = 1, 2, 3, 4 be defined, which correspond to Gaussian beams with the radius at the 1/e 2 level equal to 2σ = 2.3 mm and the wavelength λ = 532 nm, incident on this plane from different directions.Let the vectors defining the propagation directions of the beams w 0,1 (u 0 ) and w 0,2 (u 0 ) lie in the plane u 0 z and make angles ±θ = ±0.16• with the z axis, and the corresponding vectors of the beams w 0,3 (u 0 ) and w 0,4 (u 0 ) lie in the plane v 0 z and also make angles ±θ with the z axis.The complex amplitudes of these beams in the plane z = 0 have the form Let us consider the calculation of cascaded DOEs generating in the output plane z = 600 mm different uniform intensity distributions I j (u n+1 ), j = 1, 2, 3, 4 for the incident beams of Equation (23).The four output distributions are centered at the origin of coordinates and correspond to a circle with the diameter of 2.3 mm, contour of a square with the side of 2.3 mm, a cross consisting of two perpendicular segments with the length of 2.3 mm, and a "rotated cross" consisting of two diagonals of the square with the side of 2.3 mm (Figure 2).The thickness of the lines of the required output intensity distributions amounts to 0.2 mm.The calculation of the phase functions of the DOE was carried out using the gradient method described above.As the error functional, the sum of functionals ( 14) was used, where the functionals representing the difference between the required distributions and the ones generated at the input fields ( ) 0, 0 j w u were defined as ( ) 512 × 512 grids with the step of d = 18 µm (these parameters correspond to some of the available spatial light modulators, which can be used as DOEs).In this case, the side length of the square aperture of each DOE amounts to 9.216 mm.
Let us note that at the chosen parameters, the incident beams strongly overlap in the planes of the DOEs.For example, after the propagation to the plane z = 300 mm, the centers of the beams are displaced from the optical axis (the z axis) only by 300 tgθ ≈ 0.83 mm, which is significantly smaller than the radius of the beams.This overlap of the incident beams significantly complicates the problem of calculating the cascaded DOE.
The calculation of the phase functions of the DOE was carried out using the gradient method described above.As the error functional, the sum of functionals ( 14) was used, where the functionals ε j (ϕ 1 , . . . ,ϕ n ) representing the difference between the required distributions and the ones generated at the input fields w 0,j (u 0 ) were defined as At each iteration, the derivatives of the error functional were calculated, which, according to Equation ( 17), correspond to the sum of derivatives of the functionals ε j (ϕ 1 , . . . ,ϕ n ).The calculation of the derivatives of the functionals ε j (ϕ 1 , . . . ,ϕ n ) was carried out using Equation (12), where w m (u m ) = w m,j (u m ) is the complex amplitude of the field incident on the m-th DOE in the case of the direct propagation of the incident beam w 0,j (u 0 ), and the function F m (u m ) = F m,j (u m ) is calculated through the backpropagation of the field where w n+1,j (u n+1 ) is the complex amplitude of the field in the output plane.In the optimization, the calculation of the functions w m,j (u m ) and F m,j (u m ) featured in the expres- sions for the derivatives of the functionals was based on the numerical calculation of the Fresnel-Kirchhoff integrals using the fast Fourier transform routine.Figure 3 shows the calculated phase functions of one, two, and three DOEs.For the calculation of each example, 8000 iterations with an exponentially decreasing step were performed (such a number of iterations turned out to be sufficient for the convergence of the method).As initial values, phases equal to zero at the whole aperture were used.The calculation time on a standard PC (Intel Core i9 10920X CPU, 3.50 GHz) was from 30 min for the single DOE to approximately one hour for the cascade of three DOEs.
One can see that the calculated phase functions of the single DOE and of the first DOEs in the cascaded structures are close to zero (to the initial phase value) near the edges of the aperture.This is caused by the fact that the amplitude of the fields generated in the plane of the first DOE in the case of the input beams of Equation ( 23) is close to zero in the peripheral regions of the aperture.Since the derivatives of the error functional are close to zero in the regions with a small amplitude of the field, the phase functions changed only weakly in these regions and remained close to the initial zero value.
Figure 4 shows the calculated intensity distributions generated by the calculated single and cascaded DOEs at different incident beams of Equation (23).In order to characterize the quality of the generated distributions, let us use the energy efficiencies E f f j and rootmean-square errors δ j .The energy efficiencies describe the fraction of the energy E 0,j = w 0,j (u 0 ) 2 d 2 u 0 of the j-th incident beam, which arrives to the required region G j = u n+1 | I j (u n+1 ) = 0 .The root-mean-square errors  From Figure 4, it is evident that the quality of the generated distributions increases with an increase in the number of DOEs.In particular, for the single DOE, the required distributions are generated with extremely large root-mean-square errors (being close to or even exceeding 100%) and at relatively low energy efficiencies (less than 54%).For a cascaded structure containing three DOEs, the root-mean-square error significantly decreases (the maximum error, which corresponds to the distribution I 1 (u n+1 ), amounts to 9.8%), and the energy efficiency exceeds 87%.
Thus, the presented examples demonstrate the advantages of cascaded DOEs over single ones in the problem of generating different required intensity distributions for different incident beams and confirm the high performance of the proposed design method.

Design of Cascaded DOEs for Classifying Handwritten Digits
In this subsection, we will consider the design of DOEs for classifying handwritten digits from the MNIST database [25].Let us start by considering the case of a single DOE.In the calculations, the input images of the digits were defined on a 56 × 56 grid with the step of d = 18 µm.The phase function of the DOE was defined on a 512 × 512 grid with the same step.Let the DOE and the output plane be located at z = f 1 = 300 mm and z = f 2 = 600 mm, respectively.Let us note that at the design wavelength λ = 532 nm, the diffraction angle at a pixel of the input distribution amounts to ϕ = arcsin(λ/d) ≈ 1.7 • .In this case, the diffraction pattern from the pixel (with respect to the first minimum) at the distance f 1 = 300 mm roughly covers the DOE aperture.In this regard, the chosen parameters ensure the "connection" of each pixel of the input image with all the pixels (grid nodes), at which the phase function of the DOE is defined.
In accordance with the design method, in the output plane, 10 spatially separated square regions G j with the side length of 0.5 mm were defined, in which maximum energies for different input images of different digits have to be generated (see Figure 5).From Figure 4, it is evident that the quality of the generated distributions increases with an increase in the number of DOEs.In particular, for the single DOE, the required distributions are generated with extremely large root-mean-square errors (being close to or even exceeding 100%) and at relatively low energy efficiencies (less than 54%).For a cascaded structure containing three DOEs, the root-mean-square error significantly de-  roughly covers the DOE aperture.In this regard, the chosen parameters ensure the "connection" of each pixel of the input image with all the pixels (grid nodes), at which the phase function of the DOE is defined.

Cascade of Three DOEs
In accordance with the design method, in the output plane, 10 spatially separated square regions j G with the side length of 0.5 mm were defined, in which maximum en- ergies for different input images of different digits have to be generated (see Figure 5).In the calculation, a training set containing 60,000 images of digits from the MNIST database was used.The DOE was calculated using batch training, with each batch containing 60 randomly chosen digits.As the error functionals, the quadratic error (QE) functional of Equations ( 14) and ( 17) and the softmax cross entropy (SCE) functional of Equations ( 14) and ( 20 After training, "blind" testing of the performance of the calculated DOEs was performed using a test set consisting of 10,000 images not included in the training set.For each image from the test set, the generated intensity distribution was simulated, the energies (16) in the regions j G were calculated, and then the input digit was determined us- ing the maximum energy value.The testing results represented as confusion matrices and energy distribution matrices are represented in Figure 7.The element (i,j) of the confusion matrix contains the percentage of cases, in which an input image of the digit j was recognized as the digit i. Accordingly, the diagonal elements of these matrices contain the percentage of the correct classifications.Similarly, the element (i,j) of the energy distribution matrix contains the averaged energy (in percent) in the region i G at an input image of the digit j.The diagonal elements of this matrix correspond to mean energies (in percent) in the "correct" regions corresponding to each digit.the calculation, a training set containing 60,000 images of digits from the MNIST database was used.The DOE was calculated using batch training, with each batch containing 60 randomly chosen digits.As the error functionals, the quadratic error (QE) functional of Equations ( 14) and ( 17) and the softmax cross entropy (SCE) functional of Equations ( 14) and ( 20) were used.As the initial approximation for the DOE phase function, a random phase from the range [0, 2π) was chosen.In the DOE calculation, 10 epochs were performed, which takes approximately 7 min on a NVIDIA GTX 1070 8 Gb graphics card.Under an epoch, we understand the training of the DOE on 1000 batches containing all the images from the training set.The phase functions of the DOEs calculated using the QE and SCE criteria are shown in Figure 6.For the DOE calculated using the QE criterion [Figure 6a], the accuracy of the digit recognition varies from 93.9% for the digit "9" to 99.2% for the digit "1".The overall classification accuracy (i.e., the ratio of the number of correctly recognized digits to the total amount of digits in the test set) amounts to 97.2%.For the DOE calculated using the SCE criterion [Figure 6b], the accuracy varies from 91.9% for the digit "8" to 99.5% for the digit "0", and the overall accuracy equals 96.8%.Let us note that the achieved classification accuracy values are quite high for single DOEs.For the sake of comparison, the overall classification accuracies in Refs.[3,5,21] achieved using cascaded structures containing 5-10 DOEs vary from 91.8% to 93.4%.
As it was noted above, for the DOE calculated using the SCE criterion, the overall classification accuracy turned out to be 0.4% lower.At the same time, the energy distribution matrix for this DOE is better.Indeed, from the practical point of view, an important After training, "blind" testing of the performance of the calculated DOEs was performed using a test set consisting of 10,000 images not included in the training set.For each image from the test set, the generated intensity distribution was simulated, the energies (16) in the regions G j were calculated, and then the input digit was determined using the maximum energy value.The testing results represented as confusion matrices and energy distribution matrices are represented in Figure 7.The element (i,j) of the confusion matrix contains the percentage of cases, in which an input image of the digit j was recognized as the digit i. Accordingly, the diagonal elements of these matrices contain the percentage of the correct classifications.Similarly, the element (i,j) of the energy distribution matrix contains the averaged energy (in percent) in the region G i at an input image of the digit j.The diagonal elements of this matrix correspond to mean energies (in percent) in the "correct" regions corresponding to each digit.
For the DOE calculated using the QE criterion (Figure 6a), the accuracy of the digit recognition varies from 93.9% for the digit "9" to 99.2% for the digit "1".The overall classification accuracy (i.e., the ratio of the number of correctly recognized digits to the total amount of digits in the test set) amounts to 97.2%.For the DOE calculated using the SCE criterion (Figure 6b), the accuracy varies from 91.9% for the digit "8" to 99.5% for the digit "0", and the overall accuracy equals 96.8%.Let us note that the achieved classification accuracy values are quite high for single DOEs.For the sake of comparison, the overall classification accuracies in Refs.[3,5,21] achieved using cascaded structures containing 5-10 DOEs vary from 91.8% to 93.4%.As an example, Figure 8 shows a typical input image of the digit "3" and the corresponding energy distribution demonstrating a correct digit recognition.As it was noted above, for the DOE calculated using the SCE criterion, the overall classification accuracy turned out to be 0.4% lower.At the same time, the energy distribution matrix for this DOE is better.Indeed, from the practical point of view, an important parameter is the contrast value, which shows, how much the energy in the required region exceeds the energy values in the other regions.Let us define the contrast for the digit i as where I i,j , i, j = 0, . . ., 9 are the elements of the energy distribution matrix.For robust determination of the "true maxima", it is necessary for the contrast values γ min,i to exceed 0.1.According to the energy distribution matrix shown in Figure 7b and corresponding to the DOE calculated using the QE criterion, the minimum contrast is achieved for the digit "9" and amounts to γ min ≈ 0.11.For the energy distribution matrix of Figure 7d corresponding to the DOE calculated using the SCE criterion, the minimum contrast is also achieved for the digit "9" but is somewhat greater: γ min ≈ 0.17.
As an example, Figure 8 shows a typical input image of the digit "3" and the corresponding energy distribution demonstrating a correct digit recognition.As an example, Figure 8 shows a typical input image of the digit "3" and the corresponding energy distribution demonstrating a correct digit recognition.Then, using the QE and SCE criteria, we designed cascaded DOEs comprising two DOEs located in the planes z = f 1 = 300 mm and z = f 2 = 600 mm.The output plane was located at z = 900 mm.All the other parameters (discretization, wavelength, and aperture sizes) coincide with the parameters of the examples considered above.The phase functions of the cascaded DOEs calculated after 10 epochs are shown in Figure 9.
The confusion matrices and the energy distribution matrices for the designed cascaded DOEs are presented in Figure 10.As before, the DOE performance was evaluated on a test set containing 10,000 images not included in the training set.By comparing the confusion matrices for single and cascaded DOEs (Figures 7a,c and 10a,c), one can see an increase in the classification accuracy.The overall accuracy values for the cascaded DOEs calculated using the QE and SCE criteria amount to 98.0% and 97.6%, respectively.Thus, for the considered example, the increase in the classification accuracy achieved by using a cascaded structure containing two DOEs equals 0.8%.The energy distribution matrices for the cascaded DOEs (Figure 10b,d) are also improved.In particular, minimum contrast values for the cascaded DOEs, which are also achieved for the digit "9", amount to 0.19 and 0.31 for the QE and SCE criteria, respectively.These contrast values are more than 1.7 times greater than those for single DOEs.The confusion matrices and the energy distribution matrices for the designed cascaded DOEs are presented in Figure 10.As before, the DOE performance was evaluated on a test set containing 10,000 images not included in the training set.By comparing the confusion matrices for single and cascaded DOEs [Figures 7a,c and 10a,c], one can see an increase in the classification accuracy.The overall accuracy values for the cascaded DOEs calculated using the QE and SCE criteria amount to 98.0% and 97.6%, respectively.Thus, for the considered example, the increase in the classification accuracy achieved by using a cascaded structure containing two DOEs equals 0.8%.The energy distribution matrices for the cascaded DOEs [Figure 10b,d] are also improved.In particular, minimum contrast values for the cascaded DOEs, which are also achieved for the digit "9", amount to 0.19 and 0.31 for the QE and SCE criteria, respectively.These contrast values are more than 1.7 times greater than those for single DOEs.
Let us note that a further increase in the number of DOEs leads to only a marginal increase in the classification accuracy but enables improving the contrast values.In particular, for a cascaded structure consisting of three DOEs calculated using the SCE criterion, the minimum contrast amounts to 0.55, which is significantly greater than the value of 0.31 provided by the cascaded structure of two DOEs.
Another way to increase the DOE performance consists in increasing the number of the optimized parameters, which can be achieved by decreasing the step of the grid, at which the phase functions of the DOEs are defined.Let us note that a further increase in the number of DOEs leads to only a marginal increase in the classification accuracy but enables improving the contrast values.In particular, for a cascaded structure consisting of three DOEs calculated using the SCE criterion, the minimum contrast amounts to 0.55, which is significantly greater than the value of 0.31 provided by the cascaded structure of two DOEs.
Another way to increase the DOE performance consists in increasing the number of the optimized parameters, which can be achieved by decreasing the step of the grid, at which the phase functions of the DOEs are defined.For example, a single DOE with the step size d = 4 µm (and the rest of the parameters coinciding with those of the single DOE examples considered above) calculated using the QE criterion provides the overall accuracy of 97.9% and minimum contrast of 0.16, which is considerably better than in the case of a single DOE with the larger step size of 18 µm [see Figure 7a,b].It is worth noting that this result is comparable with the performance of the cascaded structure of two DOEs with the 18 µm step size [see Figure 10a,b].
From the practical point of view, it is important to discuss the misalignment issues, which will inevitably occur when implementing cascaded DOEs (DNNs).It is known that alignment errors smaller than the neuron (DOE pixel) size show a minor influence on the DNN performance [3,21].When the alignment error is just bigger than the neuron size, the classification accuracy can be drastically reduced.It should also be noted that the longitudinal misalignment usually influences the performance of a DNN much less than the lateral (transverse) one [21].
The cascaded DOEs studied in this work are no exception.In order to estimate the influence of DOE misalignment, as an example, let us consider the cascaded DOE comprising two DOEs and designed using the SCE criterion (Figure 9c,d).The simulation results demonstrate that when the first DOE is laterally displaced by the vectors ∆ = (∆u 1 , ∆v 1 ) = (±18, ±18) µm (in the case of a fixed position of the second DOE), the overall classification accuracy remains greater than 95% (i.e., the decrease in the overall accuracy does not exceed 3%).The minimum contrast in this case also stays acceptable and exceeds 0.12.At further increase of the lateral displacement, the accuracy and contrast decrease more significantly: for example, at the lateral displacement ∆ = (36, 36) µm, the overall accuracy and minimum contrast amount to 90.9% and 0.09, respectively.The lateral displacement of the second DOE influences the performance somewhat less, e.g., at the (36, 36) µm displacement, the overall accuracy equals 96.3%, whereas the minimum contrast is 0.13.Similar to the results presented in [21], the longitudinal misalignment is much less critical: for example, the displacement of each of the DOEs along the optical axis by 200 µm leads to a decrease in the overall efficiency not exceeding 0.1% at virtually the same contrast.From the practical point of view, it is important to discuss the misalignment issues, which will inevitably occur when implementing cascaded DOEs (DNNs).It is known that alignment errors smaller than the neuron (DOE pixel) size show a minor influence on the DNN performance [3,21].When the alignment error is just bigger than the neuron size, the classification accuracy can be drastically reduced.It should also be noted that the longitudinal misalignment usually influences the performance of a DNN much less than the lateral (transverse) one [21].
The cascaded DOEs studied in this work are no exception.In order to estimate the influence of DOE misalignment, as an example, let us consider the cascaded DOE comprising two DOEs and designed using the SCE criterion [Figure 9c,d].The simulation results demonstrate that when the first DOE is laterally displaced by the vectors (in the case of a fixed position of the second DOE), the overall classification accuracy remains greater than 95% (i.e., the decrease in the overall accuracy does not exceed 3%).The minimum contrast in this case also stays acceptable and exceeds 0.12.At further increase of the lateral displacement, the accuracy and contrast

Conclusions
In this work, we presented a gradient method for calculating cascaded DOEs.Using the unitarity property of the operator of light propagation through the cascaded DOE, we obtained explicit expressions for the derivatives of the general error functional with respect to the phase functions of the cascaded DOE.We considered the application of the gradient method to the problem of focusing different incident beams to regions with different intensity distributions and to the problem of image classification.The presented description of the gradient method unifies the problems of designing cascaded DOEs for focusing laser radiation and for image classification within a general methodological approach, in which the calculation of the derivatives of the error functionals is reduced to the same formula.
Using the proposed gradient method, we designed single and cascaded DOEs for solving the problem of focusing different incident beams on different regions and the problem of classifying handwritten images.The presented numerical simulation results demonstrate the high performance of the proposed method.In particular, it was shown that a single DOE enables solving the classification problem with an accuracy of approximately 97%, whereas a cascaded structure containing two DOEs provides a 98% accuracy.
In the opinion of the authors, the presented gradient method can be generalized to the case of generating required intensity distributions in the case of several incident beams with different wavelengths.The design and investigation of such DOEs operating at several wavelengths will be the subject of further research.

Figure 1 .
Figure 1.Geometry of the problem of the design of a cascaded DOE.
the complex amplitudes of the fields incident on DOEs located in the planes = between these planes.According to Equation (1), the calculation of the complex amplitude of the output field recursively.For the following analysis, it will be convenient to consider Equation (1) as a representation of linear operators describing the propagation of light from the input plane = 0 ..., n n u u from the condition of generating a light field with the re- plane.Let us describe the error of generating the required distribution using an integral criterion

Figure 1 .
Figure 1.Geometry of the problem of the design of a cascaded DOE.

Figure 2 .
Figure 2. Required intensity distributions in the output plane for the incident beams of Equation (14).We will consider three design examples: a single DOE (located in the plane z = f 1 = 300 mm) and cascaded DOEs consisting of two DOEs (located at z = f 1 = 200 mm and z = f 2 = 400 mm) and three DOEs (located at z = f 1 = 150 mm, z = f 2 = 300 mm, and z = f 3 = 450 mm).We will define the phase functions in the DOE planes on

Figure 3 .
Figure 3.The calculated phase functions of the single DOE (first row) and of cascaded structures containing two (second row) and three (third row) DOEs.Figure 3. The calculated phase functions of the single DOE (first row) and of cascaded structures containing two (second row) and three (third row) DOEs.

Figure 4 .
Figure 4. Calculated intensity distributions generated for each of the incident beams of Equation (23) by the designed single DOE (first row) and cascaded structures consisting of two (second row) and three (third row) DOEs.

EffFigure 4 .
Figure 4. Calculated intensity distributions generated for each of the incident beams of Equation (23) by the designed single DOE (first row) and cascaded structures consisting of two (second row) and three (third row) DOEs.

Figure 5 .
Figure 5. Regions j G in the output plane, in which maximum energies for the input images of different digits have to be generated.
) were used.As the initial approximation for the DOE phase function, a random phase from the range )  π  0,2 was chosen.In the DOE calculation, 10 epochs were performed, which takes approximately 7 min on a NVIDIA GTX 1070 8 Gb graphics card.Under an epoch, we understand the training of the DOE on 1000 batches containing all the images from the training set.The phase functions of the DOEs calculated using the QE and SCE criteria are shown in Figure 6.

Figure 5 .
Figure 5. Regions G j in the output plane, in which maximum energies for the input images of different digits have to be generated.

Figure 6 .
Figure 6.Phase functions of DOEs calculated using the quadratic error criterion (a) and the softmax cross entropy criterion (b).

Figure 6 .
Figure 6.Phase functions of DOEs calculated using the quadratic error criterion (a) and the softmax cross entropy criterion (b).

Figure 7 .
Figure 7. Confusion matrices and energy distribution matrices for the DOEs calculated using the quadratic error criterion (a,b) and the softmax cross entropy criterion (c,d).

Figure 7 .
Figure 7. Confusion matrices and energy distribution matrices for the DOEs calculated using the quadratic error criterion (a,b) and the softmax cross entropy criterion (c,d).

Figure 7 .
Figure 7. Confusion matrices and energy distribution matrices for the DOEs calculated using the quadratic error criterion (a,b) and the softmax cross entropy criterion (c,d).

Figure 8 ..
Figure 8. Input image of the digit "3" (a) and the corresponding energy distribution (b).Then, using the QE and SCE criteria, we designed cascaded DOEs comprising two DOEs located in the planes = = 1 300 m m z f and = = 2 600 m m z f.The output plane was

Figure 8 .
Figure 8. Input image of the digit "3" (a) and the corresponding energy distribution (b).

Figure 9 .
Figure 9. Phase functions of the cascaded DOEs calculated using the quadratic error criterion (a,b) and the softmax cross entropy criterion (c,d).
For example, a single DOE with the step size = μ 4 m d (and the rest of the parameters coinciding with those of the single DOE examples considered above) calculated using the QE criterion provides the overall accuracy of 97.9% and minimum contrast of 0.16, which is considerably better than in the case of a single DOE with the larger step size of μ 18 m [see Figure 7a,b].It is worth noting that this result is comparable with the performance of the cascaded structure of two DOEs with the μ 18 m step size [see Figure 10a,b].

Figure 9 .
Figure 9. Phase functions of the cascaded DOEs calculated using the quadratic error criterion (a,b) and the softmax cross entropy criterion (c,d).

Figure 10 .
Figure 10.Confusion matrices and energy distribution matrices for cascaded DOEs calculated using the quadratic error criterion (a,b) and the softmax cross entropy criterion (c,d).