Inverse Design for Silicon Photonics: From Iterative Optimization Algorithms to Deep Neural Networks

Abstract: Silicon photonics is a low-cost and versatile platform for various applications. When designing silicon photonic devices, the light-material interaction within their complex subwavelength geometries is difficult to investigate analytically, so numerical simulations are widely adopted. To make the design process more time-efficient and to push device performance towards its physical limits, various methods have been proposed over the past few years to manipulate the geometries of the silicon platform for specific applications. In this review paper, we summarize the design methodologies for silicon photonics, including iterative optimization algorithms and deep neural networks. For iterative optimization methods, we discuss different scenarios in order of increasing degrees of freedom: empirical structures, QR-code like structures and irregular structures. We also review inverse design approaches assisted by deep neural networks, which generate multiple devices with similar structures much faster than iterative optimization methods and are thus suitable when large numbers of optical components are needed. Finally, the applications of inverse design methodology in optical neural networks are also discussed. This review intends to help readers choose the most suitable design methodology for a specific scenario.


Introduction
Silicon photonics enables the mass fabrication of integrated photonic devices at low cost, owing to its compatibility with the complementary metal oxide semiconductor (CMOS) process [1]. It has emerged as one of the most important technologies for data communication [2] and computation [3], as integrated electronic circuits are reaching their limits and incurring high energy costs. Recently, it has also shown potential for other applications such as sensing [4] and light detection and ranging (LiDAR) [5]. For the many different silicon photonic devices, versatile functions come mainly from the design of the device geometry [6]. Intuitively, the design of a single component with a specific function is fulfilled by physics-based methods such as analytical models, practical experience and scientific intuition [7]. However, physics-based methods are laborious and place high demands on the designer's experience. Furthermore, the performance evaluation of light-material interaction with complicated geometries has to rely on numerical electromagnetic (EM) simulations [8], as the computation for irregular geometries is often non-intuitive. To make the design process more efficient, inverse design approaches have been introduced, assisted by various iterative optimization methods and deep neural networks (DNNs) [6][7][8][9][10][11][12].
Iterative optimization algorithms can optimize device geometric parameters for a targeted optical performance. They have been widely researched recently due to their time-efficiency and satisfactory optimization results. For devices with regular structures, key geometry parameters, which set the degrees of freedom (DOF) during design, are abstracted and tuned for the targeted optical performance. For devices with one or two DOF, the optimal optical performance can be found by sweeping all the possible solutions. However, the sweeping process becomes extremely laborious and requires a large time budget for higher DOF, as EM simulations are time-consuming. The relationships between geometry parameters and optical performance resemble mathematical non-deterministic polynomial hard (NP-hard) problems [13], which are difficult to define by an explicit function. Inspired by iterative gradient-free algorithms for solving NP-hard problems, such as the genetic algorithm (GA), particle swarm optimization (PSO) and direct binary search (DBS), many silicon photonic devices with regular structures have been optimized in this way [14]. Apart from gradient-free algorithms, gradient-based algorithms like topology optimization (TO) for devices with irregular structures have also been investigated for inverse design recently [15]. Instead of abstracting key parameters that describe the device geometry, TO takes the whole design space as a bitmap, where all the pixels are iteratively updated in the direction of gradient descent until the gradients are small enough. In this way, a complex geometric pattern can be quickly generated for a targeted optical performance.
The iterative optimization algorithms discussed above optimize only one device per optimization round, and the whole process has to be repeated when the optical response target varies even slightly. This is extremely laborious when many similar devices with slightly different target responses are needed, such as power splitters with different splitting ratios or grating couplers (GCs) working at different wavelength bands [16]. Fortunately, the DNN, an unprecedented tool capable of building a bridge between data features and data labels, has also found its way into the design of silicon photonic devices [6]. Based on the mapping direction between geometric parameters and optical response, DNNs are trained as forward models or inverse models. A forward model maps the geometric parameters of a device to its optical response; it acts as a surrogate for EM solvers to accelerate device performance evaluation during optimization. Conversely, an inverse model maps a desired optical response to geometric parameters; once the model is well trained, it takes almost no time to generate a device for a specific demand, eliminating the need for optimization.
There are already some excellent review papers about optimization algorithms and DNNs for nanophotonics inverse design. For example, Molesky et al. focused on the adjoint method and its applications [7], and Jensen et al. concentrated on topology optimization of nanophotonics [15]. More recently, Jiang et al. and So et al. presented very detailed reviews of DNN-assisted metasurface and nanophotonics design [6,8]. Different from these previous reviews, we review design methodologies for silicon photonics from the perspective of demands. According to the different design goals for a single device or for multiple similar devices, we separate the design methodologies into iterative optimization algorithms and DNN-assisted methods. The iterative optimization algorithms are illustrated in order of increasing DOF: empirical structure, QR-code like structure and irregular structure. We also separate DNNs into discriminative neural networks and generative neural networks, which are trained as forward models and inverse models, respectively. Finally, we point out the challenges of existing design methodologies and highlight a future research direction: inverse designed optical neural networks (ONNs).

Inverse Design Schemes for Silicon Photonics
Different from electronic circuits, where versatile functions come from the combination of electronic components, a single silicon photonic device is able to achieve complex functions through delicate geometry design. For a given device geometry, the optical response can be evaluated by EM simulation before experimental validation. EM simulation tools have been developed by solving Maxwell's equations numerically. These tools are important for device modelling and optimization, allowing the designer to understand how light propagates in a given geometry and to obtain an accurate optical response. Widely adopted EM simulation methods include the finite element method (FEM) [17], the finite-difference time-domain method (FDTD) [18], the eigenmode expansion method (EME) [19] and rigorous coupled wave analysis (RCWA) [20]. The FDTD method is among the most popular choices for silicon photonic devices, since it is a general EM solver well suited for high index contrast structures with complex geometry. It also allows wideband simulation, as a spectral response is generally needed for silicon photonic devices.
For a targeted optical response, inverse design tries to find a suitable device geometry. As shown in Figure 1, either a certain electric field distribution or an optical spectrum can serve as the figure of merit (FOM), which describes the phenomenon of interest to the designer. Other merits such as performance, compactness or fabrication feasibility may also be considered for different application scenarios. Based on the number of DOF, three types of structures are widely used for inverse design: empirical structures [21][22][23][24][25][26][27][28][29][30][31][32], QR-code like structures [33][34][35][36][37][38][39][40][41][42][43][44][45] and irregular structures [46][47][48][49][50][51][52][53][54][55][56][57][58]. An empirical structure, making use of existing classical structures, usually has only a few DOF. A QR-code like structure, with hundreds of DOF, selectively etches the design area with regular shapes such as rectangles or circles in a periodic manner. An irregular structure, taking the whole design area as a bitmap, has the most complex pattern, with hundreds of thousands of DOF. In the rest of this section, iterative optimization algorithms for optimizing these structures are illustrated and compared in detail.



Optimization of Empirical Structures
An empirical structure makes use of classical structures, where only a few parameters such as radius, duty cycle and waveguide widths are tuned for the targeted optical response. These classical structures have been proposed by researchers for different application scenarios. For example, micro-rings are preferred for wavelength selection due to their high wavelength sensitivity [21,27]. Directional couplers (DCs) are used for polarization beam splitters (PBS), mode-division multiplexers (MDM) and wavelength-division multiplexers (WDM), since DCs couple light from one waveguide to another in a reciprocating way [21,24,26]. The three-port Y-splitter is widely used for power splitting (PS) [59]. The Mach-Zehnder interferometer (MZI) splits light into two paths and recombines them after introducing an optical path difference, so it can be used for optical switches or modulators [30,60,61]. Gratings, which periodically alternate silicon and air (or silica), have been applied to manipulate the effective refractive index of waveguides or to act as fiber grating couplers (GCs) [62,63]. Even with known structures and underlying theories for designing a device towards a targeted FOM, it still takes a lot of time to tailor the key parameters, as each parameter set requires at least one EM simulation.
The relationship between the key parameters abstracted from an empirical structure and its optical response is a kind of NP-hard problem, which can be solved by heuristic algorithms. GA, a heuristic optimization algorithm, is inspired by the Darwinian evolutionary principle of survival of the fittest [64]. After the initial chromosomes, which represent different parameter sets, are randomly generated, the fitness value describing the FOM of each chromosome is computed, as shown in Figure 2a. Chromosomes with better fitness values are chosen as elites and passed to the next generation directly. Among the remaining chromosomes, some experience crossover with others to create new chromosomes and some produce offspring by mutation. Through the above elite selection, crossover and mutation, a new generation is born. This procedure is repeated until the best FOM is found.
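The GA loop described above can be sketched as follows. This is a minimal, generic implementation: the `fitness` callable is a hypothetical stand-in for the EM simulation that would score each parameter set in a real design flow, and the population size, elite count and mutation rate are illustrative defaults rather than values from any cited paper.

```python
import random

def genetic_optimize(fitness, bounds, pop_size=20, n_elite=2,
                     mutation_rate=0.1, n_generations=60):
    """Generic GA loop: elite selection, one-point crossover and mutation."""
    dim = len(bounds)
    # Random initial chromosomes (parameter sets) within the boundaries.
    pop = [[random.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(pop_size)]
    for _ in range(n_generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        elites = ranked[:n_elite]          # best chromosomes pass on directly
        children = []
        while len(children) < pop_size - n_elite:
            # Parents are drawn from the better half of the population.
            p1, p2 = random.sample(ranked[:pop_size // 2], 2)
            cut = random.randrange(1, dim) if dim > 1 else 0
            child = p1[:cut] + p2[cut:]    # one-point crossover
            for i, (lo, hi) in enumerate(bounds):
                if random.random() < mutation_rate:
                    child[i] = random.uniform(lo, hi)   # mutation
            children.append(child)
        pop = elites + children            # the new generation
    return max(pop, key=fitness)
```

In a real device optimization, each `fitness` call would launch one EM simulation, which is why keeping the population small and reusing elites matters.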
We take reference [25] as an example to illustrate the use of GA to optimize an optical device with an empirical structure. A broadband PBS based on a directional coupler with a subwavelength grating (SWG) structure is designed, as shown in Figure 2b. Key parameters such as the pitch Λ, the duty cycle α and the coupling length L_s are abstracted to represent this structure and coded as a chromosome. Initial chromosomes (20 parameter sets) are generated randomly within the parameter boundaries. The FOM of each chromosome is calculated with respect to the output of the transverse electric (TE) mode at the through waveguide and the transverse magnetic (TM) mode at the cross waveguide. After hundreds of iterations, the best parameter set is found, as Figure 2c suggests, where the transmission efficiencies for both TE and TM modes at the desired output ports are larger than 85% over the wavelength range from 1250 nm to 1680 nm.
Another popular heuristic algorithm is PSO [65]. Different from GA, where only parts of the parameter sets are updated at each iteration, PSO updates all the parameter sets towards the global optimum of the FOM. In PSO, the initial randomly generated parameter sets are called particles, each of which is assigned an initial velocity. PSO evaluates the FOM of each particle and tracks the global best and each particle's local best FOM, as well as the corresponding particles. According to the global best particle and each particle's local best at the current iteration, the velocity of each particle is updated. With the updated velocities, a new population is generated from the original particles. This updating procedure is repeated until the global best FOM converges to an acceptable value. PSO has also been widely used for the design of silicon photonic devices with empirical structures, as listed in Table 1.
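The PSO update rule just described can be sketched in the same style. Again, `fom` is a hypothetical placeholder for an EM simulation, and the inertia and acceleration coefficients (`w`, `c1`, `c2`) are common textbook defaults, not values from the referenced works.

```python
import random

def pso_optimize(fom, bounds, n_particles=20, n_iters=100,
                 w=0.7, c1=1.5, c2=1.5):
    """Generic PSO loop: velocities pulled towards personal and global bests."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # local (per-particle) bests
    pbest_val = [fom(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = fom(pos[i])                    # one EM simulation in practice
            if val > pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val > gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest
```

Note the key contrast with GA: every particle moves at every iteration, so the whole population drifts towards the current best FOM rather than only a subset being replaced.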


Optimization of QR-Code like Structures
A QR-code like structure, where regular shapes like circles or rectangles are selectively etched on a periodic 2-D surface, is also an important approach for manipulating light propagation in waveguides. Different from empirical structure-based design, a QR-code like structure with hundreds of DOF is a more general approach for different problems. In a QR-code like structure, designers only have to define the pitch and size of the etched holes as well as the footprint of the design area. Whether each hole is etched or not is then determined automatically to achieve a better FOM. Using a compact QR-code like structure, many different functional components such as power splitters [41,45], polarization rotators [33,42,44], PBS [40], WDM [42], MDM [34,39], crossings [66], GCs [67] and photonic crystals [68] have been designed.
DBS, a brute-force searching algorithm initially proposed by Seldowitz et al. for the synthesis of digital holograms [69], has been proven to be efficient for the design of devices with QR-code like structures. Shen et al. introduced DBS to design a very compact integrated PBS on the SOI platform [40], as shown in Figure 3. They defined the design area to be 2.4 × 2.4 µm², with each hole 0.12 × 0.12 µm², giving 400 rectangular pixels whose etched state has to be determined. During each iteration, one pixel in the design area is reversed, and the FOM (the average transmission efficiency of the TE and TM modes at the targeted output ports) is calculated and compared with the FOM from the last iteration. If the new FOM is larger than the old one, the new geometry with the reversed pixel is passed to the next generation; otherwise, the old geometry is kept. The above procedure is repeated until the FOM is large enough to be acceptable. After about 140 h of optimization, the average transmission efficiency of the optimized PBS is higher than 70% and its 1 dB bandwidth is 83 nm.

In addition to DBS, heuristic algorithms like GA and PSO have also been applied to the design of integrated silicon photonic devices based on QR-code like structures. Different from DBS, which updates one pixel at each iteration, heuristic algorithms code all the pixels as an individual and update them together during each iteration, so the total number of iterations can be reduced.
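The flip-and-keep logic of DBS can be sketched compactly. The `fom` callable is again a hypothetical stand-in for one EM simulation per evaluated pattern; the sweep count and pixel count are illustrative.

```python
import random

def direct_binary_search(fom, n_pixels=400, n_sweeps=3, seed=0):
    """DBS: flip one pixel at a time; keep the flip only if the FOM improves."""
    rng = random.Random(seed)
    pattern = [rng.randint(0, 1) for _ in range(n_pixels)]  # etched (1) or not (0)
    best = fom(pattern)
    for _ in range(n_sweeps):                # several passes over all pixels
        for i in range(n_pixels):
            pattern[i] ^= 1                  # reverse this pixel
            trial = fom(pattern)             # one EM simulation in practice
            if trial > best:
                best = trial                 # keep the improved geometry
            else:
                pattern[i] ^= 1              # revert to the old geometry
    return pattern, best
```

This makes the cost structure obvious: one sweep over a 400-pixel design area already costs 400 simulations, which is why the full optimization in [40] took on the order of 140 h.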


Optimization of Irregular Structures
An irregular structure has the highest DOF, since the whole design area is segmented into very small pixels [15]. Different from a QR-code like structure, which also pixelates the design area but at 100 or 200 nm resolution, an irregular structure is more intricate, with ultra-high resolution (on the 10 or 20 nm scale). It would be extremely time-consuming for the gradient-free algorithms discussed above to optimize an irregular structure with thousands of DOF, as the number of required simulations is proportional to the number of DOF. Fortunately, gradient-based algorithms shed light on this time-consuming problem, updating all the parameters with the fewest simulations.
To describe the optimization process, a mathematical model is built which optimizes the FOM under the constraint of Maxwell's equations, as in Equation (1):

$$\max_{\varepsilon,\,\mu}\; F(\mathbf{E},\mathbf{H}) \quad \text{subject to} \quad \nabla\times\mathbf{E} = -i\omega\mu\mathbf{H},\qquad \nabla\times\mathbf{H} = i\omega\varepsilon\mathbf{E} + \mathbf{J}, \tag{1}$$

where the FOM F(·) is a function of the electric field E and the magnetic field H, and J is the source current. The permittivity ε and permeability µ in the design area are the variables to be tuned for a better FOM. The objective-first (OF) method [56] and topology optimization (TO) with the adjoint method [70] have been proposed to solve Equation (1). We take the irregular PBS in reference [71] as an example to illustrate the TO process. As shown in Figure 4a, the design domain D_s is optimized by the adjoint method for two FOMs: the first is the transmission of the TE mode at the upper arm, and the second is the transmission of the TM mode at the lower arm.

As the optimization processes for the two arms are similar, we give the detailed derivation only for the TE mode. In the objective domain D_o, the actual EM fields are optimized towards the fundamental TE mode. The target at each position x_o in the objective domain is defined in the form of Poynting vectors [70], as in Equation (2):

$$f(\mathbf{E}(x_o),\mathbf{H}(x_o)) = \mathrm{Re}\big[\big(\mathbf{E}(x_o)\times\mathbf{H}_T(x_o)+\mathbf{E}_T(x_o)\times\mathbf{H}(x_o)\big)\cdot\hat{\mathbf{n}}\big], \tag{2}$$

where E, H are the actual EM fields from the simulation, E_T, H_T are the constant conjugated EM fields of the fundamental TE mode, and n̂ is the unit vector along the propagation direction. The best target f(E(x_o), H(x_o)) requires the actual EM fields to be largest in value and aligned in direction with the fundamental TE mode at position x_o. However, it is still hard to determine whether the FOM is good or not when some positions get better targets while others get worse.
Therefore, the total FOM is defined as the integral of all the targets over the objective domain, as in Equation (3):

$$F = \int_{D_o} f(\mathbf{E}(x_o),\mathbf{H}(x_o))\,\mathrm{d}x_o. \tag{3}$$

The derivative of the total FOM with respect to the permittivity and permeability in the design domain D_s is calculated using the chain rule. Firstly, the derivative of the total FOM with respect to the actual EM fields in the objective domain D_o is calculated as in Equation (4):

$$\delta F = \int_{D_o}\Big[\frac{\partial f}{\partial \mathbf{E}}\cdot\delta\mathbf{E}(x_o)+\frac{\partial f}{\partial \mathbf{H}}\cdot\delta\mathbf{H}(x_o)\Big]\,\mathrm{d}x_o, \tag{4}$$

where the factors ∂f/∂E and ∂f/∂H are constants, easily obtained from Equation (2) as in Equation (5):

$$\frac{\partial f}{\partial \mathbf{E}} = \mathbf{H}_T(x_o)\times\hat{\mathbf{n}},\qquad \frac{\partial f}{\partial \mathbf{H}} = \hat{\mathbf{n}}\times\mathbf{E}_T(x_o). \tag{5}$$

However, the terms δE, δH cannot be calculated analytically, as the relationship between E, H in the objective domain and ε, µ in the design domain cannot be expressed by an explicit function. Since a variation of permeability δµ causes little change to the EM fields, only the variation of permittivity δε is considered. In the design domain, the original geometry has electric field E(x_s) at x_s. When a very small volume ΔV at this position undergoes a small permittivity change δε, it produces an induced dipole moment p(x_s) = ΔV δε E(x_s). The field variations in the objective domain are the summed effect of these induced dipole moments in the design domain, which can be expressed with Green's functions [72] as in Equation (6):

$$\delta\mathbf{E}(x_o)=\int_{D_s}\mathbf{G}^{EP}(x_o,x_s)\,\delta\varepsilon(x_s)\,\mathbf{E}(x_s)\,\mathrm{d}x_s,\qquad \delta\mathbf{H}(x_o)=\int_{D_s}\mathbf{G}^{HP}(x_o,x_s)\,\delta\varepsilon(x_s)\,\mathbf{E}(x_s)\,\mathrm{d}x_s, \tag{6}$$

where G^EP(x_o, x_s) and G^HP(x_o, x_s) represent the EM fields at x_o in the objective domain produced by a unit dipole at x_s in the design domain. Reciprocity (the symmetry of the Green's function) suggests the relation in Equation (7):

$$\mathbf{G}^{EP}(x_o,x_s)=\big[\mathbf{G}^{EP}(x_s,x_o)\big]^{T}, \tag{7}$$

and similarly for G^HP. By substituting Equations (6) and (7) into Equation (4) and making an equivalent transformation, we get the variation of the total FOM as in Equation (8):

$$\delta F=\int_{D_s}\delta\varepsilon(x_s)\,\mathbf{E}(x_s)\cdot\Big[\int_{D_o}\Big(\big[\mathbf{G}^{EP}(x_s,x_o)\big]^{T}\frac{\partial f}{\partial \mathbf{E}}+\big[\mathbf{G}^{HP}(x_s,x_o)\big]^{T}\frac{\partial f}{\partial \mathbf{H}}\Big)\mathrm{d}x_o\Big]\mathrm{d}x_s. \tag{8}$$

The inner integral is defined as the "adjoint" electric field E_A(x_s) at position x_s in the design domain, as in Equation (9):

$$\mathbf{E}_A(x_s)=\int_{D_o}\Big(\big[\mathbf{G}^{EP}(x_s,x_o)\big]^{T}\frac{\partial f}{\partial \mathbf{E}}+\big[\mathbf{G}^{HP}(x_s,x_o)\big]^{T}\frac{\partial f}{\partial \mathbf{H}}\Big)\mathrm{d}x_o. \tag{9}$$

This adjoint electric field E_A(x_s) can be obtained through a single simulation in which dipole sources with amplitudes ∂f/∂E and ∂f/∂H are placed in the objective domain. The derivative of the total FOM with respect to the permittivity at each position then follows as in Equation (10):

$$\frac{\mathrm{d}F}{\mathrm{d}\varepsilon(x_s)}=\mathrm{Re}\big[\mathbf{E}_A(x_s)\cdot\mathbf{E}(x_s)\big]. \tag{10}$$

With one forward simulation, which calculates all the electric fields E(x_s), and one adjoint simulation, which computes all the adjoint electric fields E_A(x_s) in the design domain, the derivatives of the total FOM with respect to all the permittivities ε_s can be obtained, as Equation (10) suggests. As discussed before, there are two FOMs for the PBS.
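The essence of the adjoint trick, one forward solve plus one adjoint solve yielding the derivative of the FOM with respect to every design variable at once, can be illustrated on a toy linear system standing in for the discretized Maxwell equations. Everything here (the Helmholtz-like operator `L + diag(eps)`, the linear FOM `c·x`) is a simplified assumption for illustration, not the actual EM problem.

```python
import numpy as np

def adjoint_gradient(eps, b, c):
    """Gradient of F = c.x with A(eps) x = b via one forward + one adjoint solve.

    Toy stand-in for Maxwell's equations: A(eps) = L + diag(eps), so the
    derivative of A with respect to eps_i is a one-hot diagonal matrix.
    """
    n = len(eps)
    # 1-D Laplacian stencil playing the role of the curl-curl operator.
    L = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    A = L + np.diag(eps)
    x = np.linalg.solve(A, b)          # "forward simulation"
    x_adj = np.linalg.solve(A.T, c)    # "adjoint simulation" (source = dF/dx)
    # dF/deps_i = -x_adj_i * x_i, since dA/deps_i = e_i e_i^T.
    return -x_adj * x, x
```

Two linear solves give all n derivatives, whereas finite differences would need n extra solves; this is exactly the scaling advantage the adjoint method brings to topology optimization.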
For the other FOM', which guides the TM mode to the second waveguide arm, the derivative of FOM' with respect to the permittivity ε_s in the design domain is calculated in a similar way. The permittivity in the design domain is then updated along the direction of the total gradient (gradient ascent, since the FOM is maximized), as in Equation (11):

$$\varepsilon_s \leftarrow \varepsilon_s + \alpha\Big(\frac{\mathrm{d}F}{\mathrm{d}\varepsilon_s}+\frac{\mathrm{d}F'}{\mathrm{d}\varepsilon_s}\Big), \tag{11}$$

where α is the updating rate. The updating process is repeated until the gradient is small enough. During the updating process, the permittivity ε_s is taken as a continuous variable for convenience of derivative calculation, and it has to be discretized for real application scenarios. Both the level-set method [73] and density optimization [74] have been applied for discretization. On the silicon platform, the design domain is usually composed of silicon and silica (or air), with permittivities ε_1 and ε_2, respectively. The level-set method defines a continuous variable ϕ_s with values ranging from negative to positive; ϕ_s = 0 marks the boundary between the two materials, and the discrete permittivity in the design domain is defined as in Equation (12):

$$\varepsilon_s=\begin{cases}\varepsilon_1, & \phi_s>0,\\ \varepsilon_2, & \phi_s<0.\end{cases} \tag{12}$$

On the other hand, for the density method, the continuous variable ϕ_s ranges from 0 to 1, and the permittivity in the design domain is expressed as in Equation (13):

$$\varepsilon_s=\varepsilon_2+\phi_s\,(\varepsilon_1-\varepsilon_2). \tag{13}$$

Different from the level-set method, which is discretized immediately by Equation (12), the density method generates "gray structures" with continuous permittivity between ε_1 and ε_2. To further discretize the permittivity, Su et al. proposed a discretization method introducing self-biasing or neighbor-biasing [46]. In the example of the PBS design, the density method is adopted, and the optimization process is shown in Figure 4b, where an initial random structure is optimized and discretized after 278 iterations.
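The level-set and density mappings of Equations (12) and (13), together with a crude self-biasing push towards binary values (in the spirit of, but not identical to, the method of Su et al. [46]), can be sketched as:

```python
import numpy as np

def levelset_eps(phi, eps1, eps2):
    """Level-set mapping (Eq. 12): the sign of phi picks one material outright."""
    return np.where(phi > 0, eps1, eps2)

def density_eps(phi, eps1, eps2):
    """Density mapping (Eq. 13): phi in [0, 1] interpolates, giving 'gray'
    intermediate permittivities that must later be pushed to binary."""
    return eps2 + phi * (eps1 - eps2)

def push_to_binary(phi, strength=0.5):
    """Illustrative self-biasing step: nudge each density value away from 0.5
    towards 0 or 1, removing gray regions over repeated applications."""
    return np.clip(phi + strength * (phi - 0.5), 0.0, 1.0)
```

Applying `push_to_binary` repeatedly drives every density value (except exactly 0.5) to 0 or 1, which is the qualitative behaviour the biasing schemes in [46] are designed to achieve while preserving device performance.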
Apart from the discretization of permittivity, minimal feature size control is also an important issue for irregular structures, as the large DOF brings features with critical sizes too small to be fabricated. Different approaches have been proposed to solve this issue, such as density filters [75], penalty functions [76], artificial damping [50] and morphological filters [77]. Recently, Khoram et al. introduced b-splines to control the minimal feature size of irregular structures [78]. Instead of filtering out small features directly, this method maps the design domain to a lower-dimensional space composed of b-spline functions, which damps small features intrinsically. In the example of the PBS design, the optimized device geometry is shown in Figure 4c, where the density adjoint method with b-spline functions is adopted. Within a footprint of 2.4 × 2.8 µm 2 , this optimized PBS splits TE and TM modes with over 90% transmission efficiency covering a 420-nm wavelength range.
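The density-filter idea [75] mentioned above can be illustrated with a minimal sketch: averaging the density over a neighborhood damps any feature smaller than the kernel, so it disappears after thresholding. This is a simplified stand-in; reference [78] instead works in a b-spline basis.

```python
import numpy as np

def mean_density_filter(phi, radius):
    """A simple density filter: replace each density value with the
    average over a (2r+1)x(2r+1) neighborhood, damping features
    smaller than the kernel."""
    k = 2 * radius + 1
    padded = np.pad(phi, radius, mode="edge")
    out = np.zeros_like(phi, dtype=float)
    for i in range(phi.shape[0]):
        for j in range(phi.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

# A single isolated pixel (a feature below the minimum size) is smeared
# out by the filter and removed entirely by a 0.5 threshold.
phi = np.zeros((9, 9))
phi[4, 4] = 1.0
filtered = mean_density_filter(phi, radius=1)
```

After filtering, the isolated pixel's value drops to the neighborhood mean (1/9 here), so a 0.5 threshold erases it, which is exactly the minimum-feature-size behavior the text describes.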

Comparison of Iterative Optimization Algorithms for Silicon Photonics Design
We have discussed three examples of inverse-designed PBSs via different initial structures with corresponding iterative optimization algorithms in the above three subsections. The PBS based on the empirical structure has high transmission efficiency (over 85% within a 420 nm wavelength range) for both TE and TM modes, and the critical size is under the control of the designers. However, it also has the largest footprint (approximately 7 × 4 µm 2 ) and the lowest DOF (only 3). It takes about one week to optimize this device on two 12-core central processing units (CPUs). Profound prior knowledge about the coupling theory is also required and a skillful initial structure has to be chosen; otherwise, the device may not work at all. The PBS based on the QR-code like structure has an ultra-compact size (2.4 × 2.4 µm 2 ) as well as a high DOF (i.e., 400). Its design process does not need much prior knowledge, but the optimization time is long (~140 design hours), as each QR-code has to be verified during DBS. This time-consuming optimization process limits both the size of the footprint and the number of DOF, so the transmission efficiency of the DBS-optimized PBS is not very high: the peak efficiency is ~80% and its 1 dB bandwidth is 83 nm. The irregular PBS optimized via the adjoint method is also ultra-compact (2.8 × 2.4 µm 2 ), and it has 16,800 DOF due to its ultra-high resolution (i.e., 20 nm). After 64 h of optimization on two 12-core CPUs, the final irregular PBS has transmission efficiency over 90% covering a 420-nm wavelength range. The irregular PBS has a time-efficient optimization process, where all the parameters can be updated in one round via only two simulations (one forward simulation and one adjoint simulation). However, profound knowledge about the light-material interaction process, as well as advanced mathematics, is required for the gradient calculations. Furthermore, gradient-based algorithms easily fall into local optima.
Fabrication is another concern: the resolution of an irregular structure is usually 20 nm, which may result in minimal features too small to be fabricated.
Apart from the examples of the PBS discussed above, inverse design based on these methods has been widely used for other silicon photonic devices, as shown in Table 1. Empirical structure-based devices are large, but they have high performance and are easy to fabricate since their critical sizes are easy to control. QR-code like structures have compact footprints and controllable minimal feature sizes at the cost of some decrease in optical performance. Irregular devices have the most appealing performance, but minimal feature size control is greatly needed for convenient device fabrication.
Some researchers have also combined different structures and algorithms to overcome the limitations of certain methods for the design of a single device. For example, Xu et al. embedded a QR-code like structure into an empirical waveguide crossing structure to work as a wavelength filter [35]. Xie et al. proposed a global optimization method combining the GA and an annealing algorithm for optimizing an on-chip twisted light emitter [81].

Deep Neural Networks Assisted Nanophotonics Design for Silicon Platform
Traditionally, the optical response of a given silicon geometry is obtained via EM simulations, which are accurate but time-consuming. If more than one device with similar structures and optical responses is needed for a specific scenario, such as arbitrary power splitters, different mode converters, wavelength filters or vertical gratings for different wavelength bands, the iterative optimization methods in Section 2 need to be performed multiple times for different target responses. It is therefore important to find the relationship between the optical response and the geometric parameters instead of optimizing devices one by one. However, this implicit relationship is hard to find. Fortunately, DNNs open up a new way for silicon photonics design. DNNs have gained great enthusiasm from the research community for their ability to find complex and nonlinear relationships between input and output data [82].
Recently, DNNs have been introduced to accelerate the design process of silicon photonics. As shown in Figure 5, the designer abstracts geometric parameters x from a certain device structure, such as an empirical structure, QR-code like structure or irregular structure. The optical response is sampled as y, such as an EM field distribution or optical spectrum that describes the phenomena of interest to the designer. Usually, several groups of device geometric parameters [x 1 , x 2 , . . . , x n ] correspond to a certain optical response y [83]. The mapping from device geometric parameters x to optical response y is called a forward model [84][85][86][87][88][89][90], while the inverse model describes the mapping from optical response to device geometric parameters [91][92][93]. Both of the above-mentioned mapping types have been widely performed through DNNs.
Many advanced DNNs have already been considered for the design of metasurfaces [94][95][96][97][98][99][100][101][102][103], since many similar components with different diffraction angles are needed in this case. Advanced DNN architectures may also enable the design of integrated silicon photonic devices. In this section, we will discuss how DNNs can be trained as forward models and inverse models to assist the design of nanophotonics on the silicon platform. For each kind of DNN, we will demonstrate a few examples as well as give the basic mathematical background of some fundamental DNN types.


Training Discriminative Neural Networks as Forward Models
A forward model, which takes the device geometric parameters as the feature x and the optical response as the label y, builds the mapping relation y = F(x) as in Equation (14), where F(·) is the complicated nonlinear function constructed by the DNN. The mapping from device geometric parameters x to optical response y is a one-to-one problem, as a specific device geometry corresponds to only one optical response. Discriminative neural networks excel at modeling one-to-one mappings, such as regression or classification.
In this section, we will focus on the multi-layer perceptron (MLP) and the convolutional neural network (CNN) for this kind of discriminative problem.

Multi-Layer Perceptron
The origin of the MLP dates back to 1989, when Cybenko proposed a feed-forward network with a single hidden layer containing a finite number of neurons. Under mild assumptions on the activation function, this neural network can approximate almost any continuous function given enough training data samples [104].
We take [85] as an example to illustrate the process of constructing and training an MLP to surrogate the EM simulation of grating couplers. As shown in Figure 6, the geometric parameters (grating pitch, duty cycle, fill factor, fiber angle and polarization) are encoded as the data feature x = [x 1 , x 2 , x 3 , x 4 , x 5 ]. Peak efficiencies and corresponding wavelengths for the TE and TM modes are marked as the data label y = [y 1 , y 2 , y 3 , y 4 ]. The MLP adopted by Gostimirovic et al. consists of one input layer (5 neurons), three hidden layers (100, 50, 50 neurons) and one output layer (4 neurons), where the output value of each non-input neuron is determined by all the neurons in the previous layer. Taking the first neuron in the third hidden layer x_1^3 for example, the weighted summation of the neurons in the second hidden layer is z_1^3 = Σ_i w_i^3 x_i^2 + b^3, as shown in Equation (15), where w_i^3 and b^3 are the ith weight and the bias in the third hidden layer, respectively. The weighted summation is passed through an activation function to get the output value of the neuron, x_1^3 = f(z_1^3), as shown in Equation (16), where f(·) is the activation function, which typically maps the output of each neuron to a finite range such as [0, 1] or [−1, 1]. Activation functions are supposed to be computationally efficient, as there may be abundant neurons in a DNN [105]. In this example, the Rectified Linear Unit (ReLU) activation function is adopted, which is expressed as f(z) = max(0, z) in Equation (17).
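The feed-forward computation of Equations (15)–(17) can be sketched as follows; the layer sizes follow the 5-100-50-50-4 network of [85], but the weights here are random placeholders rather than trained values, and applying ReLU to the output layer is a simplification of this sketch.

```python
import numpy as np

def relu(z):
    """Equation (17): ReLU(z) = max(0, z)."""
    return np.maximum(0.0, z)

def mlp_forward(x, weights, biases):
    """Feed-forward pass of Equations (15)-(16): each layer computes the
    weighted sum W a + b of the previous layer's outputs, then activates."""
    a = x
    for W, b in zip(weights, biases):
        a = relu(W @ a + b)
    return a

# Layer sizes 5 -> 100 -> 50 -> 50 -> 4, as in the grating-coupler MLP;
# random placeholder weights, illustrative input values.
rng = np.random.default_rng(42)
sizes = [5, 100, 50, 50, 4]
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = mlp_forward(np.array([0.63, 0.5, 0.8, 14.0, 1.0]), weights, biases)
```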
Through this feed-forward propagation, the relationship between the input data feature x and the output label y is shown in Equation (18), y = f(W 4 f(W 3 f(W 2 f(W 1 x + b 1 ) + b 2 ) + b 3 ) + b 4 ), where W : W 1 , W 2 , W 3 , W 4 and b : b 1 , b 2 , b 3 , b 4 consist of the weights and biases in all the layers. These weights and biases are determined by training the MLP with known data samples, which is accomplished by backpropagation. Firstly, the loss function in the form of the mean square error (MSE) is defined as L(y p , y) = (1/N) Σ_i (y p i − y i ) 2 , as in Equation (19), where y p i is the predicted value of the ith output neuron and y i is the ith label of a data sample. Apart from MSE, the loss function can also be expressed, e.g., in the form of the cross-entropy loss [106]. The loss function L(y p , y) is minimized by gradient descent, such as stochastic gradient descent (SGD) or mini-batch SGD. After the calculation of the gradients for the tth sample, the weights are updated as W ← W − α ∂L/∂W, as in Equation (20), where α is the learning rate that controls the update speed. The bias b is updated in a similar way. In the example of reference [85], 9190 data samples are generated via 2D FDTD and split with a ratio of 85:15 for training and validation. The trained MLP has 93.2% prediction accuracy and is 1830 times faster than FDTD simulation; hence, this trained MLP is well suited to surrogate EM simulations of grating couplers.
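A minimal sketch of the MSE loss (Equation (19)) and the gradient-descent update (Equation (20)), using a single linear layer so the gradients can be written in closed form instead of full backpropagation; all sizes and data are illustrative.

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Equation (19): mean square error over the output neurons."""
    return np.mean((y_pred - y_true) ** 2)

def sgd_step(W, b, x, y_true, alpha=0.1):
    """One SGD step (Equation (20)) on a linear model y = W x + b."""
    y_pred = W @ x + b
    err = 2.0 * (y_pred - y_true) / y_true.size   # dL/dy_pred
    W -= alpha * np.outer(err, x)                 # dL/dW = err x^T
    b -= alpha * err                              # dL/db = err
    return W, b

rng = np.random.default_rng(0)
W, b = rng.normal(size=(4, 5)), np.zeros(4)
x, y_true = rng.normal(size=5), rng.normal(size=4)
losses = []
for _ in range(100):
    losses.append(mse_loss(W @ x + b, y_true))
    W, b = sgd_step(W, b, x, y_true)
```

Repeating the update drives the loss toward zero on this single sample; real training loops over many samples (or mini-batches) per epoch.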
This trained MLP is easy to connect with optimization algorithms for the rapid development of new grating couplers. Gostimirovic et al. demonstrated that their trained MLP combined with brute-force sweeping could find an optimal design 61 times faster than FDTD simulation combined with PSO. However, with increasing DOF, an MLP has to be deep enough to map the data features to the data labels, and the number of required training data samples also increases. For example, Tahersima et al. trained an eight-layer ResNet with 20,000 data samples to predict the output spectra of power splitters with QR-code like structures [92]. Apart from increasing the number of layers of an MLP, reducing the dimension of the data features is also an efficient way to map high-dimensional data features to the data labels, which can also alleviate the requirement for a large number of training data samples. Melati et al. have demonstrated that principal component analysis (PCA) is capable of compressing high-dimensional geometric parameters into a sub-space without loss of useful information [107]. To avoid the massive data sample requirement for devices with very high DOF, like irregular structures, Liu et al. adopted a Fourier transform together with a high-frequency filter to diminish high-frequency geometry information and then used this compressed data in the Fourier domain to train an MLP as a forward model [108].

Convolutional Neural Network
As discussed in Section 3.1.1, it is hard for an MLP to process devices with high-DOF structures, such as QR-code and irregular structures, without introducing additional methods to reduce the data dimensionality. Fortunately, the CNN, another popular DNN, is innately capable of processing high-dimensional data features while requiring fewer training data samples than an MLP under the same conditions. Inspired by the human visual information processing mechanism (David, 1981), a CNN processes data layer by layer from specific to abstract and has been widely used for image classification, segmentation and recognition [109]. A complete CNN usually consists of three types of layers, i.e., convolutional layers, pooling layers and fully-connected layers. Convolutional layers extract local features of an input data matrix, while pooling layers, also called down-sampling layers, greatly reduce the data dimensionality and help avoid overfitting. The fully-connected layers are similar to an MLP, which projects the extracted data features to the data labels.
We take reference [86] as an example to illustrate the process of building and training a CNN as a forward model to predict the output spectra of a QR-code like power splitter. As shown in Figure 7a, this power splitter consists of one inverse-tapered input waveguide, two tapered output waveguides and a 2.6 × 2.6 µm 2 middle square design area which is separated into 20 × 20 pixels. Each of these pixels has a binary state: etched (coded as 1), i.e., a circle with a diameter of 90 nm, or not etched (coded as 0). This 0-1 coded 20 × 20 matrix works as the input data feature X of the CNN, which goes through three convolutional layers with ReLU activation. In a convolutional layer, different convolutional kernels convolve with the input matrix to generate different sub-matrices, as shown in Figure 7b. After the convolutional layers, the extracted features are condensed via pooling layers and flattened into a vector. This vector containing the extracted features is then projected to the data labels y via fully-connected layers with the Sigmoid activation function. The data labels are vectors with samples of the transmission spectra as their elements, as shown in Figure 7c. After training, this CNN predicted the transmission performance with 85% accuracy compared with simulation results, which was much better than the 16% accuracy of a 4-layer MLP using the same training data samples. Thus, a well-trained CNN is preferred for predicting the optical performance of devices with QR-code structures. A simple example of the convolutional and maximal pooling processes is demonstrated in Figure 7d.
As shown in the red square, the pre-defined convolutional kernel is multiplied element-wise with the pixels in the original matrix and summed to get a new value in the sub-matrix. Moving the convolutional kernel along the original 5 × 5 matrix, we get a 3 × 3 sub-matrix. To further decrease the matrix dimension, a maximal pooling operation is performed, where only the maximum value is picked to represent the sub-matrix, as marked by the orange squares. These convolutional and pooling layers are widely used for high-dimensional image data processing and are easy to train.
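The sliding-kernel and maximal-pooling operations of Figure 7d can be reproduced in a few lines; the 5 × 5 input and the 3 × 3 kernel below are arbitrary illustrative values, not those of the figure.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Slide the kernel over the matrix (stride 1, no padding); a 3x3
    kernel over a 5x5 input yields a 3x3 feature map."""
    kh, kw = kernel.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2, stride=1):
    """Keep only the maximum of each window to shrink the feature map."""
    oh = (x.shape[0] - size) // stride + 1
    ow = (x.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.arange(25, dtype=float).reshape(5, 5)    # stand-in 5x5 input
kernel = np.eye(3)                              # illustrative 3x3 kernel
feature = conv2d_valid(x, kernel)               # 3x3 feature map
pooled = max_pool(feature)                      # 2x2 map after pooling
```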


Training Generative Deep Neural Networks as Inverse Models
A well-trained inverse model can predict the component geometries directly according to target optical responses without any iterative optimization procedure. However, it is difficult to build and train a DNN as an inverse model, because a specific optical response corresponds to different component geometries, which is hard to model via discriminative neural networks. Luckily, generative neural networks can solve this one-to-many projection [110]. Instead of fitting data features to data labels directly, generative neural networks model the underlying distributions of the data features x (device geometry parameters) and data labels y (optical response). However, the training of generative neural networks is hard to converge without enough data samples. To ease the massive requirement for training data samples, semi-supervised generative neural networks like the conditional variational autoencoder (CVAE) and the conditional generative adversarial network (CGAN) have been proposed for building inverse models [91,103,111]. Furthermore, combined with an EM simulator such as FDTD, unsupervised generative neural networks have also been trained as inverse models, where prepared training data samples are not required [99,100].

Conditional Variational Autoencoder
The variational autoencoder (VAE), a kind of generative neural network, generates data features through a decoder whose input is sampled from a normal distribution [112]. The input of the decoder is supposed to follow the same normal distributions as the output of an encoder. In the training process, the encoder projects the original data features to their mean values and variances, which are supposed to obey normal distributions, while the decoder tries to reconstruct the original data features from latent variables sampled from these normal distributions. The VAE is trained by minimizing the difference between the real features from data samples and the generated features from the VAE, and by ensuring that the encoded data obeys normal distributions. After the training process, different groups of new data features can be generated by sampling different latent variables from the normal distributions and feeding them into the decoder. However, a VAE can only generate data features (e.g., device geometries) similar to its training data samples, which is not suitable for an inverse design problem where a given data label (target optical response) is desired. Fortunately, the CVAE is a model which merges label information into a VAE, so that the trained model can generate data features according to data labels.
We take reference [91] as an example to illustrate how to train a CVAE as an inverse model to generate component geometries according to a user-defined optical spectrum. The initial power splitter structure is similar to the structure in Figure 7, with 20 × 20 pixels in the middle square design area, but the diameters of the etched holes are tunable to increase the DOF of the QR-code structure. 20 × 20 data features X with values ranging from 0 to 1 are used to represent these holes, where variables with values less than 0.3 represent no hole and variables with values in [0.3~1] correspond to etched holes with diameters in [44~77 nm]. As shown in Figure 8, the features X from real data samples are encoded through convolutional layers and fully-connected layers into mean values µ and variances σ. The latent vector z, which is sampled from the Gaussian distributions described by µ and σ, is concatenated with the coded optical response y to work as the input of the decoder. After two de-convolutional layers and pooling layers, a geometry similar to the training data samples is generated. The aim of the training process is to diminish the difference between the original and generated geometric parameters and to decrease the Kullback-Leibler (KL) divergence between the latent z and the Gaussian prior. 15,000 prepared training data samples, including some semi-optimized power splitters with different splitting ratios, are used to train this CVAE. The trained CVAE then generates 1000 new geometrical patterns according to different splitting ratios within seconds, which are validated by FDTD and marked with labels to train the CVAE again for better performance. This trained CVAE is able to generate power splitters with 90% efficiencies for given splitting ratios.
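Two pieces of the CVAE machinery described above are easy to isolate: the reparameterized sampling of the latent vector z from µ and σ, and the KL-divergence term that pushes the encoded distribution toward the Gaussian prior. The dimensions and the conditioning vector below are illustrative, not those of reference [91].

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    sampling step differentiable w.r.t. the encoder outputs."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), the regularizer in the CVAE loss."""
    return -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))

rng = np.random.default_rng(0)
mu, log_var = np.zeros(16), np.zeros(16)   # encoder output: standard normal
z = reparameterize(mu, log_var, rng)
# Conditioning: concatenate latent z with a coded optical response y,
# e.g. a hypothetical 50:50 splitting-ratio target
y_cond = np.array([0.5, 0.5])
decoder_input = np.concatenate([z, y_cond])
```

The KL term is zero exactly when the encoder outputs a standard normal, which is what the training objective drives it toward.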


Conditional Generative Adversarial Network
Different from the VAE, which encodes original data features into normal distributions and decodes these distributions to reconstruct the original data features, a generative adversarial network (GAN) uses a kind of game theory for training [113]. A GAN consists of a generator and a discriminator which compete with each other. With a random noise input, the generator produces a group of generated data features (e.g., device geometry parameters), while the discriminator judges whether this group of generated data features is real or not. The training of a GAN is a game process where the generator intends to fool the discriminator with generated data features and the discriminator tries to distinguish the generated features from the original ones. Firstly, the discriminator is trained several times to distinguish generated features from original features. Then, the discriminator works as a supervisor to train the generator until the generator can produce data features that deceive the discriminator itself. The training procedures of the discriminator and generator are alternately repeated several times until the generator can produce almost "real" data features. Like a CVAE, a CGAN also merges data labels into the model to generate new data features: random noise and data labels are concatenated and then applied as the input to the generator.
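The adversarial game described above is usually trained with binary cross-entropy losses; the sketch below shows the two loss functions and their values at the equilibrium where the discriminator outputs 0.5 everywhere. The discriminator scores are assumed inputs here, not the output of a real network.

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for a discriminator probability p in (0, 1)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

def discriminator_loss(d_real, d_fake):
    """D wants real samples scored 1 and generated samples scored 0."""
    return np.mean(bce(d_real, 1.0) + bce(d_fake, 0.0))

def generator_loss(d_fake):
    """G wants the discriminator to score its samples as real (label 1)."""
    return np.mean(bce(d_fake, 1.0))

# When D is maximally fooled (scores 0.5 everywhere), the losses settle
# at 2*log(2) for D and log(2) for G.
d_real = np.full(8, 0.5)
d_fake = np.full(8, 0.5)
```

Alternating minimization of these two losses is the "game process" the text describes: the generator's loss falls as its samples are scored closer to 1.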
We take reference [103] as an example to illustrate the process of training a CGAN as an inverse model to predict the irregular geometries of metagratings. The target optical properties are the diffraction angle θ and diffraction wavelength λ. As shown in Figure 9, the label y which describes the optical properties is combined with random noise to work as the input of the generator. The random noise n is utilized to improve the robustness of the generator as well as to generate different irregular geometries. The generator, which consists of two fully-connected layers, four de-convolutional layers and a Gaussian filter, produces new irregular geometries. The discriminator consists of convolutional layers and fully-connected layers. It works as a supervisor to determine whether a generated irregular geometry shares the same properties with the prepared irregular geometries. The generator and discriminator are trained alternately. 600 high-resolution, half-optimized irregular metagratings are used as the initial training data samples. After training, the CGAN generates thousands of irregular metagratings within seconds. These are used as new training data samples to retrain the CGAN for better performance. Most of the generated metagratings from the trained CGAN have efficiencies over 75%, much better than randomly generated metagratings, whose best efficiency is less than 30%. The efficiency of irregular metagratings generated by the CGAN together with a few iterative optimization procedures can reach over 90%, comparable to that of iterative-only optimization methods.
Firstly, the discriminator is trained several times to distinguish generated features from original features. Then, the discriminator works as a supervisor to train the generator until the generator can produce data features to deceive the discriminator itself. The training procedures of discriminator and generator are alternatively repeated several times until the generator can produce almost "real" data features. Like a CVAE, a CGAN also merges data labels into the model to generate new data features. Random noise and data labels are concatenated and then applied as input to the generator. We take reference [103] as an example to illustrate the process of training a CGAN as an inverse model to predict the irregular geometries of metagratings. The target optical properties are diffraction angle and diffraction wavelength . As shown in Figure 9, the label which describes the optical properties is combined with random noise to work as the input of generator. The random noise is utilized to improve the robustness of the generator as well as to generate different irregular geometries. The generator, which consists of two fully-connected layers, four de-convolutional layers and a Gaussian filter, produces new irregular geometries. The discriminator consists of convolutional layers and fully-connected layers. It works as a supervisor to determine whether a generated irregular geometry shares the same properties with the prepared irregular geometries. The generator and discriminator are trained alternately. 600 high-resolution irregular half-optimized metagratings are used as the initial training data samples. After training, the CGAN generates thousands of irregular metagratings within seconds. They are used as new training data samples to retrain CGAN for better performance. Most of the generated metagratings from the trained CGAN have efficiencies over 75%, much better than randomly generated metagratings whose best efficiency is less than 30%. 
The efficiency of irregular metagratings generated by CGAN together with a few iterative optimization procedures could reach up to over 90%, comparable to that of iterative-only optimization methods. Figure 9. Training process of a CGAN to design irregular metagratings.
Building on the CGAN, Liu et al. further merged a pre-trained forward neural network with the CGAN to calculate the optical response of a generated device [114]. The generator was trained with two aims: the first to diminish the difference between the generated structures and real structures under the supervision of the discriminator, and the second to reduce the difference between the targeted optical response and the generated optical response, which is calculated via the pre-trained forward neural network. In this way, apart from making the generated device geometry similar to real device structures, the transmission efficiency of the generated device is also driven to match the target optical response.
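The alternating discriminator/generator schedule described above can be sketched with a deliberately tiny toy model. Everything here is a hypothetical stand-in for reference [103]: the "geometry" is a scalar, the generator and discriminator are linear/logistic models rather than the convolutional networks of the paper, the label-to-geometry mapping is invented, and finite-difference gradients replace back-propagation. Only the update loop structure is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def num_grad(f, w, eps=1e-5):
    # Finite-difference gradient; a real CGAN would use back-propagation.
    g = np.zeros_like(w)
    for i in range(w.size):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (f(wp) - f(wm)) / (2 * eps)
    return g

def generator(y, n, wg):
    # Maps (label, noise) -> scalar "geometry"; stand-in for the FC/de-conv generator.
    return wg[0] * y + wg[1] * n + wg[2]

def discriminator(g, y, wd):
    # P(real | geometry, label); stand-in for the conv/FC discriminator.
    return 1.0 / (1.0 + np.exp(-(wd[0] * g + wd[1] * y + wd[2])))

def d_loss(wd, wg, y, n, g_real):
    p_real = discriminator(g_real, y, wd)
    p_fake = discriminator(generator(y, n, wg), y, wd)
    return -np.mean(np.log(p_real + 1e-9) + np.log(1.0 - p_fake + 1e-9))

def g_loss(wg, wd, y, n):
    # Non-saturating loss: the generator tries to make fakes look real.
    p_fake = discriminator(generator(y, n, wg), y, wd)
    return -np.mean(np.log(p_fake + 1e-9))

wg = 0.1 * rng.normal(size=3)
wd = 0.1 * rng.normal(size=3)
lr = 0.05

for step in range(1500):
    y = rng.uniform(0.0, 1.0, size=64)                   # labels (target optical property)
    n = rng.normal(size=64)                              # random noise input
    g_real = 2.0 * y + 1.0 + 0.05 * rng.normal(size=64)  # toy "optimized" geometries
    # 1) update the discriminator to tell real from generated geometries
    wd -= lr * num_grad(lambda w: d_loss(w, wg, y, n, g_real), wd)
    # 2) update the generator to fool the updated discriminator
    wg -= lr * num_grad(lambda w: g_loss(w, wd, y, n), wg)

# After training, the generator maps a label (plus noise) to a geometry.
print(generator(0.5, 0.0, wg))
```

In a full implementation the inner discriminator update would typically run several steps per generator step, as the text describes.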

Unsupervised Generative Neural Networks
In addition to learning from data samples, a DNN can also work as a trainable function merged with EM-simulator-based optimization algorithms. In this way, no training data samples are required.

We take reference [100] as an example to illustrate the process of building and training an unsupervised neural network as an inverse model to generate irregular metagratings. The irregular geometries of the metagrating are again produced by a well-trained generator for a given optical response and random noise. As shown in Figure 10, the device geometry G is a function of the diffraction wavelength λ, the diffraction angle θ and the random noise n, as shown in Equation (21):

G = f_1(λ, θ, n; W), (21)

where f_1(·) is the function described by the generator with trainable parameter matrix W. With the metagrating geometry G known, the optical response E_G can be calculated by a forward EM simulation as in Equation (22):

E_G = f_2(G), (22)

where f_2(·) is the non-analytical function computed by the EM simulation. The loss function is an analytical function defined by the designer as in Equation (23):

L = f_3(E_G, E_T). (23)

The aim of f_3(·) is to diminish the difference between the target optical response E_T and the generated optical response E_G. To train this deep generator, back-propagation is applied as in Equation (24):

∂L/∂W = (∂f_3/∂E_G) · (∂f_2/∂G) · (∂f_1/∂W). (24)

The first term ∂f_3/∂E_G is an explicit function, which is easy to calculate analytically. The second term ∂f_2/∂G, which is an implicit function, can be solved through the adjoint method with an inverse simulation, as derived in Section 2. The last term ∂f_1/∂W is the gradient of the generator itself, which can be obtained via back-propagation as in other DNNs.

In this way, the generator is no longer trained on prepared data samples but with the assistance of EM simulation through the chain rule. After the training process, the generator is able to produce multiple metagratings with different geometric parameter sets for a target optical response.
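The chain rule of Equation (24) can be illustrated with a toy differentiable stand-in for each of f_1, f_2 and f_3. All shapes and functions below are hypothetical; in particular, the linear map standing in for f_2 replaces a real EM simulation, whose gradient would in practice come from an adjoint simulation:

```python
import numpy as np

rng = np.random.default_rng(1)

W = rng.normal(size=(4, 3))            # trainable generator parameters
x = rng.normal(size=4)                 # input: (λ, θ, noise) packed into one vector
E_T = np.array([1.0, 0.0])             # target optical response
A = rng.normal(size=(2, 3))            # fixed weights of the toy "EM solver"

f1 = lambda W: np.tanh(x @ W)          # G   = f1(λ, θ, n; W): toy generator
f2 = lambda G: A @ G                   # E_G = f2(G): linear stand-in for EM simulation
f3 = lambda E: np.sum((E - E_T) ** 2)  # L   = f3(E_G, E_T): designer-defined loss

# Forward pass
G = f1(W)
E_G = f2(G)

# Equation (24): dL/dW = (df3/dE_G) · (df2/dG) · (df1/dW)
dL_dE = 2 * (E_G - E_T)                # first term: explicit, analytical
dL_dG = A.T @ dL_dE                    # second term: the adjoint ("inverse simulation") step
dL_dW = np.outer(x, (1 - G ** 2) * dL_dG)  # third term: back-prop through the generator

# Sanity check against finite differences
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (f3(f2(f1(Wp))) - f3(f2(f1(Wm)))) / (2 * eps)

print(np.max(np.abs(dL_dW - num)))     # agreement up to finite-difference error
```

The only non-analytical step in the real setting is dL_dG: there, each gradient evaluation costs one extra (adjoint) EM simulation rather than a matrix transpose.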
Reinforcement learning, which merges a DNN with an EM simulator, is another approach that requires no training data samples. DNNs based on reinforcement learning generate new data features, which are evaluated by the EM simulator and fed back for the training of the DNNs. Sajedian et al. utilized reinforcement learning for silicon-based color printing design [101]. Different from the examples discussed above, which make use of the gradient of the EM simulator, reinforcement learning makes decisions based on the rewards of short-term and long-term gains.

Comparison of DNNs for the Design of Nanophotonics on Silicon Platform
In the preceding sections, we discussed some popular DNNs used as forward models and inverse models to accelerate the design process of silicon photonics. Here, we classify and compare the advantages and disadvantages of these DNNs for the design of nanophotonics on the silicon platform.
A forward model built via a DNN serves as a boosted EM simulator for devices whose initial structures are similar to the training data samples. A well-trained forward model can predict the optical response for given geometric parameters at almost no time cost. By combining a forward model with iterative optimization algorithms, the geometric parameters for a target optical response can be quickly optimized. MLPs and CNNs are both discriminative neural networks and have been widely used as forward models. The MLP is the simplest and most efficient DNN for geometries with low DOF such as empirical structures. With increasing DOF, both the number of training data samples and the depth of the MLP have to be increased greatly. Data dimensionality reduction approaches can be used to map high-DOF devices to a lower data dimension, but tricky data preprocessing is needed in that case. Another method is to merge the data preprocessing directly into the DNN, as in a CNN, whose convolutional and pooling layers are innately capable of processing high-dimensional data features. CNNs are widely used for components with QR-code-like or irregular structures.
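As a minimal illustration of a DNN forward model, the sketch below trains a one-hidden-layer MLP to map three "geometric parameters" to a scalar "transmission". The dataset, the true mapping standing in for EM-simulated labels, and the network sizes are all hypothetical toy choices, not taken from any of the cited works:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy dataset: 1000 devices with 3 geometric parameters each; the label
# below stands in for an EM-simulated transmission value.
X = rng.uniform(-1, 1, size=(1000, 3))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.5 * X[:, 2]

# One-hidden-layer MLP trained with full-batch gradient descent.
W1 = 0.5 * rng.normal(size=(3, 32))
b1 = np.zeros(32)
W2 = 0.5 * rng.normal(size=32)
b2 = 0.0
lr = 0.05

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

_, pred0 = forward(X)
loss0 = np.mean((pred0 - y) ** 2)      # MSE before training

for step in range(2000):
    H, pred = forward(X)
    err = pred - y                     # dL/dpred up to a constant factor (absorbed in lr)
    gW2 = H.T @ err / len(X)
    gb2 = err.mean()
    dH = np.outer(err, W2) * (1 - H ** 2)  # back-prop through tanh
    gW1 = X.T @ dH / len(X)
    gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, pred = forward(X)
final_mse = np.mean((pred - y) ** 2)
print(loss0, final_mse)                # final MSE well below the initial one
```

Once trained, evaluating `forward` replaces a simulation call inside an optimization loop, which is where the speed-up of DNN forward models comes from.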
Inverse models are more attractive in that they generate component geometry parameters for a target optical response almost without any iterative optimization process involved. However, an inverse model is much more difficult to build and train. Firstly, one optical response corresponds to more than one group of geometric parameters. Secondly, the optical response usually has lower dimensionality than the geometric parameters. Luckily, generative neural networks such as the CVAE and CGAN are capable of projecting a low dimension to a high dimension. Both of them try to model the underlying distributions of geometric parameters and optical responses. The difference is that the CVAE captures the underlying distribution via an encoder, while the CGAN utilizes a discriminator to judge whether the generated geometric parameters share the same distribution as the data samples. This structural difference also brings a training difference: the CVAE trains the encoder and decoder simultaneously, while the CGAN trains two DNNs, a generator and a discriminator, in an alternating way. After training, with the optical response and latent variables as input, a CVAE generates device geometric parameters via its decoder, while a CGAN produces geometric parameters via its generator.
Inverse models based on unsupervised generative neural networks are also appealing because they do not require training data samples. However, the time-consuming EM simulator has to be inserted into the training of the generative neural networks to guarantee that they obey physical laws. Once trained, unsupervised generative neural networks can also generate new geometric parameters according to the target optical response without any simulator involved.

Prospects
As discussed above, both iterative optimization algorithms and DNNs play vital roles in the inverse design of compact and high-performance silicon photonics. However, these design methodologies still face some open problems. In this section, we discuss the challenges of existing design methodologies and future research directions of inverse design in silicon photonics.

Simulation Time Budget
Gradient-free algorithms such as heuristic algorithms and DBS are widely used in inverse design. However, these methods require a great number of EM simulations to find a parameter set with an acceptable FOM. Heuristic algorithms need to perform EM simulations several times the DOF at each iteration to find the parameter update directions. DBS, as a brute-force searching scheme, is mainly used for devices with QR-code-like structures. Each simulation updates only one pixel, so the number of simulations required to update the whole QR code in one round equals the DOF, and usually several rounds of DBS are needed for a device to reach an acceptable FOM. Chen et al. have shown that it takes more than 5000 simulations to optimize a mode converter with 1120 DOF [115]. As the DOF of devices increases, the time budget for the required simulations becomes extremely large. Therefore, they proposed a new method that combines DBS with gradient-based algorithms, cutting down the total number of simulations at the cost of some decrease in performance. Improving the efficiency of gradient-free algorithms is a prerequisite for applying them to the design of more complicated devices.
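The per-round simulation count of DBS can be seen in a small sketch. The figure of merit below, pixel agreement with a random target pattern, is a hypothetical stand-in for a real EM simulation; the loop structure (one simulation per pixel per round, keeping a flip only if the FOM improves) is the DBS scheme itself:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical FOM standing in for a full EM simulation; in a real DBS run
# each call would be one 3D-FDTD simulation.
target = rng.integers(0, 2, size=(8, 8))
def fom(pattern):
    return np.mean(pattern == target)       # 1.0 when the pattern matches the target

pattern = rng.integers(0, 2, size=(8, 8))   # initial QR-code-like pattern, 64 DOF
n_sims = 0

for rnd in range(3):                        # several rounds of DBS
    improved = False
    for i in range(pattern.shape[0]):       # one simulation per pixel per round
        for j in range(pattern.shape[1]):
            base = fom(pattern)
            pattern[i, j] ^= 1              # flip one pixel
            n_sims += 1
            if fom(pattern) <= base:        # keep the flip only if the FOM improves
                pattern[i, j] ^= 1          # otherwise revert
            else:
                improved = True
    if not improved:                        # stop when a full round brings no gain
        break

print(n_sims, fom(pattern))
```

Even on this trivial FOM, the simulation count is a multiple of the DOF per round; with a real simulator taking minutes per call, this is exactly the time budget the text describes.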
In addition to cutting down the number of EM simulations required for iterative optimization, a time-efficient EM simulator can also reduce the time budget of inverse design. Even though 3D-FDTD is computationally intensive, its simulation speed scales with the number of processors. High-performance computing techniques, including parallelization and distributed computing, are already adopted in commercial [116] and open-source [117] FDTD solvers, allowing designers to reduce the simulation time with additional computing resources. Implementations of FDTD or FEM on graphics processing units (GPUs) have also been investigated for higher simulation speed [118,119]. Advances in these attempts will also help reduce the time budget of inverse design.

Local Optimum and Minimal Features
TO is capable of optimizing free-form irregular devices with thousands of DOF. However, an inappropriate initial structure may trap the final optical performance in a local optimum, and it takes trial and error to find an appropriate initial structure for a specific target. This issue could be addressed by designing the initial structure according to physical laws or by automatically adjusting initial structures based on the optimization results. More interestingly, some researchers have combined topology optimization with DNNs for the global optimization of silicon metagratings [100], which could also be introduced to the design of integrated silicon photonics.
Another emerging problem for free-form TO is that the optimized devices may have very small features which are difficult for the fabrication process. Apart from "soft" control methods such as filter-based or transform-domain-based methods, "hard" control methods have also been proposed. For example, Wang et al. confined the device structure to be QR-code-like with a controllable minimal feature size [120]. However, these "digital" devices constrain the minimal feature size at the cost of DOF. More efficient methods to avoid local optima and control the minimal feature size are still desired for gradient-based free-form optimizations.

Data Sample Issue
Training a fully supervised DNN as a forward model takes around 10,000 data samples, as the examples in Section 3.2 suggest. These data samples are collected by time-consuming EM simulations, which may cost even more time than iterative optimization algorithms. Therefore, DNN-based forward models only show their advantages in scenarios where piles of similar devices are needed. Data dimensionality reduction methods like PCA and the Fourier transform have been applied to ease the burden of data sample collection, but the effect is not fully satisfactory: PCA requires optimized data samples to find the lower-dimensional sub-space, and the Fourier transform discards some information through high-frequency filtering.
Inverse models project low-dimensional data labels to high-dimensional data features, which requires even more data samples than a forward model using traditional DNNs. The number of data samples can be reduced to a few thousand or even a few hundred by using semi-supervised DNNs such as the CVAE or CGAN. However, these inverse models require optimized or semi-optimized data samples; without high-performance training data, the generative neural networks can hardly generate devices with high optical performance. Lowering the demand on the number and quality of training data samples is a promising direction for future research. Unsupervised learning for inverse models entirely removes the burden of massive data sample collection. However, an EM simulator must be merged into the training process of unsupervised generative neural networks to guarantee that the unsupervised DNN obeys physical rules. Apart from the time-consuming training process in which the EM simulator has to be involved, inverse models based on unsupervised DNNs are not as general as those based on supervised DNNs. In other words, a well-trained DNN assisted by an EM simulator may not be suitable for devices with diverse optical properties, because such a DNN is trained towards the direction of a single target optical response. Improving the versatility of unsupervised-learning-based inverse models is also an interesting and meaningful topic for further study.

Application of Inverse Design in Optical Neural Networks
Artificial intelligence, which mimics the human learning process, is capable of handling problems in almost every field. With the increasing demand for computational resources for complex tasks, traditional electronics-based computation is reaching its limits. The ONN is an appealing alternative to conventional electronic computation for handling complex tasks, since optics can perform parallel computations with much lower power consumption [3,12,[121][122][123]. There are two popular integrated ONN schemes, i.e., layered ONNs and "black-box" ONNs.

Layered ONNs
Layered ONNs mimic traditional layered feed-forward neural networks, which consist of convolutional layers or fully-connected layers. Taking fully-connected layers as an example, the weight matrix W can be decomposed by singular value decomposition (SVD) as W = UΣV^T, where U and V^T are unitary matrices and Σ is a diagonal matrix. It has been proved that the unitary matrices can be realized via all-optical devices [124,125], and the diagonal matrix can be achieved by attenuators. Shen et al. cascaded programmable Mach-Zehnder interferometers (MZIs) for different unitary matrix designs [126], while Qu et al. utilized a topology optimization algorithm to design the unitary matrix, which is much more compact than cascaded MZIs [80]. In both of the above-mentioned ONN architectures, the optical weight matrices are connected to a personal computer for realizing the nonlinear activation, as optics is linear on the silicon platform when the intensity of the input light is low. Even so, the computation speed of a trained ONN is several orders of magnitude faster than that of traditional electronic DNNs with comparable prediction accuracy. With an optical activation function, the computation speed could be accelerated even further [127]. The inverse design of nonlinear material proposed by Hughes et al. can be used for designing such an optical activation function [128]. Similarly, Chen et al. utilized phase change material (PCM) inserted in the silicon platform to work as an optical switch [129], which could also serve as a basis for activation function design.
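The SVD factorization W = UΣV^T underlying layered ONNs can be checked numerically; the 4x4 weight matrix below is an arbitrary example:

```python
import numpy as np

rng = np.random.default_rng(4)

# A hypothetical 4x4 weight matrix of one fully-connected ONN layer.
W = rng.normal(size=(4, 4))

# SVD: W = U Σ V^T, with U and V^T unitary (realizable with MZI meshes or
# inverse-designed components) and Σ diagonal (realizable with attenuators).
U, s, Vt = np.linalg.svd(W)

assert np.allclose(U @ U.T, np.eye(4))        # U is unitary
assert np.allclose(Vt @ Vt.T, np.eye(4))      # V^T is unitary
assert np.allclose(U @ np.diag(s) @ Vt, W)    # exact reconstruction of W

# Applying the layer optically: the input field passes V^T, then Σ, then U.
x = rng.normal(size=4)
print(np.allclose(U @ (np.diag(s) @ (Vt @ x)), W @ x))   # True
```

The decomposition is what makes the optical implementation modular: the two unitary stages carry the "direction" of the weights while a single row of attenuators carries the singular values.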

"Black-Box" ONNs
"Black-box" ONNs go beyond the paradigm of layered feed-forward networks. Through an inverse designed platform, "black-box" ONNs map input signal to desired output ports  [130]. The handwritten digits are modulated to optical signals and fed to the input waveguides of NNM. After interactions with NNM, the modulated optical signals are transferred to ten output waveguides, which represent digits from 0 to 9. The output waveguide ports with highest light intensity suggests the input digit to be such value. Their "black-box" ONN is shown to be much more compact than the layered ONNs as no layers or neurons are needed in this case. However, the "black-box" ONN is not scalable. For more difficult tasks, the footprint of "black-box" ONN needs to be increased, which makes the inverse design of NNM too time-consuming to be realized.
Integrating all-optical neural networks on the silicon platform is still an emerging and significant research topic. The silicon platform, with its large effective refractive index, is compatible with CMOS technology, which makes cheap and massive fabrication possible. With the assistance of inverse design, many compact components could be designed and integrated into complex ONNs, while the energy consumption can be reduced compared with electronic DNNs.

Conclusions
Silicon photonics has been widely researched due to its high refractive index, which allows light to be manipulated efficiently even at the subwavelength scale. Different inverse design methods have been proposed and improved to reach the design limits of the silicon platform for different application scenarios with high computational efficiency. In this review paper, we summarized the iterative optimization methods as well as the newly-proposed DNNs for the inverse design of silicon photonics. For iterative optimization methods, we introduced heuristic algorithms like GA and PSO, as well as DBS and TO, for different scenarios in the sequence of increasing DOF. DNN-assisted inverse design can generate multiple devices with similar design area and optical response several orders of magnitude faster than iterative optimization algorithms, which is particularly useful for situations like arbitrary power splitters where many optical devices are needed. In addition, we highlighted open issues with the existing inverse design methodologies, such as the large time budget of gradient-free algorithms, the local optima of gradient-based algorithms and the data burden of DNNs. Since ONNs are quite appealing with the growing demand for artificial intelligence in various fields, we also discussed the role of inverse design and its corresponding methodologies in realizing ONNs. Inverse-designed ONNs would be an interesting topic for future research in silicon photonics.