1. Introduction
Hermite–Gaussian (HG) modes are solutions of the wave equation in Cartesian coordinates under the paraxial approximation and represent the eigenmodes of optical resonators with square mirrors. They are commonly denoted HGnm, with n and m indicating the number of nodes along the x and y axes of the light spot, respectively. As n and m increase, the size of the light spot also increases. Higher-order HG modes not only carry richer information but also provide more degrees of freedom, making them uniquely advantageous for addressing the capacity crisis in optical communications and enabling the sustainable expansion of high-speed, high-capacity optical communication systems [1,2,3,4]. For example, higher-order HG modes are utilized in optical tweezers to trap and manipulate microscopic particles [5]. In quantum communication, they enable the encoding of high-dimensional quantum states, thereby enhancing both the channel capacity and the security of quantum communication [6]. In laser processing, they can be employed to achieve high-precision material processing [7]. In precision measurement, they provide higher accuracy in beam displacement measurements [8].
Currently, there are two primary approaches for generating higher-order HG modes: generating them directly within the laser cavity or shaping the beam outside the cavity. Although direct generation in the cavity is simple and convenient, it requires the integration of specialized optical components, which can complicate the structure and reduce its stability [9,10,11,12]. Alternatively, beam shaping outside the cavity uses diffractive optical elements, such as a spatial light modulator (SLM), to convert the fundamental mode into higher-order HG modes by adjusting the loaded hologram, offering greater flexibility [13,14,15,16,17].
Utilizing an SLM together with a phase retrieval algorithm enables the efficient generation of higher-order HG modes. The Gerchberg–Saxton (GS) algorithm is a commonly used phase retrieval method that iteratively optimizes the target phase distribution through Fourier transforms [18]. However, the GS algorithm has inherent limitations, such as the existence of multiple solutions and a slow convergence speed. Several improved methods have been proposed to enhance its performance [19,20], but these modified algorithms often introduce new variables, increasing the complexity and requiring more experience to master their convergence characteristics and to select the most suitable approach.
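For concreteness, the GS iteration described above can be sketched as follows (a minimal NumPy illustration; the propagation model, normalization, and stopping criterion are simplified assumptions rather than the exact implementation used in this work):

```python
import numpy as np

def gerchberg_saxton(target_amp, source_amp, n_iter=200):
    """Classic GS loop: retrieve the SLM phase that maps the source amplitude
    (front focal plane) onto the target amplitude (back focal plane)."""
    phase = np.random.uniform(0, 2 * np.pi, source_amp.shape)
    for _ in range(n_iter):
        # forward propagation (Fourier transform between the focal planes)
        far = np.fft.fftshift(np.fft.fft2(source_amp * np.exp(1j * phase)))
        # impose the target amplitude, keep the propagated phase
        far = target_amp * np.exp(1j * np.angle(far))
        # backward propagation
        near = np.fft.ifft2(np.fft.ifftshift(far))
        # impose the source amplitude, keep the retrieved phase
        phase = np.angle(near)
    return phase  # hologram phase to load onto the SLM
```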
Owing to the development of deep learning, convolutional neural networks (CNNs) have showcased outstanding capabilities in various fields, such as computer vision, image processing, and pattern recognition [21]. By utilizing large-scale datasets, CNNs can effectively learn to solve optical imaging problems and have been applied to many challenging tasks, including super-resolution imaging [22], tomography [23], and holographic imaging [24]. Therefore, harnessing CNNs to generate the higher-order HG modes has become a topic worthy of further investigation.
In our previous work, we proposed a data-driven deep learning method that combines a CNN with the GS algorithm [25]. In this method, the CNN is trained to learn the relationship between the output and input intensity distributions of the GS algorithm and then predicts the corresponding input intensity distribution from the target intensity distribution. The predicted input intensity distribution is fed into the GS algorithm to obtain the hologram loaded onto an SLM. Using this data-driven deep learning method, we generated higher-order HG modes of different orders as well as optical fields with arbitrary intensity distributions, and compared with the GS algorithm, it generates target optical fields of higher quality. However, this method employs a supervised learning strategy that relies heavily on a substantial volume of labeled data, and the resulting model generally exhibits limited interpretability and generalization capability, which constrains its further practical application and development.
To overcome this heavy reliance on large amounts of training data, hybrid approaches that integrate a physical model with deep learning have emerged. Such an approach combines the strengths of both: it leverages the prior knowledge and interpretability inherent in the physical model while harnessing the powerful automatic feature extraction and generalization capabilities of deep learning. Untrained neural networks combined with physical models have already been used for phase retrieval, and their superior performance has been experimentally validated [26]. This paper proposes a beam shaping method, termed the GS-CNN, which integrates the physical model of the GS algorithm with a neural network. Specifically, the GS-CNN builds its network architecture around the Fourier transform between the front and back focal planes, the physical model on which the GS algorithm relies. This design fully exploits the CNN's capability for autonomous learning and adaptation while incorporating the prior knowledge of the physical model as guidance, combining the strengths of both approaches. Using the GS-CNN, we have successfully generated higher-order HG modes of different orders, as well as optical fields with arbitrary intensity distributions. Experimental results demonstrate that, compared with the traditional data-driven CNN method, the GS-CNN significantly improves efficiency and enhances the quality of the generated modes. Moreover, because the physical model is embedded in the network, the model also possesses a degree of interpretability, providing solid support for the further development and application of beam shaping technology.
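As an illustration of this architecture-level integration, the core physical model can be written as a differentiable layer that propagates the SLM-plane field to the back focal plane via a Fourier transform (a minimal PyTorch sketch under simplified assumptions; the class and variable names are ours, not the exact implementation):

```python
import torch

class FourierPropagation(torch.nn.Module):
    """Physical-model layer: lens Fourier transform from the front (SLM)
    focal plane to the back focal plane, as used in the GS algorithm."""
    def forward(self, amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
        field = torch.polar(amplitude, phase)                  # complex field at the SLM plane
        far_field = torch.fft.fftshift(torch.fft.fft2(field))  # propagate to the back focal plane
        return torch.abs(far_field) ** 2                       # intensity at the back focal plane
```

Because this layer is fully differentiable, the error between its output intensity and the target intensity can be backpropagated through the physical model into the CNN that predicts the phase.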
3. Results and Discussion
U-Net is widely used in computational imaging because its encoder–decoder architecture, combined with skip connections, balances global and local information, achieves efficient multi-scale feature fusion, and generalizes well. Moreover, it offers high parameter efficiency and fast training convergence, making it suitable for a variety of imaging tasks. Therefore, U-Net is selected as the backbone architecture of the GS-CNN.
In this paper, we propose an improved U-Net architecture whose core innovations are multi-scale phase feature extraction and a dynamic fusion mechanism. The target intensity distributions (HG10, HG30, and CTGU) were all simulated numerically in Python 3.6.5; the HG10 and HG30 modes were generated from the theoretical expressions of the higher-order HG modes [2]. For both the training set and the test set, two groups of HG10, HG30, and CTGU intensity distributions with different spot sizes were selected for computational analysis. Since the GS-CNN requires only a single target optical field intensity distribution as its input, there is no need to split the data into training and test sets in the traditional proportional manner: one group of HG10, HG30, and CTGU intensity distributions is used for training, and the other group is used to validate the model's generalization ability.

In the encoder, the model employs a four-stage progressive downsampling structure. Each stage consists of two 3 × 3 convolutional layers, with 32, 64, 128, and 256 channels in sequence, accompanied by batch normalization and LeakyReLU activation (α = 0.2). This design effectively extracts multi-scale features, from local phase gradients in the shallow layers to the full-field phase distribution in the deep layers. In particular, we introduce a residual connection in the third encoder stage: the input of this stage is added directly to the output of its two 3 × 3 convolutional layers (with 128 channels) to enhance the model's ability to represent phase jump regions. The bottleneck adopts a full-resolution densely connected block with 512 channels; by maintaining the full resolution and using dense connections, this module makes full use of the feature information and establishes a non-linear mapping between the wrapped phase and the true phase through cascaded 1 × 1 and 3 × 3 convolutional layers. In the decoder, transposed convolutions perform 2× upsampling, and channel attention-weighted fusion is applied with the feature maps of the corresponding encoder layers. We employed the RMSE between the target intensity distribution and the generated intensity distribution of the higher-order HG modes as the loss function to guide the training and optimization of the neural network.
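A minimal sketch of one such encoder stage, under the configuration stated above (3 × 3 convolutions, batch normalization, LeakyReLU with α = 0.2, and the residual connection used in the third stage), is shown below; the downsampling, dense bottleneck, and decoder are omitted, and the exact implementation may differ:

```python
import torch

def conv_block(in_ch: int, out_ch: int) -> torch.nn.Sequential:
    """Two 3x3 convolutions, each followed by batch normalization and LeakyReLU (alpha = 0.2)."""
    return torch.nn.Sequential(
        torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        torch.nn.BatchNorm2d(out_ch),
        torch.nn.LeakyReLU(0.2, inplace=True),
        torch.nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        torch.nn.BatchNorm2d(out_ch),
        torch.nn.LeakyReLU(0.2, inplace=True),
    )

class ResidualEncoderStage(torch.nn.Module):
    """Third encoder stage (128 channels): the stage input is added to the
    output of the two convolutional layers (residual connection)."""
    def __init__(self, channels: int = 128):
        super().__init__()
        self.block = conv_block(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + x
```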
The channel attention-weighted fusion obtains channel weights by applying global average pooling and a fully connected layer along the channel dimension of the feature maps and then multiplies these weights with the feature maps to achieve fusion. This design significantly improves the sharpness of phase edges. The final output layer uses an improved periodic activation function (with a period of 2π) to predict the unwrapped phase directly, avoiding the error accumulation of the phase unwrapping step in traditional methods. The entire network adopts an end-to-end training strategy.
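One possible realization of this attention-weighted skip fusion is sketched below (an SE-style gating module written in PyTorch; the reduction ratio and the way decoder and encoder features are combined are our assumptions rather than the exact design):

```python
import torch

class ChannelAttentionFusion(torch.nn.Module):
    """Weight the encoder skip features with channel attention (global average
    pooling + fully connected layers) before fusing them with decoder features."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = torch.nn.Sequential(
            torch.nn.Linear(channels, channels // reduction),
            torch.nn.ReLU(inplace=True),
            torch.nn.Linear(channels // reduction, channels),
            torch.nn.Sigmoid(),
        )

    def forward(self, decoder_feat: torch.Tensor, encoder_feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = encoder_feat.shape
        weights = self.fc(encoder_feat.mean(dim=(2, 3))).view(b, c, 1, 1)  # per-channel weights
        return decoder_feat + weights * encoder_feat                       # weighted fusion
```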
The training and testing of the neural network model were both carried out on a computer running the Ubuntu 24.04.1 LTS operating system, equipped with an Intel Xeon Gold 6438Y processor and an NVIDIA RTX A6000 GPU with 48 GB of memory. We used the PyTorch 2.5 deep learning framework, wrote the code in Python 3.6.5 based on this framework, and used the PyCharm 2023.1 platform for code development. The experimental parameters are as follows: the beam wavelength is 1064 nm, the SLM has a resolution of 1108 × 1108, and the waist radius of the incident Gaussian beam is 5 mm. We published our code at
https://github.com/zaishuiyifang123/HG (accessed on 25 July 2025).
The target intensity distributions are selected as the HG10 mode and HG30 mode.
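For reference, target intensities of this kind can be computed directly from the theoretical HG mode expressions mentioned above, as in the following sketch (NumPy/SciPy; the grid size, beam waist, and normalization are illustrative choices, not the exact settings used in this work):

```python
import numpy as np
from scipy.special import hermite

def hg_intensity(n: int, m: int, w0: float = 1.0, size: int = 512, extent: float = 3.0) -> np.ndarray:
    """Normalized intensity of an HGnm mode at the beam waist w0."""
    x = np.linspace(-extent * w0, extent * w0, size)
    X, Y = np.meshgrid(x, x)
    Hn, Hm = hermite(n), hermite(m)          # physicists' Hermite polynomials
    field = (Hn(np.sqrt(2) * X / w0) * Hm(np.sqrt(2) * Y / w0)
             * np.exp(-(X**2 + Y**2) / w0**2))
    intensity = field**2
    return intensity / intensity.max()

hg10 = hg_intensity(1, 0)   # one node along x
hg30 = hg_intensity(3, 0)   # three nodes along x
```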
Figure 3 illustrates the intensity distributions of the HG10 and HG30 modes generated by the GS-CNN, along with the required holograms. Figure 3a shows the standard intensity distributions of the HG10 and HG30 modes, Figure 3b presents the intensity distributions generated by the GS-CNN, and Figure 3c displays the holograms to be loaded. As illustrated in Figure 3, the GS-CNN is capable of generating an intensity distribution that closely approximates the standard one. It can also be observed that as the order of the generated mode increases, the mode distribution becomes more complex and, accordingly, the required hologram becomes more intricate. In the holograms, different gray values represent different phase values. In addition, traces of the HG10 and HG30 intensity distributions can be discerned in the holograms. This occurs because, while optimizing the phase distribution to match the target intensity, the algorithm inadvertently embeds modulation features related to the target intensity distribution into the phase information [27].
In our study, we opted for a resolution of 512 × 512 pixels. Figure 4 illustrates the variation in the loss function value of the GS-CNN over 20,000 training epochs. As evident from Figure 4, the loss function value tends to stabilize and reaches a near-minimal level after 10,000 training epochs. Consequently, we set the number of training epochs to 10,000.
Next, we proceeded with a quantitative analysis. We compared the GS-CNN with the data-driven deep learning method [25] through experimental simulations. For both methods, we conducted simulations employing the aforementioned neural network architecture and parameters.
Figure 5 shows the experimental results for the generated HG10 mode, the HG30 mode, and the abbreviation “CTGU” (representing our university); taking “CTGU” with a uniform intensity distribution as an example illustrates that the GS-CNN can effectively generate arbitrary intensity distributions. To ensure a fair and objective evaluation of the mode quality obtained with the two methods, we used the same incident Gaussian beam for both, with a beam size of 3 mm and a planar wavefront. The first column in Figure 5 displays the standard intensity distributions of the generated modes, the second column shows the intensity distributions generated by the data-driven deep learning method, and the third column presents the intensity distributions generated by the GS-CNN. We use the RMSE, peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) to evaluate the quality of the generated modes. As can be seen from Figure 5, compared with the HG10 mode, HG30 mode, and CTGU generated by data-driven deep learning, the RMSE values of the three modes generated by the GS-CNN decreased by 0.01, 0.012, and 0.016, respectively, the PSNRs improved by 4.6 dB, 4.0 dB, and 2.3 dB, respectively, and the SSIMs improved by 0.127, 0.054, and 0.057, respectively. The results also show that as the spatial distribution of the generated mode becomes more complex, the quality of the mode generation gradually declines. The differences in the difficulty of generating different higher-order modes stem primarily from the combined influence of mode complexity and neural network performance. Firstly, higher-order modes exhibit more complex spatial distributions and contain a greater number of nodes, necessitating more precise phase control; as illustrated in Figure 3, the hologram required for the HG30 mode is significantly more intricate than that for the HG10 mode, and complex structures are more susceptible to noise and systematic errors, which increases the generation difficulty. Secondly, neural network architectures encounter bottlenecks in feature extraction and generalization when dealing with complex mode distributions: the more complex the mode distribution, the stronger the multi-scale feature fusion capability required to capture both local details and global structures.
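The three metrics used above can be computed as in the following sketch (NumPy plus scikit-image; the assumption that both intensity patterns are normalized to [0, 1] is ours):

```python
import numpy as np
from skimage.metrics import structural_similarity

def evaluate(generated: np.ndarray, target: np.ndarray):
    """RMSE, PSNR, and SSIM between a generated and a target intensity pattern."""
    rmse = np.sqrt(np.mean((generated - target) ** 2))
    psnr = 20 * np.log10(1.0 / rmse)                      # peak value is 1 after normalization
    ssim = structural_similarity(generated, target, data_range=1.0)
    return rmse, psnr, ssim
```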
As can be clearly seen from Figure 5, the intensity distributions of the modes generated by data-driven deep learning exhibit a speckled pattern, while the intensity distributions of the modes generated by the GS-CNN are more uniform and free of speckles. The reason is that data-driven deep learning relies heavily on a large amount of labeled data. This purely data-oriented learning approach leaves the model without prior physical knowledge, so non-essential features in the training data, such as noise, are easily captured by the model. In contrast, the GS-CNN incorporates the physical model as a constraint in the training process. The prior knowledge provided by the physical model enhances the model's generalization ability, enabling precise control over mode generation and ultimately resulting in a more uniform intensity distribution.
We also analyzed the runtime of the two methods in detail. The data-driven deep learning method uses 5000 pairs of training data, and its network training takes approximately 49 h. In contrast, the GS-CNN does not require pre-training; its overall computation time is concentrated mainly in the iterative optimization between the neural network and the physical model, which takes about 6 min for 20,000 epochs. This method thus avoids lengthy data collection and training procedures while ensuring high reconstruction quality, fully demonstrating its advantage in computation time.
The data-driven deep learning method, as a traditional application of deep learning in computational imaging, relies heavily on large-scale labeled datasets. During training, such methods require tens of thousands of input–output data pairs and continuously adjust the weights and biases of the neural network to learn the mapping from input to output. Moreover, because of their excessive reliance on the statistical patterns of the training data, their performance often declines sharply when they encounter new scenarios that differ significantly from the training data, leading to prediction errors or noise interference [28,29]. In contrast, the GS-CNN adopts a novel neural network design paradigm, namely a physics-enhanced deep neural network that does not require pre-training. Its core idea is to integrate the physical model directly into the neural network architecture. Through the interaction between the physical model and the neural network, it eliminates the reliance on large-scale labeled data and achieves high-quality phase reconstruction with only a single image of the target light field intensity distribution as input; the constraints of the physical model and the optimization of the neural network together yield a high-quality light field intensity distribution. This not only significantly simplifies data acquisition and processing but also enhances the model's adaptability and robustness in new scenarios.
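To make this training-free, single-input paradigm concrete, the optimization loop can be sketched as follows (a simplified PyTorch illustration building on the Fourier-propagation idea above; the placeholder CNN, learning rate, and normalization are our assumptions, not the exact implementation):

```python
import math
import torch

# Placeholder CNN standing in for the improved U-Net: maps the target intensity to a phase map.
net = torch.nn.Sequential(
    torch.nn.Conv2d(1, 32, kernel_size=3, padding=1), torch.nn.LeakyReLU(0.2),
    torch.nn.Conv2d(32, 1, kernel_size=3, padding=1),
)

def physics(amplitude: torch.Tensor, phase: torch.Tensor) -> torch.Tensor:
    """GS physical model: Fourier transform between the focal planes."""
    field = torch.polar(amplitude, phase)
    return torch.abs(torch.fft.fftshift(torch.fft.fft2(field), dim=(-2, -1))) ** 2

target = torch.rand(1, 1, 512, 512)      # single target intensity (placeholder data)
gaussian = torch.ones(1, 1, 512, 512)    # incident Gaussian amplitude (placeholder)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

for epoch in range(10000):
    phase = 2 * math.pi * torch.sigmoid(net(target))            # predicted hologram phase
    generated = physics(gaussian, phase)
    generated = generated / generated.max()                      # normalize to [0, 1]
    loss = torch.sqrt(torch.mean((generated - target) ** 2))     # RMSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```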
The integration of the physical model endows the network with interpretability and enables the GS-CNN to be trained with just a single intensity distribution, freeing it from the heavy reliance of traditional data-driven methods on massive amounts of labeled data. Compared with traditional iterative methods such as the GS algorithm, the GS-CNN not only maintains physical interpretability but also accelerates phase retrieval through the neural network while generating modes of higher quality. Furthermore, in contrast to general supervised models, the GS-CNN does not require the construction of a large-scale dataset; it can complete training with just one set of data, significantly reducing data collection costs. However, the GS-CNN also faces certain challenges, such as the strong influence of the physical model's accuracy on the reconstruction results and potential convergence issues in certain extreme cases. The GS-CNN highlights the value of using physical models as constraints for neural networks and provides an initial verification of the advantages of combining physical models with neural networks. Nevertheless, the current study still has shortcomings: it lacks systematic sensitivity analyses of variations in beam parameters (such as the beam size and wavefront phase structure) and noise levels (including detection noise and environmental interference). In subsequent work, we plan to conduct controlled-variable experiments, such as adjusting the beam waist size and introducing Gaussian white noise, to quantitatively evaluate the model's robustness to parameter perturbations, thereby providing stronger evidence for the reliability of this method in complex scenarios.
In comparison with the traditional phase encoding method, which transforms a complex amplitude distribution function into a pure phase distribution function [30,31,32], the GS-CNN achieves algorithmic innovation by integrating a physical model with deep learning and directly learns the rules for generating phase distributions through the neural network. In terms of application scenarios, the traditional phase encoding method offers distinct advantages in dynamic light field modulation, whereas the GS-CNN is better suited to high-precision light field generation. The GS-CNN also differs from other hybrid approaches, such as Physics-Informed Neural Networks (PINNs). In terms of physical model integration, the GS-CNN makes the physical process a core component of the network computation through architecture-level fusion (replacing traditional convolutional layers with Fourier transform layers from the GS algorithm), whereas PINNs only impose indirect constraints by adding physical equation residual terms to the loss function. Regarding data dependency, the GS-CNN requires only a single target optical field intensity distribution for training, completely eliminating the need for paired datasets, while PINNs still require some labeled data to assist the convergence of the physical constraints. In terms of computational efficiency, the GS-CNN reduces the training time to 6 min (20,000 iterations), whereas the training efficiency of PINNs is strongly influenced by the complexity of the physical equations. Additionally, the intermediate-layer outputs of the GS-CNN (such as phase distributions) have clear physical meanings, supporting process traceability.