Deep Mutual Learning-Based Mode Recognition of Orbital Angular Momentum



Introduction
Vortex light carries orbital angular momentum (OAM) due to its spiral phase distribution. Theoretically, OAM has an infinite number of eigenstates, which can greatly improve communication capacity and security [1,2]. However, vortex light is affected by atmospheric turbulence distortion during transmission: the OAM state becomes dispersed, which seriously degrades communication performance [3]. Traditional OAM state detection relies mainly on complex optical elements, with the state identified after preprocessing such as diffraction or interference [4]. However, expensive devices and complex optical systems with limited detection capability conflict with the requirements of high-speed real-time communication.
With the application of deep learning to image processing, speech recognition, and natural language processing, optical vortex mode detection based on deep learning has also been studied extensively. Knutson et al. proposed detecting the OAM state with a Deep Neural Network (DNN) [5], using a 16-layer network to detect 100 types of single-mode vortex beams with an accuracy above 70%. In 2017, Doster et al. [6] studied OAM state detection based on a Convolutional Neural Network (CNN), using the AlexNet network to detect the OAM state from the high-resolution light intensity distribution of a Bessel-Gaussian beam at the focus. The scheme not only simplifies the structure of the receiver, but is also robust to turbulence intensity, data volume, sensor noise, and pixel resolution. Zhang et al. used a CNN as the demodulator of multiplexed vortex beams and compared the performance of a K-nearest-neighbor classifier, a Bayesian classifier, a Back-Propagation (BP) artificial neural network, and a CNN; under different turbulence conditions, the detection performance of the CNN was higher than that of the other demodulators [7]. They then improved the original LeNet-5 network and proposed a decoder that realizes OAM state detection and turbulence intensity detection simultaneously [8]. To address the problem of detecting superimposed optical vortex modes with similar distributions, an optical vortex mode detection method based on an attention pyramid convolutional neural network (AP-CNN) was proposed [9].
Deep learning requires large datasets and long training times. Although model accuracy can be very high, the number of parameters and the amount of computation are also large. Model compression has therefore gradually become popular in both academia and industry, especially for low-resource devices such as mobile internet and Internet of Things terminals. How to obtain an efficient deep learning model that meets the real-time and low-power requirements of low-resource devices has attracted the interest of many scholars.
To obtain an efficient deep learning model, research generally proceeds in two directions. One is the construction of efficient network modules, including manually designed lightweight models (such as MobileNet [10] and ShuffleNet [11]) and automated network design based on neural architecture search (NAS) [12]. The other is model compression and acceleration techniques, including pruning, quantization, convolutional kernel compression, and knowledge distillation. In 2006, Bucilua et al. first proposed transferring the knowledge learned by large-scale models to small-scale models [13]. In 2015, Hinton et al. formally put forward the idea of knowledge distillation [14]. The FitNets algorithm proposed by Romero et al. introduced intermediate representations for the first time, directly matching the feature activations of the teacher and student models so that the student can imitate the teacher's intermediate feature extraction ability [15]. Li et al. extract more discriminative features from teacher models through supervised learning so that student models pay more attention to these features, thus improving their performance [16]. Yim et al. propose using the FSP matrix of the teacher model to guide the training of the student model, focusing only on the relational knowledge between different network layers for each sample [17]. Focusing on the relational knowledge between data samples, knowledge distillation based on sample angle and distance relationships was proposed in [18]. In 2020, Bajestani et al. used the relational knowledge of related tasks to imitate human vision and transferred the temporal dependence of the teacher model to the student model for target detection tasks [19]. By contrasting objective functions and learning the structural knowledge of the teacher model, the relationship between structural feature knowledge and higher-order outputs is obtained in [20]. Xu et al. use adversarial learning to optimize global predictions with teacher and student models of the same structure [21].
Deep Mutual Learning (DML) [22] is a kind of online distillation in which training uses response-based knowledge. Through simultaneous peer-to-peer learning between networks, the final accuracy of each network is better not only than that of individual learning, but also than that of training guided by a pre-trained large-scale teacher model, as in classical knowledge distillation. Because the DML algorithm only uses the logits distribution of each network during training and only concerns recognition accuracy, it is applicable to networks of any size and structure; even heterogeneous queues composed of networks of different sizes can learn through mutual distillation.
A free-space optical vortex communication system must be deployable at the terminal and operate at high speed. To ensure the accuracy of OAM state detection, the detection network has high complexity, so there is a serious contradiction between practical requirements and the algorithm. At the same time, the light intensity distributions of some different superposed vortex beams are very similar, and inter-class dark knowledge is of great significance for improving OAM state detection accuracy. Therefore, OAM state detection combined with knowledge distillation is of high research value.
To detect the OAM state with classical knowledge distillation, a large-scale, high-precision OAM state detection network must be trained in advance as the teacher, and knowledge is then transferred one way to guide the training of the student network. This scheme is divided into two stages, cannot be trained end-to-end, and is cumbersome in practice. An optical communication system must handle different transmission conditions, such as turbulence intensity and transmission distance, so an OAM state detection network that requires two-stage training is difficult to apply. Therefore, online distillation is selected in this paper, and an optical vortex OAM state detection technique based on DML is proposed. In addition, because the mutual learning queue can be expanded to include multiple networks, and the performance after training improves as more networks join the queue, this also provides a new direction for further improving OAM state detection accuracy.
The remainder of this paper is organized as follows. Section II presents the principle of DML and the framework of OAM state detection based on DML. The experiments and results are discussed in Section III. Section IV concludes the paper.

Principle of Deep Mutual Learning
DML is a type of online distillation [14,22] that uses response-based knowledge for distillation training. DML consists of a group of untrained networks that are trained simultaneously. The final training effect of each network is not only better than that of individual learning based on traditional supervised learning, but also better than one-way guided training with an already trained large-scale teacher network, as in classical knowledge distillation.
The training set contains samples of M classes, with N samples per class; it can be represented as $X = \{x_i\}$, and the corresponding set of category labels as $Y = \{y_i\}$ with $y_i \in \{1, 2, \ldots, M\}$. When a sample $x_i$ is input into the network $\Theta_1$, the output soft target (probability) for class $m$ is

$$p_1^m(x_i) = \frac{\exp(z_1^m)}{\sum_{m=1}^{M} \exp(z_1^m)},$$

where $z_1^m$ denotes the $m$-th logit fed to the softmax layer. The loss function of the DML algorithm consists of two parts: a supervised loss and an imitation loss. For the network $\Theta_1$ in a multi-category image classification task, the supervised loss is chosen as the cross-entropy between the predicted values and the true labels:

$$L_{C_1} = -\sum_{i=1}^{N}\sum_{m=1}^{M} I(y_i, m)\,\log\!\big(p_1^m(x_i)\big),$$

where the indicator $I(y_i, m)$ is defined as

$$I(y_i, m) = \begin{cases} 1, & y_i = m \\ 0, & y_i \neq m. \end{cases}$$

The imitation loss uses the Kullback-Leibler (KL) divergence to quantify how well the network $\Theta_1$ matches the network $\Theta_2$, so that the output probability distributions of the two networks become as similar as possible. To improve the generalization performance of the network $\Theta_1$, the posterior probability of the network $\Theta_2$ is used to help its training. The KL divergence from $p_2$ to $p_1$ is

$$D_{KL}(p_2 \,\|\, p_1) = \sum_{i=1}^{N}\sum_{m=1}^{M} p_2^m(x_i)\,\log\frac{p_2^m(x_i)}{p_1^m(x_i)}.$$

The total loss function of the network $\Theta_1$ is therefore

$$L_{\Theta_1} = L_{C_1} + D_{KL}(p_2 \,\|\, p_1),$$

and symmetrically $L_{\Theta_2} = L_{C_2} + D_{KL}(p_1 \,\|\, p_2)$. The DML algorithm can be extended to a larger number of networks. When the number of networks is K, each network takes the remaining K−1 networks as its teachers; for a network $\Theta_k$, the total loss function is

$$L_{\Theta_k} = L_{C_k} + \frac{1}{K-1}\sum_{\substack{l=1 \\ l \neq k}}^{K} D_{KL}(p_l \,\|\, p_k).$$
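As a concrete illustration, the combined loss above can be sketched in PyTorch. This is a minimal sketch under our own naming (the function `dml_loss` and its argument layout are not from the paper); peer outputs are detached so that gradients flow only into the network being updated.

```python
# Sketch of the DML loss for one network in a queue of K networks.
import torch
import torch.nn.functional as F

def dml_loss(logits_k, peer_logits, targets):
    """Total loss for network k: cross-entropy plus the mean KL divergence
    from each peer's distribution p_l to this network's distribution p_k.

    logits_k:    (B, M) raw outputs of network k
    peer_logits: list of (B, M) raw outputs of the K-1 peer networks
    targets:     (B,) ground-truth class indices
    """
    supervised = F.cross_entropy(logits_k, targets)        # L_Ck
    log_p_k = F.log_softmax(logits_k, dim=1)
    imitation = 0.0
    for z_l in peer_logits:
        p_l = F.softmax(z_l.detach(), dim=1)               # teacher side is fixed
        # batchmean reduction gives D_KL(p_l || p_k) averaged over the batch
        imitation = imitation + F.kl_div(log_p_k, p_l, reduction="batchmean")
    imitation = imitation / len(peer_logits)               # 1/(K-1) factor
    return supervised + imitation
```

During training, each network in the queue computes this loss with the others' current logits and performs its own optimizer step, so all networks learn simultaneously.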

OAM Mode Recognition Based on DML
In the OAM state detection task, AP-CNN [9] has good detection performance. The correct label bit in the output probability distribution (soft label) is close to 1, and the other, erroneous label bits are very small. However, the probabilities corresponding to different misclassification labels may differ greatly, and in a trained mode recognition network the generalization information is concentrated on these near-zero misclassification label positions. We take the simulation results of a Laguerre-Gaussian vortex beam as an illustration. For the optical vortex with OAM = {4, −4}, the probabilities of being wrongly classified as OAM = {2, −6} or as OAM = {6, −6} are both very small but differ from each other. In classical knowledge distillation, because the teacher model has completed training, the probabilities corresponding to misclassification labels are very small, and the student model cannot effectively learn this knowledge from the teacher's output probability distribution. Classical knowledge distillation therefore flattens the mapping curve of the softmax layer by introducing a temperature (T) parameter: with the relative ordering unchanged, raising the temperature reduces the probability of the correct label and increases the probabilities of the misclassified labels. The soft labels of the teacher model after raising the temperature are shown in Figure 1: the label bit corresponding to OAM = {4, −4} is 0.6, the label bit corresponding to OAM = {6, −6} is 0.1, and the label bit corresponding to OAM = {2, −6} is 0.3. It can be seen that the light intensity distributions of OAM = {4, −4} and OAM = {2, −6} are relatively close, which easily causes errors. Taking soft labels as the training target provides greater information entropy and more effective learning for the student model. In the DML algorithm, since no network is trained in advance and all networks learn simultaneously, the probability distribution of each network's output remains relatively flat during training, and the other networks can effectively learn this inter-class dark knowledge through the imitation loss based on KL divergence. The effect is the same as that of increasing the T parameter in classical knowledge distillation, so DML does not need to introduce the T parameter. The transfer of inter-class dark knowledge between OAM superposition states among the networks is shown in Figure 2.

The framework of OAM mode recognition based on DML is shown in Figure 3: two networks (AP-CNN and CNN) are selected for the DML queue. The selected CNN contains three convolutional network layers and two fully connected layers, as shown in Figure 4. Each convolutional network layer consists of a convolutional layer, a batch normalization layer, and a maximum pooling layer, connected by rectified linear units (ReLU); each layer uses Dropout with the probability set to 0.5. The convolutional layer of the first convolutional network layer contains 16 convolutional kernels of size 5×5; the second contains 32 kernels of size 3×3; and the third contains 64 kernels of size 3×3. All three convolutional network layers use a maximum pooling layer of size 2×2 with stride 2.
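The temperature mechanism of classical distillation discussed above can be illustrated numerically. This is a toy example: the logit values are invented for demonstration and do not come from the paper's networks.

```python
# How the temperature T flattens a softmax distribution while preserving
# the ranking of the classes (classical knowledge distillation).
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Invented logits for three classes, e.g. {4,-4}, {6,-6}, {2,-6}
logits = np.array([6.0, 2.0, 4.0])
p_sharp = softmax_with_temperature(logits, T=1.0)   # near one-hot
p_soft = softmax_with_temperature(logits, T=4.0)    # flattened

# Raising T keeps the argmax but shrinks the top probability and boosts
# the small misclassification probabilities -- the dark knowledge a
# student can learn from.
```

In DML the same flattening arises naturally, because the peers' distributions are still far from one-hot during training, which is why no T parameter is needed.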
DML is an algorithm of mutual learning between networks: AP-CNN and CNN act as each other's teacher and student. Since the ultimate goal of this paper is to obtain a small OAM state detection CNN with high detection accuracy, we focus on the one-way process with AP-CNN as teacher and CNN as student. The loss function of the CNN consists of two parts: one is the difference between the OAM state detection probability distributions of the CNN and the AP-CNN, and the other is the difference between the OAM state detection result of the CNN and the true OAM state label. The reverse process unfolds analogously.
In the specific training process, as shown in Figure 4, the size of the light intensity distribution map of the OAM beam is set to 128×128×3. After the three convolutional network layers, the output feature map has size 16×16×64. The feature map is fed to the two subsequent fully connected layers; the output of the first fully connected layer has 500 values, and the output of the second has 8 values, corresponding to the 8 OAM mode categories. The softmax function then activates these outputs to produce the final OAM mode recognition result. The initial learning rate is set to 0.01, the learning rate decreases by 10% every 10 iterations, and 50 epochs are trained. At the beginning of training, the parameters of each network in the queue are randomly initialized, and the output probability distributions are close to uniform. For the CNN, the supervised loss is large and the imitation loss is small, so training is mainly guided by the supervised loss, and the parameters are updated mainly with the true labels as the target. The OAM mode recognition performance of both the CNN and the AP-CNN improves continuously at this stage. However, because the AP-CNN and the CNN are initialized differently and the representational knowledge learned during training may differ, their output probabilities are not necessarily the same, and the imitation loss between them keeps increasing. At that point, the CNN also learns knowledge from the logits distribution of the AP-CNN, combining the true labels and the AP-CNN's logits distribution as the training target.
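The small CNN described above can be sketched in PyTorch as follows. The layer counts, kernel sizes, pooling, dropout probability, and the 500/8 fully connected widths follow the text; details the text does not specify, such as same-padding in the convolutions, are our assumption (chosen so that three 2×2 poolings map 128×128 to the stated 16×16×64 feature map).

```python
# Sketch of the small OAM mode recognition CNN (3 conv blocks + 2 FC layers).
import torch
import torch.nn as nn

class SmallOAMCNN(nn.Module):
    def __init__(self, num_classes=8):
        super().__init__()

        def block(c_in, c_out, k):
            # conv -> batch norm -> ReLU -> 2x2 max pool (stride 2) -> dropout
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
                nn.Dropout(p=0.5),
            )

        self.features = nn.Sequential(
            block(3, 16, 5),    # 128x128x3  -> 64x64x16  (16 kernels, 5x5)
            block(16, 32, 3),   # 64x64x16   -> 32x32x32  (32 kernels, 3x3)
            block(32, 64, 3),   # 32x32x32   -> 16x16x64  (64 kernels, 3x3)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 64, 500),  # first FC layer: 500 outputs
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(500, num_classes),   # second FC layer: 8 OAM categories
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

A softmax over the 8 output logits (applied inside the loss during training) yields the final recognition probabilities.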

Simulation Data Set Construction
The simulation dataset is constructed as shown in Figure 5.

Analysis of OAM Mode Recognition Results Based on DML
The CNN was trained directly under the different turbulence intensities mentioned above, and the CNN and AP-CNN were also trained simultaneously by DML, acting as each other's students and teachers. The variation in accuracy of the two networks over the training process in the two turbulence environments is shown in Figure 6. The improvement in accuracy of the CNN in the DML queue compared with separate learning is shown in Table 1, where DML-Ind denotes the difference in OAM mode recognition accuracy between DML and separate learning. Under weak turbulence, the accuracy of OAM mode recognition reaches 95.3% when the CNN is trained alone, while the DML-based technique reaches 96.4%, an improvement of 1.1%. Under strong turbulence, the recognition accuracy is only 90.4% when the CNN is trained alone, while the DML-based accuracy improves to 92.9%, an improvement of 2.5%; the improvement is more obvious than in the weak turbulence case. Compared with independent learning, the OAM state detection accuracy of the large-scale network AP-CNN is also improved, by 0.2% and 0.7% in the two turbulence environments.
Network complexity includes two parts, spatial complexity and temporal complexity, with the number of model parameters representing the spatial complexity and the amount of model computation representing the temporal complexity. As can be seen from Table 2, the small CNN trained by DML has only 8.22M parameters and a computation volume of only 52.50M, much smaller than those of the AP-CNN, so the network complexity is significantly reduced. Combined with the results in Table 1, the DML-trained CNN loses only 1.3% and 1.5% recognition accuracy compared with AP-CNN under the two turbulence conditions, respectively. The experimental results show that the proposed OAM mode recognition scheme effectively alleviates the contradiction between the complexity of the recognition network and mobile optical communication deployment while maintaining recognition accuracy. To further improve the recognition accuracy of the small CNN, the scalability of the number of networks in the DML queue can be exploited. The third network added is MobileNet, a lightweight network that is easy to train. The accuracy of the small CNN over the training process is compared in Figure 7, where DML_2 indicates that the queue contains two networks (CNN and AP-CNN) and DML_3 indicates that it contains three networks (CNN, AP-CNN, and MobileNet); DML-Ind represents the difference between DML_2 or DML_3 and the independently trained CNN. The results in Table 3 show that when MobileNet is added to the DML queue, the accuracy of the small CNN is further improved by 0.5%.
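The parameter counts reported in Table 2 can be measured for any PyTorch model with a small utility like the following. This is a generic sketch of our own, not the paper's code; the sanity check uses a single convolutional layer whose parameter count is known in closed form.

```python
# Measuring spatial complexity (trainable parameter count) of a model.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Number of trainable parameters; divide by 1e6 for the 'M' figures."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Sanity check on one layer: a 3-to-16-channel 5x5 convolution has
# 16*3*5*5 weights plus 16 biases = 1216 parameters.
conv = nn.Conv2d(3, 16, kernel_size=5)
```

Temporal complexity (multiply-accumulate counts) additionally depends on the input resolution and is usually measured with a profiling pass rather than a parameter sum.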

Conclusions
This paper proposes an OAM state detection technique based on DML, which resolves the contradiction between the high complexity of the detection network needed to ensure OAM detection accuracy and the terminal deployment requirements of the optical communication system. First, the principle of DML is introduced. Then, the importance of inter-class dark knowledge in the OAM state detection task is analyzed, and the framework of OAM state detection based on DML is presented, including the selection of the networks in the DML queue, the specific parameter settings, and the training process. Experiments compare the OAM state detection accuracy of the small CNN trained by DML and by independent learning, as well as the complexity of the large AP-CNN and the small CNN. The MobileNet network is then added to the mutual learning queue, and the scalability of the number of networks in the queue is used to further improve the detection accuracy. The results show that the OAM state detection technique based on DML proposed in this paper improves the detection accuracy of the small CNN while greatly reducing the network complexity.

For OAM = {4, −4}, the probability of being wrongly classified as OAM = {6, −6} is 5×10⁻³. Here, OAM = {4, −4} indicates the superimposed mode of optical vortices with topological charges 4 and −4, and the rest are defined analogously. After transmission at different turbulence intensities and distances, the misclassification probabilities change accordingly. These small misclassification probabilities indicate whether the light intensity distribution of OAM = {4, −4} is closer to that of OAM = {2, −6} or to that of OAM = {6, −6}. Because the values are very small, their impact on the objective function is very small, and this dark knowledge can easily be lost during training. Knowledge distillation can effectively retain this inter-class dark knowledge and ensure the OAM state detection accuracy of the network.

Figure 1. Transfer of dark knowledge between classes in classical knowledge distillation when detecting superimposed vortex light.

Figure 2. Transfer of dark knowledge between classes in deep mutual learning when detecting superimposed vortex light.

Figure 3. Framework of OAM mode recognition technology based on DML.

Figure 4. Structure of the CNN-based OAM mode recognition network.

As shown in Figure 5, we select four pairs of OAM modes: {1, −2} and {1, −2, −5}; {1, −2, 3, −5} and {−2, 3, −5}; {4, −4} and {2, −6}; {6, −6} and {9, −3}. The light intensity distributions of the multi-mode OAM beams within each column of Figure 5 are similar. The wavelength of the OAM communication system is 0.6328 μm. Comparison experiments are carried out under different turbulence conditions, i.e., different atmospheric refractive index structure constants ($C_n^2$) and transmission distances. Six transmission distances are chosen: 500 m, 1000 m, 1500 m, 2000 m, 2500 m, and 3000 m. When simulating the atmospheric turbulence channel, the power spectrum inversion method is used, dividing the transmission distance into segments to obtain ten phase screens at certain intervals. For each transmission condition, 2000 light intensity maps are generated for each OAM mode, so the hybrid dataset contains a total of 16,000 light intensity distribution maps of OAM beams. The dataset is divided into a training set and a test set at a ratio of 8:2 (12,800 training images and 3200 test images).
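The power spectrum inversion method mentioned above can be sketched as follows: complex Gaussian noise is weighted by the square root of the turbulence phase power spectral density and transformed back to the spatial domain. This is a minimal, illustrative numpy sketch using the standard Kolmogorov phase PSD $0.023\, r_0^{-5/3} \kappa^{-11/3}$; the grid size, grid spacing, and Fried parameter `r0` are placeholder values (not the paper's), and normalization conventions vary between references.

```python
# One Kolmogorov phase screen via power spectrum inversion (FFT method).
import numpy as np

def phase_screen(n=256, delta=0.002, r0=0.05, seed=0):
    """n x n phase screen in radians.

    delta: spatial grid spacing [m]; r0: Fried parameter [m].
    """
    rng = np.random.default_rng(seed)
    dk = 2 * np.pi / (n * delta)                 # frequency grid spacing [rad/m]
    k = np.fft.fftfreq(n, d=delta) * 2 * np.pi   # angular spatial frequencies
    kx, ky = np.meshgrid(k, k)
    kappa = np.hypot(kx, ky)
    kappa[0, 0] = 1.0                            # avoid division by zero at DC
    psd = 0.023 * r0 ** (-5.0 / 3.0) * kappa ** (-11.0 / 3.0)
    psd[0, 0] = 0.0                              # remove the unphysical piston term
    noise = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    screen = np.fft.ifft2(noise * np.sqrt(psd) * dk) * n * n
    return screen.real
```

In a split-step channel simulation, the beam would be propagated between such screens placed at intervals along the transmission distance; here we only show the screen generation itself.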

Figure 5. Light intensity distributions of similar multi-mode OAM beams.

Figure 6. Variation in the recognition accuracy of the two networks over the training process.

It can be seen that with the increase in the number of networks in the DML queue, the small CNN can learn different knowledge from the different networks and its accuracy is further improved, which provides a new idea for obtaining a high-precision OAM mode recognition network in the future.

Figure 7. Variation in the recognition accuracy of the small CNN with different DML network queues.

Table 3. Comparison of the recognition accuracy of the small CNN with different DML network queues.

Table 1. Accuracy of OAM mode recognition by DML and individual learning.

Table 2. Network complexity of the small CNN compared with the large AP-CNN.