Lightweight Image Denoising Network for Multimedia Teaching System

: Due to COVID-19


Introduction
Traditional teaching requires students to learn face to face. Although it is effective, it places high demands on students in terms of time and place. Online education has been developed to break these limitations, and it depends mainly on a multimedia system (platform) to complete teaching tasks. Images obtained in the multimedia system are an important medium for human-to-human interaction. However, these images are often degraded by noise caused by camera shake, hardware quality, weather [1], etc. In particular, when teaching resources are collected and disseminated, teaching images often contain noise from the collection equipment. Image denoising techniques are applied to address these drawbacks.
Image denoising is a classical low-level vision technique that has been applied in various fields, e.g., activity recognition [2] and remote sensing [3]. For instance, expected patch log likelihood (EPLL) [4] used a Gaussian mixture model to learn prior knowledge from many natural image blocks for image denoising. Block matching and three-dimensional filtering (BM3D) [5] applied collaborative filtering to similar two-dimensional image blocks to remove noise. The weighted nuclear norm minimization (WNNM) algorithm exploits an image's non-local self-similarity to extract more information for image denoising [6]. Although these methods can restore images, they face a common challenge: they rely heavily on manually tuned and complex parameters. Owing to their strong expressive ability, convolutional neural networks (CNNs) are effective feature extractors, and they have therefore been applied to image denoising. For instance, the denoising convolutional neural network (DnCNN) first utilized convolution and residual learning operations to remove noise [7].
To suppress the influence of the background on noise, an attention mechanism was fused into a CNN to separate background from foreground [8]. To address image denoising in complex scenes, a dynamic convolution was used in a CNN to obtain a denoiser that adapts to different noisy images [9]. To obtain a better denoising effect, a combination of an omni-dimensional dynamic convolution and attention mechanisms can be integrated into a CNN to enhance the expressive ability of a denoising network, which can improve the interaction quality between students and teachers in a multimedia teaching system. Le et al. used two phases, i.e., a feature augmentation stage and a feature refinement stage, to design a CNN that extracts more accurate structural information for image denoising [10]. To reduce the complexity of a denoiser, Lin et al. used a simplified residual spatial-spectral module and knowledge distillation to obtain a lightweight method that accelerates noise removal [11]. Alternatively, a combination of a non-local algorithm and a residual CNN yields a lightweight CNN to suppress noise [12].
In this paper, we present a lightweight image denoising network, LIDNet, for multimedia teaching systems. LIDNet uses a parallel sub-network to mine complementary information for image denoising. To achieve an adaptive CNN, a dynamic convolution that depends on kernel information and on the numbers of input and output channels is fused into the upper network, so that parameters are adjusted automatically for different noisy input images, which yields a robust CNN. This also enlarges the differences between the two branch architectures, which can improve the denoising effect. To refine the obtained structural information, a serial network is set behind the parallel network. To extract more salient information, an adaptively parametric rectified linear unit (APReLU), composed of an attention mechanism and a ReLU, is used in LIDNet. Experiments show that our proposed method is effective in image denoising, which may also provide assistance for multimedia teaching systems.
The contributions of the proposed method can be summarized as follows:
1. A dynamic convolution based on kernel information and the numbers of input and output channels is used to adaptively mine more useful information, according to different input images.
2. A combination of an attention mechanism and a ReLU is set behind each convolutional layer except the final convolutional layer, to make the distributions of training samples more consistent in pursuit of better denoising performance.
3. Our denoising method is useful for enhancing the interaction quality of a multimedia teaching system between teacher and student.
The remainder of this paper is organized as follows. Section 2 lists related work about image denoising based on dual networks and dynamic networks. Section 3 provides detailed information about the proposed method. Section 4 presents an analysis of our proposed method and its results. Section 5 gives the conclusion of this paper.

Related Work

A Dual Network for Image Denoising
To extract complementary information, dual networks have been developed for image denoising [13]. For instance, Tian et al. [13] presented a dual denoising network with a sparse mechanism, called DudeNet, to extract complementary information and enhance denoising effects. Alternatively, Bai et al. [14] built a dual network from an encoder-decoder and a channel attention architecture to extract local and non-local information for image denoising, where image spatial details and semantic information are obtained by a criss-cross attention. To extract more information, Holla et al. [15] used edge information to design a CNN that captures high-frequency information in image denoising. Zhang et al. fused different masks into a CNN to obtain complementary information for suppressing noise [16]. To mine more high-frequency information, Qiao et al. combined two different networks and a sharpening loss function to improve the visual quality of denoised images [17]. Liu et al. used a wavelet decomposition technique to build a wide CNN that prevents vanishing and exploding gradient problems [18]. To extract salient noise information, Chen et al. fused a CNN and a transformer into a parallel network that extracts structural information and key information based on pixel relations to improve denoising effects [19]. For noisy medical images, Jiang et al. [20] used residual connections and dilated convolutions to build a heterogeneous dual network that mines more complementary information to suppress noise. These works show that dual networks are useful for image denoising. Inspired by that, we design a dual network architecture for image denoising in this paper.

Dynamic Networks for Image Denoising
To enhance the robustness of image denoisers, dynamic networks have been proposed [21]. For instance, Song et al. [21] combined dynamic convolutions and residual learning operations in a CNN that dynamically adjusts its parameters according to different input images, which yields a robust denoising network. Du et al. [22] exploited a dynamic attention mechanism to better extract salient information for image denoising. Alternatively, Shen et al. [23] fused a spatial module and dynamic convolution to capture more spatial context information and obtain better denoising performance. Tian et al. [9] used dynamic convolution and wavelet transform to extract more useful information and improve denoising effects. These works show that dynamic convolution is effective in image denoising. Motivated by that, we use a dynamic convolution in this paper that adapts to different kernel and channel information.

Network Architecture
The proposed 17-layer LIDNet combines a parallel and a series architecture. The parallel architecture is composed of a 6-layer block called the dynamic feature extraction block (DFEB) and a 6-layer block named the complementary feature extraction block (CFEB). The series architecture contains an 11-layer block called the cascaded purification block (CPB), which is shown in Figure 1. DFEB uses a dynamic convolutional layer to adaptively extract structural information, including kernel information and channel information. CFEB uses several stacked convolutional layers, BN, and a combination of an attention mechanism and an activation function to extract complementary salient information. A residual learning operation is used to connect the information obtained from the two branches of the parallel network. To prevent over-enhancement, the 11-layer CPB is placed behind the parallel network. To construct a clean image, a residual learning operation acts between the input image and the output image of LIDNet. This process is expressed as Equation (1).
In Equation (1), I_C represents the output of LIDNet, which is regarded as the denoised image, I_N denotes the input noisy image, and LIDNet() expresses the function of LIDNet. DFEB, CFEB, and CPB stand for the functions of DFEB, CFEB, and CPB, respectively. + is a residual learning operation, which is also shown as ⊕ in Figure 1. Furthermore, the MSE loss function of LIDNet is introduced in Section 3.2.
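A form of Equation (1) that is consistent with the symbol definitions given for it can be sketched as follows. This is a reconstruction under assumptions, not a verbatim equation: the two branch outputs are assumed to be fused with +, and the global residual connection is assumed to add the input image to the CPB output.

```latex
\begin{equation}
I_C = \mathrm{LIDNet}(I_N)
    = \mathrm{CPB}\bigl(\mathrm{DFEB}(I_N) + \mathrm{CFEB}(I_N)\bigr) + I_N
\end{equation}
```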

Loss Function
For a fair comparison with the well-known denoising benchmark DnCNN, the mean squared error (MSE) [24] is chosen as the loss function to train LIDNet. Specifically, MSE uses pairs (I_N^i, I_C^i) (1 ≤ i ≤ n) to train LIDNet in a supervised way, where I_N^i and I_C^i are defined as the i-th noisy and clean image, respectively, and n represents the number of image pairs in the training dataset. LIDNet also uses the popular Adam optimizer [25] to obtain reasonable parameters. In the mathematical expression of the loss function, L denotes the MSE loss and θ stands for the learned parameters.
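With these symbols, the MSE objective can be written as below. The 1/(2n) normalization is an assumption borrowed from the convention used for DnCNN; a plain 1/n average differs only by a constant factor and leads to the same minimizer.

```latex
\begin{equation}
L(\theta) = \frac{1}{2n} \sum_{i=1}^{n}
            \bigl\| \mathrm{LIDNet}\bigl(I_N^{i}\bigr) - I_C^{i} \bigr\|^{2}
\end{equation}
```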

Dynamic Feature Extraction Block
The first layer in DFEB consists of a convolution and a rectified linear unit (ReLU) [26] operation. The following 4 layers are each composed of a convolution, a batch normalization (BN) operation, and an adaptively parametric rectified linear unit (APReLU) [27]. The final layer has an omni-dimensional dynamic convolution (ODConv) [28], BN, and APReLU. In terms of parameter setting, the input channel number of the first convolution is the same as the number of channels of the input image: it is 3 for a color image and 1 otherwise. All other input and output channel numbers are 64. All convolutional kernels in LIDNet are of size 3 × 3. The output of the DFEB is fused with the output of the CFEB via a concatenation connection. The DFEB can also be expressed mathematically as a composition of these layer operations.

Complementary Feature Extraction Block
The lower branch of the parallel architecture is CFEB, which is responsible for extracting complementary features with a different network architecture. The first layer in CFEB consists of a convolution and a ReLU operation. The following 5 layers are each composed of stacked convolution, BN, and APReLU operations. As shown in Figure 1, the difference between DFEB and CFEB lies mainly in the last convolutional layer. Specifically, the CFEB uses a standard convolution in place of the ODConv used as the final layer in the DFEB. The input and output channel numbers of the final convolutional layer are both 64. The CFEB can be expressed mathematically as a composition of its layers, in which O_CFEB is the output of CFEB, CFEB() expresses the function of CFEB, Conv_1 means the first layer of the CFEB, and 5Conv_2 means the 5 stacked layers that form the second to sixth layers of the CFEB. 5Conv_2 is equal to 5APReLU(5BN(5Conv())).

Concatenated Purification Block
To refine the fused structural information from DFEB and CFEB, CPB is set as the last part of LIDNet. Specifically, the first 10 layers of CPB are each composed of convolution, BN, and APReLU operations, and the last layer is a single standard convolution, which is used to construct clean images. To construct a clean image, a residual learning operation acts between the input image and the output image of LIDNet. The numbers of input and output channels are 64, except for the output channel number of the final convolutional layer, which equals the number of channels of the input image. The CPB can be expressed mathematically as a composition of its layers, in which O_CPB is the output of CPB, CPB() expresses the function of CPB, 10Conv_1 means the 10 stacked layers that form the first to tenth layers of CPB, and Conv_11 means the last layer of the CPB. 10Conv_1 is equal to 10APReLU(10BN(10Conv())).
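Using the notation defined for the three blocks, their expressions can be sketched as follows. This is a reconstruction from the symbol definitions rather than a verbatim set of equations. In particular, the fusion of O_DFEB and O_CFEB is written with + as an assumption; the text also describes this fusion as a concatenation.

```latex
\begin{align}
O_{DFEB} &= \mathrm{DFEB}(I_N)
          = \mathrm{Conv}_6\bigl(4\mathrm{Conv}_2\bigl(\mathrm{Conv}_1(I_N)\bigr)\bigr) \\
         &= \mathrm{APReLU}\bigl(\mathrm{BN}\bigl(\mathrm{ODConv}\bigl(
            4\mathrm{APReLU}\bigl(4\mathrm{BN}\bigl(4\mathrm{Conv}\bigl(
            \mathrm{ReLU}(\mathrm{Conv}(I_N))\bigr)\bigr)\bigr)\bigr)\bigr)\bigr) \notag \\
O_{CFEB} &= \mathrm{CFEB}(I_N)
          = 5\mathrm{Conv}_2\bigl(\mathrm{Conv}_1(I_N)\bigr) \\
         &= 5\mathrm{APReLU}\bigl(5\mathrm{BN}\bigl(5\mathrm{Conv}\bigl(
            \mathrm{ReLU}(\mathrm{Conv}(I_N))\bigr)\bigr)\bigr) \notag \\
O_{CPB}  &= \mathrm{CPB}\bigl(O_{DFEB} + O_{CFEB}\bigr)
          = \mathrm{Conv}_{11}\bigl(10\mathrm{Conv}_1\bigl(O_{DFEB} + O_{CFEB}\bigr)\bigr) \\
         &= \mathrm{Conv}\bigl(10\mathrm{APReLU}\bigl(10\mathrm{BN}\bigl(
            10\mathrm{Conv}\bigl(O_{DFEB} + O_{CFEB}\bigr)\bigr)\bigr)\bigr) \notag
\end{align}
```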

Experiments

Datasets
The video quality of many courses inevitably declines due to the impact of the environment and equipment during shooting. To achieve better performance in multimedia, we propose LIDNet to denoise these teaching images. The architecture of our LIDNet is shown in Figure 1.
For image denoising with Gaussian noise, 400 images with sizes of 180 × 180 from Ref. [29] are used to train a denoising model. Separate denoising models are trained for noise levels of 15 and 25. A blind denoising model is also trained with noise levels ranging from 0 to 55. Training patches are of size 40 × 40.
To fairly test denoising performance, the public BSD68 [30], Set12 [31], and Kodak24 [32] datasets and educational images collected from the Internet are used as test datasets. Gaussian noise with noise levels of 15 and 25 is added to BSD68, Set12, Kodak24, and the collected educational images to test the denoising performance of the proposed method.
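The data preparation described above can be illustrated with a minimal standard-library sketch. It shows how additive Gaussian noise at a given level might be synthesized and how many 40 × 40 patches a 180 × 180 image yields. The helper names, the stride of 10, the unclipped noisy values, and the flat gray test image are all assumptions made for illustration; the text does not specify them.

```python
import random


def add_gaussian_noise(image, sigma, seed=None):
    """Return a noisy copy of a grayscale image (rows of values in 0..255).

    Zero-mean Gaussian noise with standard deviation `sigma` is added to
    every pixel. Values are not clipped here (an assumption).
    """
    rng = random.Random(seed)
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


def count_patches(height, width, patch=40, stride=10):
    """Count patch x patch crops on a regular grid.

    The stride is a hypothetical choice; only the patch size is given.
    """
    if height < patch or width < patch:
        return 0
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    return rows * cols


# A flat gray 180 x 180 image stands in for one training image.
clean = [[128.0] * 180 for _ in range(180)]
noisy = add_gaussian_noise(clean, sigma=25, seed=0)
patches_per_image = count_patches(180, 180)
```

For blind denoising, the same helper could be called with a noise level drawn at random from the range 0 to 55 for each training sample.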

Parameter Setting
The experimental settings are as follows. The number of training epochs is 180. The initial learning rate is 1 × 10^−3, and it is multiplied by 0.2 at epochs 30, 60, and 90. The batch size is set to 128. Adam is used to optimize parameters [25], where β_1 is 0.9 and β_2 is 0.999. More parameters can be found in Ref. [13].
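The learning-rate schedule can be sketched as a small step function. The function name is hypothetical, and two points are assumptions: that the factor of 0.2 is applied cumulatively at each milestone, and that epochs are counted from 0.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(30, 60, 90), gamma=0.2):
    """Step schedule: multiply the base rate by `gamma` once per milestone reached."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed


# Learning rate at the start of each phase of the 180-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 30, 60, 90, 179)}
```

Under these assumptions the rate is 1 × 10^−3 for epochs 0 to 29, then 2 × 10^−4, 4 × 10^−5, and finally 8 × 10^−6 from epoch 90 onward.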
LIDNet is trained on a PC with an Intel Xeon Gold 6330 processor and one Nvidia GeForce RTX 3090 GPU. Furthermore, all code runs on Ubuntu 20.04 with Python 3.8, PyTorch 1.11.0, and CUDA 11.7.

Network Analysis
This paper uses a parallel network architecture to extract complementary information for image denoising. The parallel network consists of an upper network (the dynamic feature extraction block, DFEB) and a lower network (the complementary feature extraction block, CFEB), and it is followed by a serial architecture (the concatenated purification block, CPB) that extracts more hierarchical structural information. Each branch in the parallel network is a six-layer stacked architecture. The upper network is composed of a Conv + APReLU, four Conv + BN + APReLU, and an ODConv + BN + APReLU, where APReLU [27], which is composed of an attention mechanism and a ReLU, is used to extract salient and nonlinear information. ODConv [28] utilizes convolutional kernel information and channel information to dynamically learn parameters, so that the denoising model adapts to different noisy images. In Table 1, 'LIDNet without global residual connection and ODConv + BN + APReLU' has an improvement of 0.013 dB compared to 'LIDNet with only Conv + APReLU in the upper network', which shows the effectiveness of the four Conv + BN + APReLU in the upper network for image denoising. The denoising effect of 'Conv + APReLU' in the upper network is verified by comparing 'The combination of lower network and CPB' with 'LIDNet with only Conv + APReLU in upper network and without global residual connection' in Table 1. To test the denoising performance of DFEB, we use 'The combination of lower network and CPB' and 'LIDNet without global residual connection' to conduct comparative experiments. As shown in Table 1, 'LIDNet without global residual connection' exceeds 'The combination of lower network and CPB' in terms of PSNR, which shows that DFEB in the parallel network is effective for image denoising. Additionally, to test the complementarity of the two sub-networks, we compare 'The combination of lower network and CPB' with 'CPB': the former is superior in Table 1, which shows the superiority of a parallel network for image denoising. To prevent interference between the upper and lower networks, a serial network is set behind the parallel network to refine the obtained structural information for image denoising. Finally, a global residual connection is employed between the outputs of the first layer in the lower network and the last layer in the CPB to construct clean images.

Comparisons with State of the Art
To test the effectiveness of the proposed method, we choose several popular denoising methods, namely EPLL, BM3D, WNNM, DnCNN, image restoration CNN (IRCNN) [33], fast and flexible denoising network (FFDNet) [34], and a cascade of shrinkage fields (CSF) [35], as comparative methods on BSD68 and Set12. As shown in Table 2, our LIDNet obtains the best denoising result on BSD68 for σ = 15 and σ = 25. For instance, our LIDNet has an improvement of 0.11 dB compared to IRCNN for σ = 15. That shows that our method is effective for gray noisy image denoising. To verify denoising performance on single gray noisy images, different methods are compared on Set12. As illustrated in Table 3, our LIDNet obtains the best denoising effect for single noisy image denoising. For instance, our LIDNet obtains an improvement of 0.09 dB compared to the popular WNNM for a noise level of 15, which shows that our method is a good denoising tool for low-frequency noisy image denoising. Our LIDNet obtains an improvement of 0.06 dB compared to the popular IRCNN for a noise level of 25, which shows that our method is a good denoising tool for high-frequency noisy image denoising. Accordingly, our method is effective for single noisy image denoising. To further demonstrate the denoising performance of our LIDNet on color images, Table 4 records the denoising results of different models at different noise levels. Compared with popular methods, namely IRCNN, FFDNet, D-BSN, FL(NLM), and FL(BM3D), our LIDNet also achieves improvements in denoising performance for color noisy images, which demonstrates the effectiveness of LIDNet in processing color noisy images. To comprehensively test the denoising effect of our proposed method, we use qualitative analysis to assess the visual quality of the denoised images. Specifically, we choose one area of the denoised images from BM3D, FFDNet, IRCNN, and LIDNet as the observation area. The clearer the observation area, the better the denoising performance of the corresponding method. As shown in Figures 2-4, the results of our LIDNet are clearer than those of other methods. In Figure 3, other methods produce more incorrect texture information. Because real noisy images are difficult to obtain, we add Gaussian noise to educational images to test the performance of the proposed method for educational image denoising. In Figure 4, our method recovers clearer detailed information from noisy educational images. Thus, our method is not only superior to other methods in terms of qualitative analysis, but also robust across different scenes in terms of image denoising.
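The quantitative comparisons above are reported as PSNR in dB. A minimal sketch of how PSNR can be computed for one image pair is given below, assuming 8-bit images with a peak value of 255 and images stored as lists of rows. The function names are illustrative and are not taken from the authors' code.

```python
import math


def mean_squared_error(reference, estimate):
    """Mean squared error between two equally sized images given as rows of pixels."""
    total = 0.0
    count = 0
    for ref_row, est_row in zip(reference, estimate):
        for ref_pixel, est_pixel in zip(ref_row, est_row):
            total += (ref_pixel - est_pixel) ** 2
            count += 1
    return total / count


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    error = mean_squared_error(reference, estimate)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)


# Tiny example: two pixels are off by 10 gray levels, so the MSE is 50.
clean = [[100, 100], [100, 100]]
restored = [[110, 90], [100, 100]]
score = psnr(clean, restored)
```

An average PSNR over a dataset such as BSD68 would then be the mean of the per-image scores.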

Conclusions
Multimedia teaching systems have become a popular tool for online education. However, images exchanged through a multimedia teaching system may suffer from noise. In this paper, we present a lightweight image denoising network, LIDNet, for multimedia teaching systems. LIDNet uses a parallel network to mine complementary information. To improve the robustness of the obtained denoiser, an omni-dimensional dynamic convolution is used in one sub-network of the parallel network to automatically adjust parameters and achieve an adaptive CNN. This also enlarges the differences between the branch architectures, which can improve the denoising effect. To refine the obtained structural information, a serial network is set behind the parallel network. To extract more salient information, an adaptively parametric rectified linear unit composed of an attention mechanism and a ReLU is used in LIDNet. Experiments show that our LIDNet is effective in image denoising, which can also provide assistance for multimedia teaching systems.
In the mathematical expression of the DFEB, O_DFEB is the output of DFEB and DFEB() expresses the function of DFEB. Conv_1 means the first layer of the DFEB, 4Conv_2 means the 4 stacked layers that form the second to fifth layers of the DFEB, and Conv_6 means the last layer of the DFEB. Conv stands for a convolutional operation, ReLU stands for the ReLU activation function, BN stands for the batch normalization operation, APReLU stands for the APReLU activation function, and ODConv stands for the ODConv operation. 4Conv_2 is equal to 4APReLU(4BN(4Conv())).

Table 1. Denoising results (average PSNR (dB)) of several networks on BSD68 for a noise level of 25.

Table 2. Average PSNR (dB) results of several networks on BSD68 for noise levels of 15 and 25.

Table 3. PSNR (dB) results of different methods on Set12 with noise levels of 15 and 25.

Table 4. Average PSNR (dB) results of different methods on CBSD68 and Kodak24 datasets with noise levels of 15 and 25.