Deep Learning Based on Fourier Convolutional Neural Network Incorporating Random Kernels

Abstract: In recent years, convolutional neural networks have been studied in the Fourier domain for limited environments, where results competitive with conventional image classification in the spatial domain can be expected. We present a novel efficient Fourier convolutional neural network in which a new activation function is used, the additional shift Fourier transformation process is eliminated, and the number of learnable parameters is reduced. First, the Phase Rectified Linear Unit (PhaseReLU) is proposed, which is equivalent to the Rectified Linear Unit (ReLU) in the spatial domain. Second, in the proposed Fourier network, the shift Fourier transform is removed since the process is inessential for training. Lastly, we introduce two ways of reducing the number of weight parameters in the Fourier network. The basic method is to use a three-by-three kernel instead of a five-by-five kernel in our proposed Fourier convolutional neural network. The second method uses a random kernel in our efficient Fourier convolutional neural network, where the standard deviation of the Gaussian distribution is used as the weight parameter. In other words, since only two scalars per channel are required, one for the real component and one for the imaginary component, the kernel is represented compressively with a very small number of parameters. In experiments on shallow networks, such as LeNet-3 and LeNet-5, our method achieves accuracy competitive with conventional convolutional neural networks while dramatically reducing the number of parameters. Furthermore, our proposed Fourier network, using a basic three-by-three kernel, mostly achieves higher accuracy than traditional convolutional neural networks in both shallow and deep neural networks. Our experiments show that the presented kernel methods have the potential to be applied to all architectures based on convolutional neural networks.


Introduction
The convolutional neural network (CNN) is the most basic neural network for solving various machine learning tasks in computer vision, such as classification [1], segmentation, and denoising. One of the problems with CNN training is that the convolution operation in every convolutional layer is computationally expensive. In particular, as the size of the image or kernel increases, the amount of computation inevitably increases, which slows down training. One method proposed to solve this problem is to change the domain through the Fourier transform and construct a CNN in the frequency domain, because the convolution operation in the spatial domain is equivalent to pointwise multiplication in the Fourier domain. In general, point-by-point multiplication is simpler and computationally cheaper than convolution. Prior approaches have focused on improving computational speed to handle the time cost problem [2][3][4].
There are two factors that determine computational complexity. One is the time complexity, which was addressed in existing studies, and the other is the memory complexity, which was studied for model weight reduction in the spatial domain [5][6][7][8][9][10]. However, previous studies in the Fourier domain did not address methods of reducing the number of parameters, which directly affects memory complexity. The efficient use of memory is a critical issue since unlimited resources are not available in the real world. A recent example is an application on a mobile device, which requires a fast and lightweight model. In addition, since the number of GPUs and the amount of memory are limited, building a neural network that can learn sufficiently with a small GPU or only a few GPUs is important. CNNs are also used in a variety of engineering areas for practical applications, such as condition monitoring of marine vehicles [11] and fault diagnosis of the cerebral cortex [12]. Therefore, as a step toward performing deep learning in a limited environment in further work, we propose a method of reducing the number of parameters of the convolutional neural network, which is the base of the neural network. Furthermore, previous studies on CNNs in the Fourier domain have focused on efficient neural networks for high-speed training, and the problem of reducing the number of parameters was actively investigated in the spatial domain rather than the Fourier domain; therefore, our goal is to design an efficient CNN by applying kernel methods with few parameters.
In previous studies, the implementation of the convolutional layer and the pooling method corresponding to the spatial domain were actively investigated in the Fourier domain [13]. While the ReLU-based activation function is known to be effective in the spatial domain, an appropriate activation function in the Fourier domain has not been established. The previous approach mainly relies on an approximation of ReLU. However, it is limited in that it has a huge computational cost, cannot function as a nonlinear operation, and cannot be expected to reach an accuracy equal to or higher than that of ReLU in the spatial domain [14][15][16]. In contrast, we present an activation function for the Fourier domain that performs the same operation as ReLU in the spatial domain, and introduce a new Fourier convolutional network that applies this activation function. The novel activation function in the Fourier domain is built upon the characteristics of the Fourier transformed image, which consists of phase and magnitude, and will be covered in detail in Section 3.
We study two approaches to weight reduction in the Fourier CNN. The first is to adjust the kernel size in our proposed Fourier CNN, and the second is to learn the standard deviation (std.) of a Gaussian distribution by creating a random kernel based on compressed sensing. First, unlike the spatial domain, which carries local information, Fourier transformed images carry global information. In the spatial domain, a large kernel can be used to extract the characteristics of the image by finding the location and information of pixels locally. In contrast, in the Fourier domain, even a small kernel is expected to sufficiently identify the characteristics of an image and perform classification tasks by using global information consisting of low- and high-frequency components. Second, in compressed sensing, a random vector is generated by multiplying the sparse signal by a random matrix with a Gaussian distribution in order to compressively restore the original signal. Following this theory, we assume that a random filter can learn scalar values for a random matrix with a fixed Gaussian distribution; therefore, by learning the standard deviation of a Gaussian distribution, image classification can be performed with few parameters.
In conclusion, the contributions of the paper are listed as follows: we present a new activation function in the Fourier domain, discard the unnecessary shift in the Fourier transform process, introduce the novel convolutional neural network, using a small-sized kernel in the Fourier domain based on our proposed activation function, and investigate an efficient convolutional neural network based on the random kernel in the Fourier domain to reduce the number of weight parameters.

Related Work
Convolutional neural networks (CNNs) have been used to extract and learn image features with deep neural networks in computer vision tasks such as classification and segmentation [1]. In addition, deep neural networks, such as AlexNet, VGG, DenseNet and ResNet [17][18][19][20], have been widely developed and have been effectively applied to various tasks using large datasets, such as ImageNet [20].
However, deep neural network models have a problem that the number of learning parameters increases as the layer becomes deeper. Therefore, significant memory costs for learning a model and high computational costs are required.
In particular, it is difficult to apply in a limited environment of mobile devices with limited hardware resources. As one of the methods to solve this problem, model compression [6,8] has been explored. For example, ShuffleNets are designed to optimize for the mobile device environment [7]; MobileNet-based models are described to reduce the number of parameters through depth-wise separable convolution [5,9]; and SqueezeNet shows similar performance to AlexNet with fewer parameters by reducing the number of input channels in the 3 × 3 filter and replacing the 3 × 3 filter with the 1 × 1 filter [10].
Other techniques for model compression include pruning [21,22], distillation [23] and quantization [24]. First, pruning is a method of removing neurons or weights that carry less important information [21,22]. Second, distillation refers to a mechanism of transferring the knowledge of a larger ensemble of neural networks to a relatively small single neural network in order to solve the inefficient use of memory resources that generally occurs when models are ensembled [23]. Third, quantization is a technique of minimizing the loss of accuracy relative to full precision while using a low bit width [24].
However, these methods have a fundamental limitation in that they cannot solve the time cost of convolving the image with the kernel in the convolutional layer, which is the stage that learns the features of the image. In recent years, to solve the time-cost problem in a convolutional layer, CNNs in the Fourier domain were actively studied, using the theory that convolution in the spatial domain is equivalent to point-wise multiplication in the Fourier domain [14,16,25,26,27].
The convolutional layer in the Fourier domain has been widely explored with respect to time complexity because point-by-point multiplication in the Fourier domain is much faster than convolution in the spatial domain. Furthermore, the leading fast Fourier transform techniques are based on the discrete Fourier transform (DFT); in general, the Cooley-Tukey algorithm is used as the fast Fourier algorithm. However, training the convolutional layer in the Fourier domain requires an additional operation, the inverse Fourier transform. Therefore, pooling and activation functions in the frequency domain, which allow CNNs to be trained fully in the frequency domain, have been studied in recent years [2][3][4].
First, truncating the Fourier transformed image to its low-frequency components at a predetermined size, which serves as a spectral representation that retains only the important information, has been proposed as a method of implementing spectral pooling. However, because the Fourier transform is performed before spectral pooling and the inverse Fourier transform is used after each pooling, each iteration incurs an additional computational cost. Moreover, the proposed method has not considered the process of training the convolutional layer in the Fourier domain and has not examined the problem of computational cost [13]. Another proposed method of pooling is discrete Fourier transform (DFT) based magnitude pooling. The first component of the Fourier transformed image is the DC component. DC stands for direct current in electrical engineering, but here it simply refers to the zero-frequency term, which is proportional to the mean value of the signal. Training is implemented by calculating the magnitude of the DFT and reducing the resolution so that the retained values include the first component. However, the phase information is not considered in the DFT-based pooling method, even though preserving both the phase and magnitude of the Fourier transformed image is vital for reconstructing the image after the inverse Fourier transform. In addition, a method that uses a large number of parameters to create ensemble networks is difficult to regard as an efficient network in the frequency domain [28].
Second, suitable activation functions in the Fourier domain have been actively investigated. Research on the activation function in the Fourier domain is largely divided into two main directions. One is to roughly estimate the activation function in the Fourier domain, which has a shape similar to the activation function in the spatial domain [15,16]. The other is to present a new formula for the Fourier-based activation function, taking into account the properties of the frequency components [14]. One of the most popular functions is spectral ReLU (SReLU) [15], which is designed to approximate the conventional ReLU function in terms of a quadratic function. The basic idea of SReLU is to find a quadratic function that is determined to be roughly similar to ReLU in the spatial domain. However, calculating the quadratic function for each activation function has considerable time complexity [29]. Especially when the input size is large or the layer depth is deep, performing the activation function by SReLU causes a computational burden.
Another approach to implementing the approximation function in the Fourier domain is to find a linear function similar to the tanh and sigmoid functions in the spatial domain, using the linearity property of Fourier transform. However, the presented linear functions cannot perform as nonlinear functions, making it tough to train the complex model.
One recent approach to designing an activation function uses the properties of the low- and high-frequency components in the Fourier domain. For instance, a second harmonics superposition activation function (2SReLU) has been proposed to overlap the first and second harmonics, including the DC component of the Fourier transformed image. Since the first harmonic of the image contains low frequencies and the second harmonic of the image has some high frequencies, the neural network can be trained at both low and high frequencies. In addition, because the Fourier transformed image is composed of complex numbers, it is expressed as a magnitude value of the real and imaginary parts of the image multiplied by periodic functions, namely the cosine and sine functions, respectively. Considering the composition of the Fourier transformed image function, adding several harmonics causes the sine wave function to converge to zero, like the negative part of ReLU [14]. In Equation (1), F refers to the Fourier transform, each h_i is the ith harmonic or interval, and the hyper-parameters alpha and beta are predetermined to be 0.7 and 0.3. After the first and second harmonics are weighted by alpha and beta, respectively, the two harmonics are added as follows:
2SReLU = α · F(h_1) + β · F(h_2) (1)
Yet, in terms of classification accuracy, 2SReLU [14] fits Fourier-based CNNs poorly compared to the previous activation function, SReLU [15]. Therefore, our novel activation function is designed to fit the Fourier domain while considering the characteristics of the frequency domain image.
Recently, several complex value-based activation functions, such as modReLU [30], zReLU [31], and complex ReLU (CReLU [32]), were introduced. First, modReLU is defined as Equation (3), where z is a complex number, the phase of z is denoted as θ_z, and b is a learnable bias term. The activation passes the input when |z| + b is positive and outputs zero otherwise. The equation is designed to preserve the pre-activated phase information [30].
modReLU(z) = ReLU(|z| + b) e^{iθ_z} (3)
Second, zReLU is described as Equation (4). zReLU keeps the input z when its phase lies in the first quadrant; otherwise, it replaces the input with 0 [31].
zReLU(z) = z if θ_z ∈ [0, π/2], and 0 otherwise (4)
Third, CReLU is the latest complex number-based activation function and obtains more information than modReLU and zReLU. It is given by Equation (5).
CReLU(z) = ReLU(Re(z)) + i ReLU(Im(z)) (5)
Assuming that the positive and negative values of the complex-valued image are represented in the four quadrants, CReLU has the advantage that information can be obtained from the remaining three quadrants, except when both real and imaginary components are less than 0 since CReLU applies ReLU to the real and imaginary components, respectively. On the other hand, in the case of modReLU, due to the learning bias term b, a circle with a radius of length b is inactive, and the outside of the circle is active. zReLU is active only when the input phase is in the first quadrant.
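The three complex-valued activations of Equations (3) to (5) can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard definitions from the cited works; the function names, the sample points, and the bias values are our own choices and are not taken from the original implementation.

```python
import numpy as np

def mod_relu(z, b):
    """modReLU: shift the magnitude by the bias b and keep the phase."""
    mag = np.abs(z)
    # Avoid division by zero for z == 0; the result there is zero anyway.
    safe_mag = np.where(mag > 0, mag, 1.0)
    return np.maximum(mag + b, 0.0) / safe_mag * z

def z_relu(z):
    """zReLU: keep z only when its phase lies in the first quadrant."""
    phase = np.angle(z)
    keep = (phase >= 0) & (phase <= np.pi / 2)
    return np.where(keep, z, 0)

def c_relu(z):
    """CReLU: apply ReLU to the real and imaginary parts separately."""
    return np.maximum(z.real, 0) + 1j * np.maximum(z.imag, 0)

# One sample point per quadrant (illustrative values).
z = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j])
print(c_relu(z))          # zero only in the third quadrant
print(z_relu(z))          # non-zero only in the first quadrant
print(mod_relu(z, -2.0))  # |z| < 2 everywhere, so all inputs are inactive
```

With these sample points, CReLU discards information only for the point in the third quadrant, whereas zReLU keeps only the first-quadrant point, which matches the comparison in the text.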
Furthermore, Fourier-based CNNs were researched with the aim of completing training entirely in the frequency domain before the fully connected layers [25][26][27], in order to eliminate the additional computation of performing the inverse Fourier transform after applying activation [16] or pooling layers [15,33]. However, the previous methods of training CNNs within the Fourier domain have several limitations. First, the activation function of the Fourier domain corresponding to ReLU in the conventional CNN in the spatial domain has not yet been explicitly described [25][26][27][33]. Second, the existing research on reducing memory cost in the Fourier domain was not conducted in the entire Fourier domain. In addition, the previously presented tanh-based activation function in the spectral domain operates on a different principle from ReLU in conventional CNNs [33]. Third, earlier studies on memory cost only considered zero sparsity, using model compression. However, applying a kernel with a small number of parameters can be another efficient training process and has the advantage that it can be applied to various CNNs, regardless of the architecture, without being limited to a compressed model. Therefore, we intend to design an efficient CNN in the Fourier domain by reducing the number of parameters, using different kernel methods that directly reduce the memory cost.

Compressed Sensing
Compressed sensing is used to reconstruct original signals and images from a small number of samples, and was developed from traditional Shannon-Nyquist sampling, which converts an analog signal to a digital signal. The Shannon-Nyquist theorem states that a signal must be sampled at a rate of at least twice its highest frequency. In other words, reconstruction is possible only when the samples are acquired at or above the Nyquist sampling rate. Otherwise, it is not easy to restore the original data since aliasing occurs. In general, data obtained from nature are measured with a device, such as an analog-to-digital converter (ADC), to convert the data from nature to digital form. Only a few data points may be acquired, as the data scan speed can be slow, or the machine can be expensive.
On the other hand, signals in limited environments can be reconstructed through compressed sensing from samples acquired below the Nyquist sampling rate [34]. The most basic purpose of this theory is to compress and sample the model by converting the under-determined system into an over-determined system, using sparsity. In general, sparsity is found in raw data that carry a small amount of information, such as sound, video, and images. In addition, sparse signals can be obtained when the original data are transformed to a specific domain. According to this principle, when the original signal to be restored is x and the measured signal is Y, the signal processing of compressed sensing can be defined as shown in Equation (6) [34].
Y = φx (6)
As shown in Equation (6), x is a signal vector of size N × 1, and Y is a signal vector of size M × 1. According to the theory of compressed sensing, the condition M ≤ N must be satisfied; in practice, M can be much smaller than N. The main point of Equation (6) is to find the M-by-N matrix that transforms the under-determined system into an over-determined one. The measurement matrix φ, which models a device such as an ADC, is generally drawn from a random distribution, such as a Gaussian [35]. In addition, the original signal x can be expressed as the product of a representation matrix and a sparse signal, as shown in Equation (7), where s is a sparse vector of size N × 1, and ψ is a matrix of size N × N.
x = ψs (7)
In other words, Equation (7) states that s is the vector of coefficients of x in the basis ψ. Y is then summarized as shown in Equation (10) by Equations (8) and (9): substituting x = ψs into Y = φx gives Y = φψs, and writing Θ = φψ yields Y = Θs. In other words, Y can be expressed as the product of Θ, which is a sensing matrix of size M × N, and a sparse vector s.
The property of the sensing matrix Θ, which is also called a reconstruction matrix, is characterized by the restricted isometry property (RIP) of Θ, and the RIP is expressed by Equation (11).
(1 − δ_I)‖s‖₂² ≤ ‖Θs‖₂² ≤ (1 + δ_I)‖s‖₂² (11)
The isometry constant δ_I (I > 0) in Equation (11) is the smallest constant for which the inequalities hold. When δ_I is near zero, the left and right terms of the two inequalities approach each other, so the norm of the sparse signal s is almost preserved and a suitable sensing matrix is obtained. By finding I that satisfies these conditions, we can obtain a sensing matrix from which the signal can be restored. In particular, Θ should be a random sensing matrix in order to satisfy the RIP; therefore, it is important to have a randomly distributed matrix for Θ [34]. The reconstruction process of compressed sensing is shown in Figure 1, which is inspired by [34]. According to Figure 1, the measurement vector Y is a vector of size M × 1, which is the product of the M × N measurement matrix φ and the N × 1 original signal x, where M ≤ N. The original signal x can also be represented by the product of the N × N random representation matrix ψ and the N × 1 sparse vector s. According to these two formulas, the vector Y can be expressed as the product of the M × N random matrix Θ (= φψ) and the N × 1 sparse vector s.
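The measurement model described above can be illustrated with a short NumPy sketch. It is a minimal example under our own assumptions: the sizes N, M and K, the identity representation matrix ψ, and the 1/√M scaling of the Gaussian measurement matrix are illustrative choices rather than values from the paper. The sketch builds Y = φx = Θs and checks how well the norm of one sparse vector is preserved, which is the quantity bounded by the RIP.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 256, 128, 8  # signal length, measurements, non-zeros (illustrative)

# Sparse coefficient vector s with K non-zero entries.
s = np.zeros(N)
support = rng.choice(N, size=K, replace=False)
s[support] = rng.standard_normal(K)

# Representation matrix psi (N x N); the identity basis is assumed here.
psi = np.eye(N)
x = psi @ s  # original signal, x = psi s

# Gaussian measurement matrix phi (M x N), scaled so that the expected
# squared norm of phi @ x equals the squared norm of x.
phi = rng.standard_normal((M, N)) / np.sqrt(M)
Y = phi @ x  # measurement vector, Y = phi x

# Sensing matrix Theta = phi psi, so that Y = Theta s.
theta = phi @ psi
ratio = np.sum((theta @ s) ** 2) / np.sum(s ** 2)
print(Y.shape, ratio)
```

A ratio close to one for sparse vectors corresponds to a small isometry constant. The sketch checks a single vector only and does not verify the RIP for all sparse vectors.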

Random Kernel
The traditional method of learning an image classification task is to train convolutional neural networks (CNNs) in the spatial domain. Broadly, a CNN is divided into convolutional layers, which learn the features of an image, and a classifier composed of fully connected layers, which determines the class of the image.
The convolutional layer is formed by the convolution operation between an input image and a kernel composed of learnable parameters. The convolutional layer of the existing LeNet-5 model [1] uses a kernel of size 5 × 5; the AlexNet model [20] uses kernels of various sizes, such as 11 × 11, 5 × 5, and 3 × 3. In other words, the LeNet-5 model requires 25 trainable parameters per channel, and the AlexNet model needs 121, 25, or 9 parameters, depending on the kernel size. Since the number of learnable parameters increases in proportion to the size of the kernel, and a large number of weight parameters leads to an expensive memory cost, various methods of reducing the number of parameters in CNN-based models are actively investigated [5,9,10,21,22,23,24].
Many architectures have been proposed to reduce the number of parameters of CNNs in the spatial domain, but they are limited in that the number of parameters used in the convolutional layer differs depending on the architecture of each proposed model [5,9,10]. For example, for a kernel of size K_h × K_w, the number of parameters is determined by (batch size) × (the number of kernels) × (the number of input channels) × (K_h × K_w), so the number of weight parameters is primarily affected by the kernel size and can increase rapidly as the kernel grows. Therefore, we propose a random kernel similar to the random vector in compressed sensing, which reconstructs the original signal by multiplying a random matrix and a sparse vector. Our random kernel makes the contribution of the kernel size, one of the main factors determining the number of parameters, the same for all architectures so that the same performance as the conventional CNN kernel can be achieved with fewer parameters. As shown at the top of Figure 2, conventional kernels have (kernel height × kernel width) parameters; for example, 3 × 3, 5 × 5 and 7 × 7 kernels have 9, 25 and 49 learnable parameters, respectively. In contrast, our proposed random kernel has only one learnable parameter, denoted as α, regardless of the size of the kernel; therefore, 3 × 3, 5 × 5 and 7 × 7 kernels all have the same number of parameters in a single channel, as shown at the bottom of Figure 2. In addition, the random kernel consists of the product of a single trainable scalar and an untrainable fixed random matrix initialized with a Gaussian distribution. Therefore, the number of parameters of the random kernel can be expressed as (batch size) × (the number of kernels) × (the number of input channels) × (α). In other words, the number of learnable parameters for each convolutional layer is determined by the number of kernels and the number of input channels.
In conclusion, we can expect the advantages of the random kernel to be as follows: first, a random kernel reduces the number of parameters per kernel from K_h × K_w to one, regardless of the size of the kernel or the depth of the neural network's layers; second, the random kernel can be applied to any type of CNN-based model; third, we can potentially expect this to be a method for building a light-weight model with a low memory cost.
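The parameter counts discussed above can be made concrete with a small sketch. The helper names and the example layer (six kernels, one input channel) are hypothetical and chosen only for illustration. The batch-size factor that appears in the formula in the text is common to both counts and is omitted here.

```python
def conventional_kernel_params(n_kernels, in_channels, k_h, k_w):
    """Learnable weights of a conventional convolutional layer (bias excluded)."""
    return n_kernels * in_channels * k_h * k_w

def random_kernel_params(n_kernels, in_channels, scalars_per_kernel=1):
    """Learnable weights when each kernel is a fixed random matrix times alpha."""
    return n_kernels * in_channels * scalars_per_kernel

# Hypothetical first layer: 6 kernels applied to a 1-channel input.
for size in (3, 5, 7):
    conv = conventional_kernel_params(6, 1, size, size)
    rand = random_kernel_params(6, 1)
    print(f"{size}x{size}: conventional={conv}, random kernel={rand}")
```

For this hypothetical layer the conventional counts are 54, 150 and 294, whereas the random kernel needs 6 scalars for every kernel size. In the Fourier domain, where α has a real and an imaginary part, `scalars_per_kernel=2` gives 12.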

Fourier Convolutional Neural Network
The convolutional layer of convolution-based neural network in the spatial domain is configured to learn the features of an image through a convolution operation between an image and a kernel. However, the convolution operation has a disadvantage in that the calculation is complicated because the input image is multiplied and added by rolling a window as large as the kernel size.
When the image in the spatial domain is transformed into the frequency domain through the Fourier transform, the computational cost is reduced because a simple operator, point-wise multiplication, is used instead of the convolution operation. Assuming that functions f and g of x are given in the spatial domain, and that their Fourier transforms F and G are functions of X in the frequency domain, Equation (12) is described as follows:
f(x) ∗ g(x) ⟷ F(X) · G(X) (12)
For example, given input x ∈ ℝ^{h×w} and kernel k ∈ ℝ^{k×l} in the spatial domain, the time complexity of the convolution is O(hwkl). On the other hand, computing F(x) ∈ ℂ^{M×N} with the fast Fourier transform incurs a computation cost of O(MN log(MN)). Recently, in order to train image classification tasks in the Fourier domain, finding an adequate activation function [14,15,36] and sub-sampling operation [13] in the Fourier domain equivalent to ReLU or max-pooling in the spatial domain was widely explored. However, one of the limitations of the existing studies on Fourier-based activation functions is that the activation functions do not exactly match ReLU. Therefore, this section introduces a Fourier-based activation function that behaves in the same way as ReLU in the spatial domain, and presents a convolutional neural network that is trained fully in the Fourier domain. Moreover, we remove unnecessary computations to reduce the additional cost of the Fourier transform. First, the low- and high-frequency components of all Fourier-transformed images are used without the shift Fourier transform operation. Second, the Fourier transform of the kernel, which is required when a traditional convolutional neural network is moved to the Fourier domain, is omitted by using the property that the Fourier transform of a Gaussian is also Gaussian.
Lastly, by applying the random kernel discussed in Section 3.1.2 to a Fourier-based convolutional neural network, we aim to build an effective convolutional neural network with similar accuracy to conventional spatial convolutional neural networks with few parameters.
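The equivalence between convolution and point-wise multiplication used in this section can be checked numerically. The sketch below is a minimal NumPy example under our own assumptions (an 8 × 8 input and a 3 × 3 kernel with random values). Note that the discrete Fourier transform corresponds to circular convolution, so the direct computation wraps around the image borders.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, kh, kw = 8, 8, 3, 3  # illustrative sizes
x = rng.standard_normal((h, w))
k = rng.standard_normal((kh, kw))

# Zero-pad the kernel at the bottom and right to the size of the input.
k_pad = np.zeros((h, w))
k_pad[:kh, :kw] = k

# Frequency-domain route: point-wise product of the two spectra.
y_fourier = np.fft.ifft2(np.fft.fft2(x) * np.fft.fft2(k_pad)).real

# Spatial route: direct circular convolution, O(h * w * kh * kw).
y_direct = np.zeros((h, w))
for i in range(h):
    for j in range(w):
        for a in range(kh):
            for b in range(kw):
                y_direct[i, j] += k[a, b] * x[(i - a) % h, (j - b) % w]

print(np.allclose(y_fourier, y_direct))
```

The two results agree up to floating-point error. The frequency-domain route replaces the four nested loops with two forward transforms, one element-wise product, and one inverse transform.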

Fourier Convolutional Layer
Given any number x ∈ ℝ, the Fourier transform F : ℝ → ℂ is defined as in Equation (13).
The standard practice in previous work has been to rearrange the frequency components by shifting the low-frequency portion to the image center, as shown on the right side of Figure 3. The first component of the Fourier transformed image represents the zero frequency, or DC component, which represents the mean in the frequency domain and is also relocated to the center of the image. Unlike the traditional method, we eliminate this additional operation by removing the shift. Without the Fourier shift operation, the input image is first Fourier transformed channel-wise, and the low-frequency components lie toward the corners of the image and the high-frequency components toward the center, as shown in the center image of Figure 3. In addition, the DC component is located at the top left corner. The converted input F(x) is multiplied element-wise, for each channel, with the kernel F(k) ∈ ℂ^{M×N}, which has the same size as the input x. Given input image x ∈ ℂ^{height×width} and kernel k ∈ ℝ^{k×l}, the most basic technique for the kernel's Fourier transform is to apply the Fourier transform after zero-padding the kernel at the right and bottom with sizes (width − l) and (height − k), respectively.
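The layout of the unshifted spectrum and the zero-padding of the kernel can be sketched with NumPy as follows. The image size, the kernel size, and the helper name `pad_kernel` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((8, 8))  # illustrative single-channel image
k = rng.random((3, 3))  # illustrative spatial kernel

# Without the shift, the DC component sits at the top left corner and
# equals the sum of all pixels.
F = np.fft.fft2(x)
dc_unshifted = F[0, 0]

# With the conventional shift, the DC component moves to the center.
F_shifted = np.fft.fftshift(F)
dc_shifted = F_shifted[4, 4]

def pad_kernel(kernel, height, width):
    """Zero-pad a kernel at the bottom and right to (height, width)."""
    kh, kw = kernel.shape
    return np.pad(kernel, ((0, height - kh), (0, width - kw)))

# Fourier transform of the padded kernel; passing s=x.shape to fft2
# performs the same bottom/right zero-padding internally.
K = np.fft.fft2(pad_kernel(k, *x.shape))
print(np.isclose(dc_unshifted, x.sum()), np.isclose(dc_shifted, dc_unshifted))
print(np.allclose(K, np.fft.fft2(k, s=x.shape)))
```

Skipping `fftshift` saves one rearrangement per transform and leaves the point-wise product unchanged, as long as the image spectrum and the kernel spectrum use the same layout.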
Our proposed Gaussian random kernel fulfills two purposes. The first is to use a random kernel that learns only one alpha parameter per channel, as in the method introduced in Section 3.1.2. The second is to reduce the computation cost by removing the two steps of zero-padding and Fourier transform, using the property that the Gaussian function is also a Gaussian function even after the Fourier transform. Our proposed random kernel is the same as multiplying the Gaussian distributed random matrix by the alpha representing the distribution scale, so it learns the standard deviation (std.) of the Gaussian distribution in the Fourier domain.
Since the Fourier transformed image F(x) is complex-valued, the random kernel K should also be initialized as a complex-valued matrix of the same size. In the case of a convolutional layer based on a random kernel, as shown in Figure 4, the real and imaginary components of the random matrix are each initialized from a Gaussian distribution. In practice, however, point-wise multiplication in the complex-valued domain is performed as in Equation (16). According to the multiplication rule for complex numbers in rectangular form, given F(x) = A + Bi and K = C + Di (F(x) ∈ ℂ, K ∈ ℂ), the product is defined as follows:
F(x) · K = (AC − BD) + (AD + BC)i
In addition, the standard deviation α is initialized as a complex-valued scalar, with separate real and imaginary parts, which multiplies the product of F(x) and K. As illustrated in Figure 4, our convolutional layer based on the random kernel is expressed as Equation (15).
In practice, our proposed method implements the convolutional layer in the Fourier domain in two ways, as shown in Figure 4. One is the random kernel-based convolutional layer, and the other is the convolutional layer with a 3 × 3 kernel, regardless of the size of the conventional kernel used for the convolutional layer in the spatial domain.
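A random kernel-based layer of this kind might look as follows in NumPy. This is a sketch under stated assumptions, not the authors' implementation: the function names, the tensor shapes, and the unit-variance initialization are hypothetical, and α is treated here as one complex scalar per channel. Scaling the real and imaginary parts of the product separately would be an equally plausible reading of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_random_kernel(channels, height, width, rng):
    """Fixed (untrainable) complex Gaussian matrix, one per channel.

    The kernel is created directly in the Fourier domain, so no
    zero-padding or Fourier transform of the kernel is needed.
    """
    real = rng.standard_normal((channels, height, width))
    imag = rng.standard_normal((channels, height, width))
    return real + 1j * imag

def random_kernel_layer(x, R, alpha):
    """x: (C, H, W) real input, R: (C, H, W) fixed complex kernel,
    alpha: (C,) trainable complex scalars."""
    Fx = np.fft.fft2(x)  # channel-wise 2-D transform over the last two axes
    return alpha[:, None, None] * (Fx * R)

C, H, W = 3, 8, 8  # illustrative sizes
x = rng.standard_normal((C, H, W))
R = make_random_kernel(C, H, W, rng)
alpha = np.ones(C) + 1j * np.ones(C)  # two real scalars per channel
out = random_kernel_layer(x, R, alpha)
n_trainable = 2 * alpha.size
print(out.shape, n_trainable)
```

Only `alpha` would be updated during training; `R` stays fixed after initialization, so the layer stores two trainable real numbers per channel regardless of the spatial size.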

Fourier Activation Function: PhaseReLU
The standard convolutional neural networks (CNNs) convert the affine function to nonlinear after the activation function, and thus help to construct more complex and various types of models by increasing the learning capacity. The most stable and widely used activation function in the spatial domain is the ReLU function [37]. However, the corre-sponding activation function in the Fourier domain is not officially established. In particular, as a result of comparing the officially proposed complex-valued activation function, CReLU, zReLU with ReLU for the baseline and the state-of-the-art Fourier activation function 2SReLU, they all perform poorly. Therefore, we propose a new Fourier-transformed activation function, PhaseReLU-suitable to substitute for ReLU.  The Fourier transformed image z can be expressed in rectangular form and polar form, described in Equations (17) and (18), respectively.
In addition, the Fourier transformed image x is composed of phase ∠x or φ, and magnitude |x|. Let us assume that φ is a phase of the image; then, each a ∈ and b ∈ is also expressed as cos φ and sin φ, respectively. According to Equations (19) and (20), each component of the magnitude and phase in the polar form is described as follows: Since the magnitude is equal to the square root of multiplying real and imaginary components, it is always positive. In the case of the phase, which is denoted as φ z ∈ (−π, π], it can be negative or positive, depending on the given complex value. Using these properties, we first take the feature map received from the convolutional layer, and decompose it into the phase and magnitude. We only consider when the phase is φ z ∈ (− π 2 , π 2 ], so we can expect z > 0. Therefore, same as the ReLU, the original values are preserved when the given complex value is positive and the values are replaced with zeros for the negative. In order to pass the activation map to the pooling process, the phase and magnitude should be recomposed after the previous step. Our newly proposed activation function in the frequency domain, PhaseReLU, is proposed in Equation (21). Our implementation of comparing the accuracy for the PhaseReLU to the existing activation functions, such as CReLU, zReLU, 2SReLU and baseline ReLU with Fashion-MNIST datasets on LeNet-3, is shown in Table 1. As a result, PhaseReLu can replace ReLU, as it can perform classification tasks better than baseline ReLU and other previous activation functions.
$$\mathrm{PhaseReLU}(z) = |z|\,\big(\cos(\mathrm{ReLU}(\varphi_z)) + i\,\sin(\mathrm{ReLU}(\varphi_z))\big) \tag{21}$$

One of the crucial processes of convolutional neural networks is down-sampling, which reduces the image resolution to prevent over-fitting and latency of the model. For example, the max-pooling layer and the average-pooling layer are widely used. The best-known pooling method in the Fourier domain, presented as a substitute for max-pooling, is spectral pooling [13]. Spectral pooling crops the low-frequency components, which lie at the center of the shifted spectrum, without losing considerable image information. However, because we omit the shift operation, the spectrum is instead truncated at its four corners. To implement our down-sampling, we consider the cyclic shifting theorem introduced in the FFTW library [38]. Our proposed sub-sampling spectral pooling consists of two steps. First, the kernel is padded and expanded to the input size. Second, the kernel is shifted and wrapped around the image in two dimensions, depending on the kernel case. According to Figure 5, when the activation map is denoted as I and the expected size of the kernel is k × k, the spectral non-shift pooling method is described in two ways, depending on whether the kernel size is divisible by four. For example, a 4 × 4 kernel, whose size is divisible by four (k mod 4 = 0), and a 6 × 6 kernel, whose size is not (6 mod 4 = 2), are shown at the top and the bottom of Figure 5, respectively. Our proposed low-frequency cropping method without using the cyclic concept is shown on the right side of Figure 5.
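The idea of cropping low frequencies without a shift can be sketched as follows. This is a generic corner-cropping sketch, not the exact procedure of Figure 5: it assumes an unshifted 2D FFT (zero frequency at index [0, 0]) and does not distinguish the two kernel-size cases described above. The function name `nonshift_spectral_pool` is hypothetical.

```python
import numpy as np

def nonshift_spectral_pool(spectrum, k):
    """Keep the k x k lowest frequencies of an unshifted 2D spectrum.

    In an unshifted FFT the non-negative frequencies sit at the start of
    each axis and the negative ones at the end, so the low-frequency
    block is assembled from the four corners.
    """
    h, w = spectrum.shape[-2:]
    pos = (k + 1) // 2   # number of non-negative frequencies kept
    neg = k // 2         # number of negative frequencies kept
    rows = np.r_[0:pos, h - neg:h]
    cols = np.r_[0:pos, w - neg:w]
    return spectrum[..., rows[:, None], cols[None, :]]
```

For a spectrum with even side lengths, the result matches the conventional shift, center-crop, unshift sequence, which is why the explicit shift can be dropped in this sketch.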

Architecture of Efficient Fourier CNN
As shown in Figure 6, let us assume that X i , k i , R i and α i or i represent the ith spatial input, spatial kernel, random kernel and scalar, respectively. Each F(y) i and F(y ) i for ith layer refers to the output of a Fourier CNN and an efficient Fourier CNN, respectively. Both the Fourier CNN and the efficient Fourier CNN are trained entirely in the frequency domain but differ slightly in how the kernel is applied. First, the kernel of the Fourier CNN is zero-padded in the spatial domain to the same size as the input image. Unlike conventional CNNs in the spatial domain, Fourier CNNs use a 3 × 3 kernel for any given CNN-based model, requiring only nine parameters per channel. Then, the Fourier transformed kernel is multiplied pointwise by the Fourier transformed image. However, the number of parameters of a Fourier CNN can still increase dramatically as the number of layers grows. Thus, we propose another method, an efficient Fourier CNN, based on a compressed random kernel that substantially reduces the number of parameters. The bottom of Figure 6 shows that the efficient Fourier CNN starts from the multiplication of a random kernel, generated from a Gaussian distribution, with the Fourier transformed image. For an output feature map of complex values, the scale of the Gaussian distribution in the complex domain is learned by multiplying the real and imaginary parts by their respective scalars (alpha values).
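The two kernel schemes can be sketched for a single channel as below. This is a minimal sketch under stated assumptions: the random kernel is taken to be a fixed pair of real arrays of the input size (one for the real part, one for the imaginary part), and the names `fcnn_layer`, `efcnn_layer`, `alpha_re` and `alpha_im` are hypothetical, not the authors' identifiers.

```python
import numpy as np

def fcnn_layer(x_hat, kernel):
    """Fourier CNN: zero-pad a small spatial kernel, transform, multiply."""
    h, w = x_hat.shape
    padded = np.zeros((h, w))
    padded[:kernel.shape[0], :kernel.shape[1]] = kernel   # e.g. 3 x 3 weights
    return x_hat * np.fft.fft2(padded)                    # pointwise product

def efcnn_layer(x_hat, rand_re, rand_im, alpha_re, alpha_im):
    """Efficient Fourier CNN: fixed random kernel, two learnable scalars."""
    k_hat = alpha_re * rand_re + 1j * alpha_im * rand_im
    return x_hat * k_hat

rng = np.random.default_rng(0)
x_hat = np.fft.fft2(rng.standard_normal((28, 28)))        # transformed input
y_f = fcnn_layer(x_hat, rng.standard_normal((3, 3)))      # 9 parameters
y_ef = efcnn_layer(x_hat,
                   rng.standard_normal((28, 28)),         # fixed, not trained
                   rng.standard_normal((28, 28)),         # fixed, not trained
                   alpha_re=1.0, alpha_im=1.0)            # 2 parameters
```

In this sketch the Fourier CNN layer learns the nine spatial weights, whereas the efficient variant keeps the random arrays fixed and learns only the two scalars.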
After the convolutional layer, the Fourier-based activation function, PhaseReLU (or PReLU), is applied. Since no cyclic shift is applied for pooling in the frequency domain, the activation map is down-sampled with non-shift spectral pooling (FPool). Before the fully connected layer (FC layer), the magnitude (Mag), which contains important image information, is computed, converting the complex values to real numbers. Finally, the FC layer is applied after vectorization to classify the given image.
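The last two stages (Mag, then vectorization and the FC layer) can be illustrated with a short sketch. It assumes a single sample and a plain affine FC layer; the function name `magnitude_fc_head` and the shapes used are hypothetical.

```python
import numpy as np

def magnitude_fc_head(feature_maps, weights, bias):
    """Convert complex feature maps to real magnitudes, flatten, apply FC."""
    magnitude = np.abs(feature_maps)     # complex -> real (Mag)
    vector = magnitude.reshape(-1)       # vectorization
    return weights @ vector + bias       # class scores (FC layer)

rng = np.random.default_rng(0)
maps = rng.standard_normal((16, 4, 4)) + 1j * rng.standard_normal((16, 4, 4))
scores = magnitude_fc_head(maps,
                           rng.standard_normal((10, 16 * 4 * 4)),
                           np.zeros(10))
```

Here `scores` is a real-valued vector with one entry per class, even though every earlier stage operates on complex values.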
To summarize, compared with prior studies on convolutional neural networks in the frequency domain, the proposed methods have three main differences. First, we focus on the light-weighting of the CNN-based model in the frequency domain. Second, we introduce a method to entirely implement CNNs in the Fourier domain without inverse Fourier transform. Third, we present a competitive activation function in the frequency domain.

Experimental Results
We evaluate our proposed methods, Fourier CNN (F-CNN) and efficient Fourier CNN (EF-CNN), in the frequency domain on various gray-scale image datasets for shallow neural networks and deep neural networks. We convert color image datasets, such as CIFAR-10, CIFAR-100 and SVHN, to gray to set all datasets as gray-scale images. We also measure the accuracy versus the number of parameters for our two methods and a baseline method. The baseline method is a traditional convolutional neural network (CNN) in the spatial domain.

Analysis of Fourier CNN
We evaluate Fourier CNN for shallow networks, especially for LeNet-3 and LeNet-5. In order to reduce the number of parameters, we conduct an experiment using a 3 × 3 size kernel in the Fourier domain instead of the traditional 5 × 5 sized kernel in the spatial domain. All results of the model accuracy are the average of the last ten epochs over ten different files.

Shallow Fourier Neural Network
We first test the Fourier CNN method on a 28 × 28 size MNIST dataset consisting of 10 classes with 60,000 training images and 10,000 test images. We examine our method on the Fashion-MNIST dataset of size 28 × 28, which has the same number of training and validation sets as MNIST, consisting of 10 classes, but is more complex to learn.
To observe the results of training the presented method on larger gray-scale images, we evaluate 32 × 32 gray-scale image datasets, such as CIFAR-10 (G), CIFAR-100 (G), and SVHN(G). The CIFAR-based dataset and SVHN consist of 50,000 and 73,257 for training images and 10,000 and 26,032 for test images, respectively. There are three primary purposes for observing these datasets. First, since these are more complex than the original gray-scale images, we make sure that our method is well trained on complex images. Second, we check how our method is validated for more classes through the CIFAR-100 (G), which consists of 100 classes. Third, we examine the applicability of our proposed method for real-world data with SVHN, a real-world numerical data set.
We train Fourier CNN on LeNet-3 and LeNet-5 with one GPU for 70 epochs with a mini-batch size of 128. The model is optimized by stochastic gradient descent (SGD) with a double-sigmoid learning-rate schedule, whose initial, peak and final learning rates are 0.001, 0.01 and 1 × 10^−5, respectively. We also adopt a weight decay of 0.0005 and a momentum of 0.9 for SGD. The kernel weights are initialized from a normal distribution with a mean of 0 and a standard deviation of 1.
We compare the accuracy and the parameter quantities of existing CNNs, fully connected layer (FC) and our Fourier CNN (F-CNN), for LeNet-3 in Table 2 and LeNet-5 in Table 3 on given various datasets. To verify the performance of the convolutional layer of the CNN-based model, we also conduct an experiment on a simple fully connected layer to which the convolutional layer is not applied. According to Tables 2 and 3, we confirm that the existing CNN and F-CNN have higher accuracy than the FC-based method for both LeNet-3 and LeNet-5 on all datasets; therefore, we argue that the convolutional layer that extracts image features plays an important role in increasing the accuracy.
We further compare the accuracy relative to the number of parameters for the CNN in the spatial domain and our method, F-CNN. Table 2 shows that our F-CNN on LeNet-3 mostly outperforms the CNN and otherwise achieves similar results on all datasets, despite using a 3 × 3 kernel that reduces the parameter count by 1654. In particular, the accuracy of our F-CNN is improved by 0.1%, 0.4% and 1.5% on Fashion-MNIST, CIFAR-100(G) and SVHN, respectively. We run further experiments on LeNet-5, as shown in Table 3, where F-CNN also achieves accuracy competitive with the CNN on most of the datasets. For example, the accuracy on Fashion-MNIST, CIFAR-10(G) and CIFAR-100(G) improves by about 0.1%, 0.4% and 1.4%, respectively, and SVHN shows the same result as the existing method.
To summarize, our presented Fourier CNN shows higher or competitive results on an image classification task with a small number of weight parameters in the shallow network.

Deep Fourier Neural Networks: VGG09, VGG11 and VGG13
We conduct several experiments for deep neural networks, namely VGG-09, VGG-11 and VGG-13, on the same datasets used for the shallow networks. In order to prevent over-fitting in the deep neural networks and to check the performance of our method, we double the image resolution. For instance, MNIST and Fashion-MNIST are expanded to 56 × 56, and CIFAR-10, CIFAR-100 and SVHN are resized to 64 × 64. The hyperparameters and experimental settings are the same as for the shallow networks, except that training runs for 100 epochs.

Table 4 shows the accuracy of CNN and Fourier CNN for VGG-09 and VGG-11. Overall, in the VGG-09 model, F-CNN shows almost the same accuracy as the traditional CNN, and in VGG-11, accuracy is improved by 0.1% and 0.2% on Fashion-MNIST and SVHN, respectively. We further test on the VGG-13 model, as shown in Table 5. Our F-CNN has almost the same accuracy as the CNN in the spatial domain on all datasets and shows an improvement of 0.9% on CIFAR-100. In conclusion, we argue that the CNN in the spatial domain can be replaced with our proposed method, F-CNN, in the Fourier domain.

Analysis of Efficient Fourier CNN
In the previous experiments, we reduced the number of weights by replacing the 5 × 5 kernel with a 3 × 3 kernel in the shallow networks. As a further step, we test our efficient Fourier CNN (EF-CNN) in shallow networks with even fewer parameters in order to check whether EF-CNN is competitive with the conventional CNN. We train our method for 70 epochs and use the same hyper-parameters as in Section 4.1.1. For weight initialization, the random filters for the real part and the imaginary part are each initialized from a normal distribution, and the scalars for the real and imaginary parts are initialized to 1.

Tables 6 and 7 show the results of the experiments with the CNN, F-CNN and EF-CNN models on gray-scale image datasets for LeNet-3 and LeNet-5, respectively. The results verify that the efficient Fourier CNN achieves performance very similar to that of the CNN with fewer parameters in shallow networks. To summarize, one of our proposed methods, F-CNN, uses a 3 × 3 kernel to perform image classification with high accuracy while using a small number of parameters in shallow networks. The other method, EF-CNN, uses a random filter instead of the conventional Fourier kernel and learns only the real and imaginary scalars for each channel.
As shown in Table 8, we compare the number of parameters when training on a 28 × 28 image dataset in LeNet-5. First, in the case of Fourier CNN, the numbers of trainable parameters in the first convolutional layer (Conv1), Conv2 and Conv3 are reduced to about ×0.34, ×0.35 and ×0.35 of those of the CNN, respectively. In addition, the total number of trainable parameters in LeNet-5 is reduced to ×0.94. For EF-CNN, all three convolutional layers are reduced to ×0.07, so that the total number of trainable parameters is ×0.91 that of the CNN.
The result of Table 8 shows that our proposed methods can dramatically reduce the number of weight parameters in convolutional layers. Since the convolutional layer deepens for state-of-the-art deep neural networks, which require a massive amount of trainable parameters, our proposed methods have the potential to efficiently train with fewer parameters. We also observe the comparison of the total number of trainable parameters for the shallow network, according to Table 9. Our final statement about the experiments of F-CNN and EF-CNN is that instead of traditional CNNs in spatial domains where more parameters are required, these two methods can become future replacements for training image classification tasks.

Discussion
In this work, we demonstrated that CNNs in the Fourier domain, using our presented activation function, PhaseReLU, and random kernel, have the potential for computational efficiency, especially in reducing the number of parameters.
One possible direction for future work is to extend the method from images to video tasks and to consider other types of neural networks, such as LSTMs and RNNs. Other types of transforms, such as wavelets and the DCT, are also candidates. In addition, computational costs and resources need to be considered in future work, which can be addressed by using optimized libraries instead of our manually coded system.

Conclusions
We have transformed the convolutional neural network from the spatial domain to the frequency domain by Fourier transformation in order to perform image classification with a reduced number of parameters. We have also introduced the newly proposed PhaseReLU, which is equivalent to ReLU in the spatial domain. In addition, we have removed the unnecessary shift operation used in the Fourier domain, thereby opening the possibility of reducing latency in the future. To reduce the number of CNN weight parameters in the Fourier domain, the Fourier CNN was always trained with a kernel size of 3 × 3, regardless of the kernel size used in the spatial domain. This approach was validated through experiments on shallow and deep networks, where it showed competitive performance. Furthermore, the efficient Fourier CNN was newly proposed by applying a random kernel to the Fourier CNN, using the principle of compressed sensing. In our experiments, it obtained an accuracy very similar to that of the existing shallow network in the spatial domain, even though only simple scalars were learned. In conclusion, we have proposed a method to train image classification models with a small number of parameters through a convolutional neural network in the Fourier domain. Furthermore, the advantage of the proposed kernel method is that it can be applied to any architecture based on CNNs and substantially reduces the number of weight parameters of the convolutional layers. Future research is expected to develop a method that also learns well in deep neural networks and to improve speed by taking advantage of point-wise multiplication in the Fourier domain, which requires much less computation than convolution in the spatial domain.