Deep Learning-Based Banknote Fitness Classification Using the Reflection Images by a Visible-Light One-Dimensional Line Image Sensor

In automatic paper currency sorting, fitness classification is a technique that assesses the quality of banknotes to determine whether a banknote is suitable for recirculation or should be replaced. Studies on using visible-light reflection images of banknotes for evaluating their usability have been reported. However, most of them were conducted under the assumption that the denomination and input direction of the banknote are predetermined; in other words, a pre-classification of the type of input banknote is required. To address this problem, we propose a deep learning-based fitness-classification method that recognizes the fitness level of a banknote regardless of the denomination and input direction of the banknote to the system, using reflection images of banknotes captured by a visible-light one-dimensional line image sensor and a convolutional neural network (CNN). Experimental results on the banknote image databases of the Korean won (KRW) and the Indian rupee (INR) with three fitness levels, and the United States dollar (USD) with two fitness levels, showed that our method gives better classification accuracy than other methods.


Introduction
The functionalities for sorting and classifying paper currency in automated transaction facilities, such as automated teller machines (ATMs) or counting machines, consist of the recognition of banknote types and denominations, counterfeit detection, serial number recognition, and fitness classification [1]. The fitness classification of banknotes is concerned with the evaluation of the banknotes' physical condition, such as staining, tearing, or bleaching. This task helps not only to determine whether a banknote is suitable for recirculation or should be replaced by a new one, but also to enhance the processing speed and sorting accuracy of the counting system.
The fitness of banknotes is normally classified based on the banknotes' optical characteristics captured by imaging sensors. In general, the appearances of banknotes differ among banknote types as well as between the front and back sides of the same banknote. As a result, the banknote fitness classification proposed in most previous studies was performed under the assumption that the input banknote's type, denomination, and input direction are known [1]. In the next section, we provide detailed explanations of the related work concerning banknote fitness classification.

Related Works
Studies on banknote fitness classification for various paper currencies have been reported. According to research by the Dutch central bank, De Nederlandsche Bank (DNB), based on evaluation using color imaging, soiling is the predominant factor degrading the quality of a banknote, and mechanical defects tend to appear after the banknote has become stained [2][3][4]. Therefore, several previous studies used the soiling level as the criterion for judging a banknote's fitness for further circulation [5]. Based on how the banknote images captured by single or multiple sensors are used, these approaches can be divided into two categories: methods using the whole banknote image and those using certain regions of interest (ROIs) on the banknote image for fitness classification. In the method proposed by Sun and Li [6], banknotes of different degrees of wear were considered to have different gray-level histograms; accordingly, histogram characteristics of the banknote images were used as features, with dynamic time warping (DTW) for histogram alignment and a support vector machine (SVM) for classifying the banknotes' age. Histogram features were also used in the research of He et al. [7], in which a neural network (NN) served as the classifier. An NN was also used in the Euro banknote recognition system proposed by Aoba et al. [8]: the whole banknote images captured by visible and infrared (IR) sensors were converted to multiresolutional input values and fed to a classification part using a three-layered perceptron and a validation part using radial basis function (RBF) networks [8]. In this system, new and dirty Euro banknotes are distinguished in the RBF network-based validation part. Recently, Lee et al. [9] proposed a soiled-banknote determination method based on morphological operations and Otsu's thresholding on contact image sensor (CIS) images of banknotes.
In ROI-based approaches, certain areas of the banknote image where degradation is frequently detected or clearly visible are selected for evaluating the fitness of the banknote. In the studies of Geusebroek et al. [3] and Balke et al. [10], the mean and standard deviation of the channel intensity values were calculated from overlapping rectangular regions on color images of Euro banknotes and used as features for assessing the soiling values of banknotes with the AdaBoost algorithm [3,10]. Mean and standard deviation values of wavelet-transformed ROIs were also the classification features in the method proposed by Pham et al. [11]: these features were extracted from areas of the banknote images containing little texture using the discrete wavelet transform (DWT), selected based on their correlation with densitometer data, and subsequently used for fitness classification by an SVM [11]. Regions with the least amount of texture were also selected for feature extraction in the study by Kwon et al. [12], which used features extracted from both visible-light reflection (VR) and near-infrared light transmission (NIRT) images of the banknotes, together with a fuzzy-based classifier, for the fitness classification system.
Methods that evaluate fitness based on certain regions of the banknote have the advantages of reduced input data size and processing time. However, the selection of ROIs in previous fitness-classification studies is mostly manual, and degradation and damage can occur in the unselected areas. Methods based on global features of the whole banknote image could help to solve this problem, but since their input features are mostly based on the brightness characteristics of the banknote images, they are strongly affected by illumination changes, sensor wavelengths, and variation in the patterns of different banknote types. Moreover, most fitness-classification studies assumed that the input banknote's type, denomination, and input direction are known [1].
To overcome these shortcomings, we considered a method for banknote fitness classification based on a convolutional neural network (CNN). This network structure was first introduced by LeCun et al. in their studies on handwritten character recognition [13,14], and has recently attracted considerable research interest [15], especially for image classification in the ImageNet large-scale visual recognition challenge (ILSVRC) [16][17][18][19]. However, little research has been conducted on the automatic sorting of banknotes using CNNs. Ke et al. proposed a banknote image defect detection method using a CNN [20]; however, that study focused only on the recognition of ink dots among banknote image defects, and did not specify the type of experimental banknote image dataset or judge the fitness of the examined banknotes for recirculation. Another recent CNN-based method, proposed by Pham et al. [21] to classify banknote type, denomination, and input direction, showed good performance even on a mixed dataset of multiple national currencies. Based on the evaluation of these state-of-the-art methods, we propose a deep learning-based banknote fitness-classification method using a CNN on gray-scale banknote images captured by a visible-light one-dimensional line image sensor. Our proposed system is designed to classify the fitness of a banknote into two or three levels, namely (i) fit and unfit, or (ii) fit, normal, and unfit for recirculation, depending on the banknote's country of origin, and regardless of the denomination and input direction of the banknote. Compared to previous studies, our proposed method is novel in the following aspects: (1) This is the first CNN-based approach for banknote fitness classification.
We performed training and testing of the CNN on banknote image databases of three national currencies comprising 12 denominations, confirming that the performance of our proposed method is robust to a variety of banknote types. (2) Our study carried out fitness determination on the United States dollar (USD), the Korean won (KRW), and the Indian rupee (INR), in which three fitness levels (fit, normal, and unfit for recirculation) are considered for the KRW and INR, whereas two levels (fit and unfit) are considered for the USD. (3) Our fitness recognition system can classify the fitness of a banknote regardless of the denomination and direction of the input banknote. As a result, pre-classification of the banknote image by denomination and input direction is not required, and there is only one trained fitness-classification model for each national currency. (4) We made our trained CNN model and databases publicly available to other researchers for fair comparisons with our method and databases. Table 1 gives a comparison between our research and previous studies. The details of the proposed banknote fitness-classification method are presented in Section 3. Experimental results and conclusions are given in Sections 4 and 5 of this paper, respectively.

Table 1. Comparison of the proposed method with previous studies on banknote fitness classification.

Methods using ROIs on the banknote image:
1. Assessing soiling values using the mean and standard deviation of intensity values from rectangular regions on banknote images and the AdaBoost algorithm [3,10].
2. Using DWT for feature extraction from ROIs on visible-light images of banknotes and classifying fitness by SVM [11].
3. Using a fuzzy system for fitness determination based on ROIs on VR and NIRT images of banknotes [12].
- Advantage: lower resource requirements owing to the small sizes of the processing areas and features.
- Drawback: defects and damage can occur in the non-selected regions of the banknote.

Methods using the whole banknote image:
1. Using the gray-scale histogram of banknote images and classifying fitness using DTW and SVM [6] or an NN [7].
2. Using multiresolutional features of visible and IR images of banknotes for recognition [8].
3. Soiling evaluation based on image morphological operations and Otsu's thresholding on banknote images [9].
- Advantage: makes use of all the available characteristics of the banknote image for fitness classification.
- Drawbacks: possible data redundancy at the input stage; histogram-based methods are affected by imaging conditions and variations in banknote patterns; pre-classification of the banknote's denomination and input direction is required.

Fitness classification using a CNN (proposed method):
- Advantage: pre-classification of the banknote's denomination and input direction is not required.
- Drawback: intensive training of the CNN is required.

Overview of the Proposed Method
The overall flowchart of the proposed method is shown in Figure 1. The input banknote image is captured and pre-processed. In this pre-processing step, the banknote region in the image captured by the visible-light one-dimensional line image sensor is segmented from the background and resized to a common size of 115 × 51 pixels, because all input images to the CNN must have the same size. The size-normalized banknote image is fed into the pre-trained CNN, and the fitness level is determined at the output of the network.


Acquisition and Pre-Processing of Banknote Image
For banknote image acquisition in this study, we used a commercial banknote counting machine with a visible-light one-dimensional line image sensor that has a resolution of 1584 pixels [12,22]. A line sensor was used instead of a conventional two-dimensional (area) image sensor because of the size limitation and the cost of the counting machine. When a banknote is input to the system, it is passed through the rollers inside the machine and illuminated by a visible-light light-emitting diode (LED), while the line sensor is triggered successively at high speed to capture line images of the banknote. The sensor is triggered 464 times for a KRW or INR banknote and 350 times for a USD banknote. By concatenating the captured line images, the acquired banknote image has a resolution of 1584 × 464 pixels for KRW and INR banknotes, or 1584 × 350 pixels for USD banknotes.
The four input directions of a banknote inserted into the counting machine are labeled A, B, C, and D: the front side in the forward direction, the front side in the backward direction, the back side in the forward direction, and the back side in the backward direction, respectively. Examples of KRW banknote images in the A to D directions are shown in Figure 2. The original image captured by the counting machine includes both the banknote region and the surrounding background. Using the corner detection algorithm built into the counting machine, we segment the banknote region from the background to isolate the area containing the meaningful information of the banknote, and to correct the displacement and rotation of the input banknote, as shown in Figure 2. The corner detection algorithm works as follows. Within the fixed ROI of the captured banknote image of Figure 2a–d, the upper boundary of the banknote is detected by scanning a one-dimensional edge-detection mask based on the first-order derivative [23] from the upper to the lower position at each horizontal position of the ROI. From this, candidate points of the upper boundary are detected, and the accurate boundary line is determined by a line-fitting algorithm [23] applied to these points. The same procedure is repeated to detect the lower, left, and right boundaries: the left boundary is found by scanning the same mask from left to right at each vertical position of the ROI, and the right boundary by scanning from right to left at each vertical position. The four boundary lines are thus located, and their four intersection points are taken as the corner points of the banknote. The segmented banknote images are then resized to a common size of 115 × 51 pixels before being input to the CNN in the next step.
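The boundary-scanning step described above can be sketched as follows. This is a minimal NumPy illustration of the idea (a first-order-derivative edge scan followed by least-squares line fitting), not the counting machine's built-in algorithm; the function names and the synthetic image are our own.

```python
import numpy as np

def find_boundary_points(gray):
    """For each column, scan a 1-D first-derivative mask from top to bottom
    and return the row with the strongest edge response (candidate point
    of the upper boundary)."""
    d = np.diff(gray.astype(float), axis=0)   # first-order derivative along rows
    return np.abs(d).argmax(axis=0)           # one candidate row per column

def fit_line(xs, ys):
    """Least-squares line fit through the candidate boundary points."""
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope, intercept

# Synthetic example: dark background above, bright banknote from row 10 down.
img = np.zeros((30, 20))
img[10:] = 200.0
rows = find_boundary_points(img)              # edge between rows 9 and 10
slope, intercept = fit_line(np.arange(img.shape[1]), rows)
```

The same scan, transposed or reversed, yields the lower, left, and right boundaries; intersecting the four fitted lines gives the corner points.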

The CNN Architecture
The CNN architecture used in our proposed method is shown in Figure 3 and Table 2. The network consists of five convolutional layers, denoted C1 to C5, followed by three fully connected layers, denoted F1 to F3, similar to the AlexNet architecture [16,21]. For faster training with gradient descent, rectified linear unit (ReLU) layers are placed after all of the convolutional and fully connected layers of the network [16]. Using the ReLU activation function of Equation (1), instead of the standard non-linear sigmoid or hyperbolic tangent functions of Equations (2) and (3), respectively, helps to avoid the gradient-vanishing effect [24]:

y = max(0, x) (1)

y = 1 / (1 + e^(-x)) (2)

y = (e^x - e^(-x)) / (e^x + e^(-x)) (3)

Local response normalization is applied at the first two convolutional layers (C1 and C2) using cross-channel normalization (CCN) layers [16,21], defined as follows:

b^i_{x,y} = a^i_{x,y} / ( k + α Σ_{j = max(0, i-n/2)}^{min(N-1, i+n/2)} (a^j_{x,y})^2 )^β (4)

where a^i_{x,y} is the neuron activity computed by applying the ith kernel at position (x, y), and b^i_{x,y} is the normalized activity obtained by normalizing over the n adjacent kernel maps at the same spatial position. In Equation (4), N is the total number of kernels in the layer. We chose a window channel size n of 5; the hyper-parameters k, α, and β were set to 1, 0.0001, and 0.75, respectively. The summation term multiplied by α can be zero when all the a^j_{x,y} are zero; therefore, the offset k keeps the denominator of Equation (4) non-zero. α acts as a control parameter: if the α-weighted summation term is much larger than k, then b^i_{x,y} approximates a^i_{x,y} divided by that term, with k ignored; conversely, if the summation term is much smaller than k, then b^i_{x,y} approximates a^i_{x,y}/k. β is also a control parameter: with a larger β, b^i_{x,y} becomes smaller, whereas with a smaller β it becomes larger. Following previous research [16], k, α, and β are called hyper-parameters, and their optimal values (1, 0.0001, and 0.75) were experimentally determined with the training data.

Each of the first and second convolutional layers is followed by a max pooling layer, and max pooling is also adopted after the last convolutional layer (C5) before the fully connected part of the network. The gray-scale banknote images in our proposed method are resized to 115 × 51 pixels using linear interpolation before being fed into the CNN. Through each layer of the network, the feature map sizes change as shown in Table 2 according to the following equations for the ith convolutional layer [21,25]:

w_i = (w_{i-1} - f + 2p) / s + 1 (5)

h_i = (h_{i-1} - f + 2p) / s + 1 (6)

c_i = k (7)

where w_i, h_i, and c_i denote the width, height, and number of channels of the feature map in the ith convolutional layer in pixels, and those of its preceding (i-1)th layer are denoted w_{i-1}, h_{i-1}, and c_{i-1}; the ith layer has k filters of spatial size f × f, the filtering stride is s pixels, and the zero-padding amount is p pixels.
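As an illustration, the cross-channel normalization of Equation (4) can be sketched in NumPy as follows. This is a didactic re-implementation assuming channel-first feature maps, not the code used in our experiments.

```python
import numpy as np

def local_response_norm(a, n=5, k=1.0, alpha=1e-4, beta=0.75):
    """Cross-channel (local response) normalization of Equation (4).
    a: feature maps of shape (N, H, W), where N is the number of kernels.
    Each activity a[i] is divided by a power of the sum of squared
    activities over the n adjacent kernel maps at the same position."""
    N = a.shape[0]
    b = np.empty_like(a, dtype=float)
    for i in range(N):
        lo = max(0, i - n // 2)            # clamp the channel window at the edges
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b
```

With k = 1 and small α, the denominator stays close to 1 unless the neighboring activities are large, matching the role of k and α described above.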
The resulting banknote feature map after five convolutional layers has the size of 6 × 2 × 128 = 1536, as shown in Table 2, and these features are fed into the fully connected layers of the network.
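The feature-map size computation for a convolutional layer can be sketched with a small helper. The filter size, stride, and padding used in the example are hypothetical, not the actual values of Table 2.

```python
def conv_output_size(w_prev, h_prev, f, s=1, p=0, num_filters=1):
    """Feature-map size after a convolutional layer, per Equations (5)-(7):
    width and height shrink with filter size f, grow with zero-padding p,
    and are divided by stride s; the channel count equals the filter count."""
    w = (w_prev - f + 2 * p) // s + 1
    h = (h_prev - f + 2 * p) // s + 1
    return w, h, num_filters

# Hypothetical example: the 115 x 51 input through a 7 x 7 filter, stride 2.
w, h, c = conv_output_size(115, 51, f=7, s=2, p=0, num_filters=16)
```

For these hypothetical settings the result is a 55 × 23 × 16 feature map; chaining such calls layer by layer reproduces the size column of a table like Table 2.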
To prevent overfitting, we inserted a dropout layer between the 2nd and 3rd fully connected layers, as shown in Table 2. Dropout is a regularization method that randomly disconnects neuron units from the network during training [16,26]. Here, p is the probability of maintaining a connection: for example, with 100 connections from a neuron unit and p = 0.65, 35 connections are randomly disconnected (65% of the connections are maintained). In this research, we chose p = 0.65; this optimal value was experimentally determined with the training data. To implement dropout, the input vector y to a network node is element-wise multiplied by a vector r of independent Bernoulli random variables, each of which is 1 with probability p and 0 otherwise, i.e., r ~ Bernoulli(p) [26]:

z = f(w · (r ∘ y) + b) (8)

where "∘" denotes element-wise multiplication, f(·) is the activation function, w the weights, b the bias, and z the output of the feed-forward operation of the neuron unit with dropout. For example, if y in Equation (8) has 100 components (y_1, y_2, ..., y_100), then r also has 100 components (r_1, r_2, ..., r_100); with p = 0.65, 65 components of r are 1 and the remaining 35 are 0. As mentioned above, the banknote features are completely extracted at the output of the final (5th) convolutional layer, so the fully connected layers that follow can be considered the classifier part of the CNN. The numbers of network nodes in the three fully connected layers (F1 to F3) are shown in Table 2. In this research, we classified banknote fitness into three levels for the KRW and INR, and two levels for the USD banknotes; as a result, the number of nodes in the last fully connected layer varies according to the selected national currency.
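A minimal sketch of the dropout feed-forward operation of Equation (8), assuming a ReLU activation for f; the vector sizes are illustrative only.

```python
import numpy as np

def dropout_forward(y, w, b, p=0.65, rng=None):
    """Feed-forward with dropout, Equation (8): z = f(w . (r * y) + b),
    where r ~ Bernoulli(p) keeps each input connection with probability p.
    Here f is taken to be ReLU, as used in the network's layers."""
    rng = rng or np.random.default_rng()
    r = rng.binomial(1, p, size=y.shape)   # r_j in {0, 1}, P(r_j = 1) = p
    return np.maximum(0.0, w @ (r * y) + b)

# Illustrative sizes: a 100-input node feeding 10 outputs.
rng = np.random.default_rng(0)
y = np.ones(100)
w = np.ones((10, 100))
b = np.zeros(10)
z = dropout_forward(y, w, b, p=0.65, rng=rng)
```

With p = 0.65, on average 65 of the 100 input components survive the mask on each forward pass.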
At the output stage of the CNN, we apply a normalized exponential (softmax) function [27], which transforms the real-valued outputs of the neuron units in F3 into values in the range (0, 1). These values can be interpreted as the probabilities that the input banknote belongs to the fitness classes corresponding to the network outputs. The softmax layer also helps to highlight the largest values and suppress the smaller ones among the set [21]. The softmax function applied to the node output values z_i is given in Equation (9):

p_i = e^(z_i) / Σ_{j=1}^{N} e^(z_j) (9)

Among the N fitness levels, the one corresponding to the maximum value of p_i (i = 1, ..., N) is taken as the fitness level of the input banknote image. In this research, the filter parameters of the convolutional layers and the network weights of the fully connected layers are trained separately for each national currency (KRW, INR, and USD), combining all denominations and input directions of the banknote images. By training the CNN model in this way, our proposed fitness-classification method does not require pre-classification of the denomination and direction of the banknote. The completely trained CNN models are stored in memory for use in the testing experiments.
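The softmax transformation of Equation (9) can be sketched as follows; subtracting the maximum before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z):
    """Normalized exponential of Equation (9): maps real-valued node
    outputs z_i to probabilities p_i in (0, 1) that sum to 1."""
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()

# Example: three fitness-class scores; the largest score wins.
p = softmax(np.array([2.0, 1.0, 0.1]))
predicted_class = int(p.argmax())
```

The predicted fitness level is simply the index of the largest p_i.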

Experimental Results
We used banknote fitness databases from three national currencies, the KRW, INR, and USD, for the experiments with our proposed method. The KRW banknote image database is composed of banknotes in two denominations, 1000 and 5000 won. The denominations in the INR database are 10, 20, 50, 100, 500, and 1000 rupees, and those for the USD are 5, 10, 50, and 100 dollars. Three levels of fitness (fit, normal, and unfit for recirculation) are assigned to the banknotes of each denomination in the cases of the KRW and INR, and two levels (fit and unfit) are defined for the USD banknotes in the experimental dataset. Examples of banknotes assigned to each fitness level are also shown.

The number of banknotes in each fitness level of the three national currency databases is given in Table 3. We made our trained CNN model and databases publicly available to other researchers through [28] for fair comparisons with our method and databases. We conducted the experiments using two-fold cross-validation: the dataset of banknote images from each national currency was randomly divided into two parts. In the first trial, one part was used for training and the other for testing; the process was repeated with the two parts swapped in the second trial. From the results of the two trials, we calculated the overall performance by averaging the two accuracies.
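The two-fold cross-validation procedure can be sketched as follows; `train_and_eval` is a hypothetical callback standing in for the CNN training and testing run, not part of our actual implementation.

```python
import numpy as np

def two_fold_accuracy(samples, train_and_eval, rng=None):
    """Two-fold cross-validation: randomly split the samples in half,
    train on one half and test on the other, swap the halves for the
    second trial, and average the two test accuracies."""
    rng = rng or np.random.default_rng()
    idx = rng.permutation(len(samples))
    half = len(idx) // 2
    folds = [idx[:half], idx[half:]]
    accs = [train_and_eval(train=folds[1 - i], test=folds[i]) for i in range(2)]
    return sum(accs) / 2.0
```

In our experiments the callback would train one CNN per trial on the training fold and report accuracy on the held-out fold.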
In this research, we trained the network models separately for each national currency dataset, without pre-classifying the denomination and input direction of the banknote images. For each dataset, we performed data augmentation to expand the number of training images; this helps to generalize the training data and reduce overfitting [21]. For data augmentation, we randomly cropped the boundaries of each original image by 1 to 7 pixels. The numbers of images in the KRW and INR datasets were increased by factors of 3 and 6, respectively; in the case of the USD, the numbers of fit and unfit banknote images were multiplied by 21 and 71. Consequently, the total number of training images in each national currency dataset was approximately 100,000. The numbers of images in each dataset and each class after augmentation are also listed in Table 3.
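The boundary-cropping augmentation can be sketched as follows; the function name and random generator are our own, and the subsequent resizing of each crop back to the 115 × 51 CNN input size is omitted here.

```python
import numpy as np

def random_boundary_crops(img, n_crops, max_crop=7, rng=None):
    """Augment one banknote image by randomly cropping 1 to max_crop
    pixels from each of the four boundaries, n_crops times. Each crop
    would then be resized back to the CNN input size (omitted)."""
    rng = rng or np.random.default_rng()
    crops = []
    for _ in range(n_crops):
        t, b, l, r = rng.integers(1, max_crop + 1, size=4)  # pixels per edge
        crops.append(img[t:img.shape[0] - b, l:img.shape[1] - r])
    return crops

# Example: triple one image, mirroring the KRW multiplication factor of 3.
rng = np.random.default_rng(2)
img = np.zeros((51, 115))
augmented = random_boundary_crops(img, n_crops=3, rng=rng)
```

Applying this with per-class multiplication factors (3, 6, 21, or 71) yields the augmented dataset sizes of Table 3.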
In the first experiments of the CNN training, we trained three network models for fitness classification in each of the national currency dataset, and repeated it twice for two-fold cross-validation. Training and testing experiments were performed using the MATLAB implementation of the CNN [29] on a desktop computer equipped with an Intel ® Core™ i7-3770K CPU @ 3.50 GHz [30], 16-GB memory, and an NVIDIA GeForce GTX 1070 graphics card with 1920 CUDA cores, and 8-GB GDDR5 memory [31]. The training method is stochastic gradient descent The number of banknotes in each fitness level of three national currency databases is given in Table 3. We made our trained CNN model with databases publicly available by other researchers through [28] for the fair comparisons with our method and databases. We conducted the experiments using the two-fold cross-validation method. Therefore, the dataset of banknote images from each national currency was randomly divided into two parts. In the first trial, one of the two parts was used for training, and the other was used for testing. The process was repeated with these parts of the dataset swapped in the second trial. With the obtained results from two trials, we calculated the overall performance by averaging two accuracies.
In this research, we trained the network models separately for each national currency dataset without pre-classifying the denomination and input direction of the banknote images in the dataset. In each dataset, we performed data augmentation for expanding the number or image for training. This process helps to generalize the training data and reduce overfitting [21]. For data augmentation, we randomly cropped the boundaries of the original image in the dataset in the range of 1 to 7 pixels. The number of images in the datasets of the KRW and INR were increased by multiplication factors of 3 and 6 times, respectively. In the case of the USD, the numbers of fit and unfit banknote images were multiplied by 21 and 71 times. Consequently, the total number of images for training in each national currency dataset was approximately 100,000 images. We also listed the number of images in each dataset and each class after augmentation in Table 3.
In the first experiments of the CNN training, we trained three network models for fitness classification in each of the national currency dataset, and repeated it twice for two-fold cross-validation. Training and testing experiments were performed using the MATLAB implementation of the CNN [29] on a desktop computer equipped with an Intel ® Core™ i7-3770K CPU @ 3.50 GHz [30], 16-GB memory, and an NVIDIA GeForce GTX 1070 graphics card with 1920 The number of banknotes in each fitness level of three national currency databases is given in Table 3. We made our trained CNN model with databases publicly available by other researchers through [28] for the fair comparisons with our method and databases. We conducted the experiments using the two-fold cross-validation method. Therefore, the dataset of banknote images from each national currency was randomly divided into two parts. In the first trial, one of the two parts was used for training, and the other was used for testing. The process was repeated with these parts of the dataset swapped in the second trial. With the obtained results from two trials, we calculated the overall performance by averaging two accuracies.
In this research, we trained the network models separately for each national currency dataset without pre-classifying the denomination and input direction of the banknote images in the dataset. For each dataset, we performed data augmentation to expand the number of images for training; this helps to generalize the training data and reduce overfitting [21]. For data augmentation, we randomly cropped the boundaries of the original images in the dataset in the range of 1 to 7 pixels. The numbers of images in the KRW and INR datasets were thereby increased by factors of 3 and 6, respectively. In the case of the USD, the numbers of fit and unfit banknote images were multiplied by 21 and 71, respectively. Consequently, the total number of training images in each national currency dataset was approximately 100,000. The number of images in each dataset and each class after augmentation is also listed in Table 3.
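The boundary-cropping augmentation can be sketched as below. This is a minimal interpretation of "randomly cropped the boundaries in the range of 1 to 7 pixels" (an independent margin per side); the exact cropping scheme and the image size used here are assumptions.

```python
import numpy as np

def random_boundary_crop(img, rng, max_margin=7):
    """Remove a random 1-7 pixel margin from each side of the image,
    producing a slightly shifted variant for data augmentation."""
    h, w = img.shape[:2]
    top, bottom, left, right = rng.integers(1, max_margin + 1, size=4)
    return img[top:h - bottom, left:w - right]

rng = np.random.default_rng(42)
img = np.zeros((115, 51), dtype=np.uint8)  # illustrative size only
aug = random_boundary_crop(img, rng)       # each side loses 1-7 pixels
```

Repeating the crop with different random margins yields the 3x to 71x multiplication factors described above.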
In the first experiments of the CNN training, we trained three network models for fitness classification, one for each national currency dataset, and repeated the training twice for two-fold cross-validation.
Training and testing experiments were performed using the MATLAB implementation of the CNN [29] on a desktop computer equipped with an Intel® Core™ i7-3770K CPU @ 3.50 GHz [30], 16 GB of memory, and an NVIDIA GeForce GTX 1070 graphics card with 1920 CUDA cores and 8 GB of GDDR5 memory [31]. The training method is stochastic gradient descent (SGD), also known as sequential gradient descent, in which the network parameters are updated based on a batch of data points at a time [27]. The CNN training parameters were set as follows: the number of training iterations is 60 epochs, with an initial learning rate of 0.01 that is reduced by 10% every 20 epochs. The convergence graphs of the average batch loss and accuracy according to the epoch number of the training process on the two subsets of training data in the two-fold cross-validation are shown in Figure 7 for each country's banknote dataset. Figure 7 shows that the accuracy values increase to 100% and the loss curves approach zero with increasing epoch number in all cases.
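The learning-rate schedule above can be written as a simple step function. This sketch reads "reduced by 10%" literally as multiplication by 0.9 every 20 epochs; that reading, and the function name, are assumptions.

```python
def learning_rate(epoch, initial_lr=0.01, drop=0.9, step=20):
    """Step schedule: multiply the rate by 0.9 (a 10% reduction)
    after every 20 completed epochs."""
    return initial_lr * drop ** (epoch // step)

# epochs 0-19 use 0.01, epochs 20-39 use 0.009, epochs 40-59 use 0.0081
```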
In Figure 8, we show the 96 trained filters in the first convolutional layer of the trained CNN models for each national currency dataset using two-fold cross-validation. For visualization, the original 7 × 7 × 1 pixel filters were resized by a factor of 5, and the weight values were scaled to the unsigned integer range of 0 to 255, corresponding to gray-scale image intensity values. With the trained CNN models, we conducted testing experiments on the datasets of each national currency, combining all the denominations and input directions of the banknote images. The experimental results of the two-fold cross-validation using the CNN for each dataset are shown in Tables 4-6, expressed as confusion matrices between the desired and predicted outputs, namely the actual fitness levels of the banknotes and the fitness-classification results of the trained CNN models.
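The filter visualization used for Figure 8 can be sketched as follows; the linear rescaling and pixel-replication resize shown here are assumptions about how the figure was produced.

```python
import numpy as np

def filter_to_image(weights, scale=5):
    """Map a filter's weights linearly onto [0, 255] and enlarge it
    by pixel replication for gray-scale display."""
    w = weights - weights.min()
    if w.max() > 0:
        w = w / w.max()
    img = (w * 255).astype(np.uint8)
    # Kronecker product with a block of ones replicates each pixel.
    return np.kron(img, np.ones((scale, scale), dtype=np.uint8))

filt = np.random.default_rng(0).normal(size=(7, 7))  # stand-in 7x7 filter
vis = filter_to_image(filt)                          # 35x35 image
```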
From the testing results on the two subsets, we calculated the average accuracy based on the number of accurately classified cases in each subset using the following formula [32]:

Avr_Acc = (GA1 + GA2) / N × 100 (%)

where Avr_Acc is the average testing accuracy over the total N samples in the dataset, and GA1 and GA2 are the numbers of accurately classified samples (genuine acceptance cases) from the 1st and 2nd folds of the cross-validation, respectively.

Table 4. Confusion matrices of the testing results on the KRW banknote fitness dataset using the proposed method. The 1st Testing Results and 2nd Testing Results mean the results of the testing on the 1st and 2nd subsets of banknote images in the two-fold cross-validation method, respectively (unit: %).

Table 5. Confusion matrices of the testing results on the INR banknote fitness dataset using the proposed method (average accuracy: 99.637%). The 1st Testing Results and 2nd Testing Results mean the same as those in Table 4 (unit: %).

Table 6. Confusion matrices of the testing results on the USD banknote fitness dataset using the proposed method. The 1st Testing Results and 2nd Testing Results mean the same as those in Table 4 (unit: %).

Tables 4-6 show that the proposed CNN-based method yields good performance, with average two-fold cross-validation testing accuracies of approximately 97% in the cases of the KRW and USD, and more than 99% in the case of the INR, even with the merged denominations and input directions of the banknote images in each dataset.
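The fold-averaged accuracy is a simple ratio of correct counts to the total sample size, and can be computed as below; the example figures are illustrative, not from the paper.

```python
def average_accuracy(ga1, ga2, n_total):
    """Avr_Acc = (GA1 + GA2) / N x 100, where GA1 and GA2 are the
    numbers of correctly classified samples in the two folds."""
    return (ga1 + ga2) / n_total * 100.0

# e.g. 4980 + 4975 correct out of 10,000 samples -> roughly 99.55%
acc = average_accuracy(4980, 4975, 10000)
```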

In Figure 9, we show examples of correctly classified cases from the testing results of our proposed method on the KRW, INR, and USD datasets. Figure 9 shows that the degrees of degradation of the INR banknotes are more clearly distinguishable among the fit, normal, and unfit classes than those of the KRW. Furthermore, the visible-light banknote images captured in the case of the USD have slightly lower brightness than those of the KRW and INR. This explains why the average classification accuracy of our proposed method was highest on the INR dataset compared to the KRW and USD. Examples of error cases are given in Figures 10-12 for each of the national currency datasets. As shown in these figures, there were cases where the input banknotes were incorrectly segmented from the background, as in Figures 10a and 11d. This resulted in the banknotes being classified into classes of lower fitness levels. Figures 10c and 11c show that stained and soiled areas occurred sparsely on the banknotes and occasionally could not be recognized using only visible-light images, as in our method. The banknote images in Figure 11a,b are from the fit and normal classes, respectively; however, besides their similar brightness, both banknotes were slightly folded on the upper parts, which affected the classification results. The fit USD banknote in Figure 12a has hand-written marks, whereas the degradation of the unfit banknote in Figure 12b is a fading of the texture in the middle of the banknote rather than staining or soiling. These factors caused the misclassification of fitness levels in these cases. In addition, the average classification accuracy of the normal banknotes was the lowest among the three fitness levels in the cases of the INR and KRW. This is because normal banknotes have a middle quality level: they are more stained or partly damaged than fit banknotes, but not enough to be replaced by new ones, as in the case of unfit banknotes. Consequently, the largest confusions occurred between the normal class and either the fit or unfit class.

In the subsequent experiments, we compared the performance of the proposed method with that of the previous studies reported in [7,11]. As both of the previous methods require training, we also performed two-fold cross-validation in the comparative experiments. Following [7], we extracted features from the gray-level histogram of the banknote image and used a multilayer perceptron (MLP) network as the classifier, with 95 network nodes in the input and hidden layers. In the comparative experiments using the method in [11], we selected areas containing less texture on the banknote images as regions of interest (ROIs), and calculated the means and standard deviations of the ROIs' Daubechies wavelet decompositions. Because the fitness classifiers in [11] are SVMs, in the cases of the KRW and INR datasets, which have three fitness levels, we trained the SVM models using the one-against-all strategy [33]. The experiments with the previous methods were implemented using MATLAB toolboxes [34,35].
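The one-against-all strategy for the three fitness levels can be sketched as follows. This uses scikit-learn rather than the MATLAB toolboxes of the comparative experiments, and the synthetic features standing in for the wavelet mean/standard-deviation features are an assumption.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 8))      # stand-in for wavelet mean/std features
y = np.repeat([0, 1, 2], 30)      # fit / normal / unfit labels
X[y == 1] += 2.0                  # separate the synthetic classes
X[y == 2] += 4.0

# One binary SVM per fitness level, each trained against the rest.
clf = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)
pred = clf.predict(X)
```

For the two-level USD dataset a single binary SVM suffices, so the one-against-all wrapper is only needed for the KRW and INR.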
A comparison of the experimental results between our proposed method and those of the previous studies is shown in Tables 7-9, in which the fitness-classification accuracies are calculated separately according to the denominations and input directions of the banknote images in each national currency. This is because, in the previous studies, the fitness-classification models were trained on these manually separated types of banknote images. Therefore, although our proposed method does not require pre-classification of the denominations and input directions of the banknote images, we report the accuracies separately according to these categories for comparison. Tables 7-9 show that the proposed CNN-based fitness-classification method outperformed the previous methods, with higher average classification accuracy on all the national currency datasets. This can be explained by the disadvantages of each previous method. The histogram-based method used only the overall brightness characteristics of the banknote images to classify fitness levels; this feature is strongly affected by the capturing conditions of the sensors. Moreover, degradation may occur sparsely on a banknote, and therefore cannot easily be recognized from the brightness histogram alone. The ROI-based method in [11] relied only on the less-textured areas of the banknote images; consequently, if the degradation or damage occurs in other areas, it is not as effective as the proposed method. The CNN-based method has the advantage of training not only the classifier in the fully connected layers but also the filter weights in the convolutional layers, which can be considered the feature-extraction part. As a result, both the feature-extraction and classification stages were intensively trained on the training datasets.
Moreover, when the whole banknote image is input to the CNN architecture, all of the available optical characteristics of the banknote can be used for feature extraction. Consequently, owing to these advantages in the feature-extraction procedure, the proposed fitness-classification method gave better performance than the previous methods, in terms of higher average accuracy with two-fold cross-validation.

Table 9. Comparison of fitness-classification accuracy by our proposed method with that of previous studies on the USD banknote dataset. Denom., Dir., 1st Testing Results, and 2nd Testing Results mean the same as those in Table 7 (unit: %).

Conclusions
This study proposed a fitness-classification method using visible-light banknote images and a CNN. The fitness of the banknotes is assigned to three levels in the cases of the KRW and INR, and to two levels for the USD banknotes. Our proposed method is designed to classify the fitness level regardless of the denominations and input directions of the banknote images. The experimental results on the three datasets of KRW, INR, and USD banknote images with merged denominations and input directions showed good performance, and demonstrated that the proposed method outperformed the methods of previous studies in terms of higher average accuracy with two-fold cross-validation. For future work, we plan to test the proposed method on banknotes from other countries. We also intend to further study a multinational fitness-classification method that can simultaneously recognize the fitness level of banknotes from multiple countries.