Pedestrian Detection Based on Adaptive Selection of Visible Light or Far-Infrared Light Camera Image by Fuzzy Inference System and Convolutional Neural Network-Based Verification

A number of studies have been conducted to enhance the pedestrian detection accuracy of intelligent surveillance systems. However, detecting pedestrians under outdoor conditions is a challenging problem due to the varying lighting, shadows, and occlusions. In recent times, a growing number of studies have been performed on visible light camera-based pedestrian detection systems using a convolutional neural network (CNN) in order to make the pedestrian detection process more resilient to such conditions. However, visible light cameras still cannot detect pedestrians during nighttime, and are easily affected by shadows and lighting. Many studies have therefore addressed CNN-based pedestrian detection with far-infrared (FIR) light cameras (i.e., thermal cameras). However, when the solar radiation increases and the background temperature reaches the same level as the body temperature, it remains difficult for the FIR light camera to detect pedestrians due to the insignificant difference between the pedestrian and non-pedestrian features within the images. Researchers have been trying to solve this issue by feeding both the visible light and the FIR camera images into the CNN. This, however, takes a longer time to process, and makes the system structure more complex as the CNN needs to process both camera images. This research adaptively selects the more appropriate candidate between two pedestrian images from visible light and FIR cameras based on a fuzzy inference system (FIS), and the selected candidate is verified with a CNN. Three types of databases were tested, taking into account various environmental factors using visible light and FIR cameras. The results showed that the proposed method performs better than the previously reported methods.


Introduction
A number of studies are currently being conducted to increase the accuracy of pedestrian detection schemes as intelligent surveillance systems advance. In the past, visible light cameras were widely used [1][2][3][4][5][6][7]. However, these cameras are quite vulnerable to factors such as varying shadows and lighting, and cannot accurately detect pedestrians during nighttime. To address such constraints, numerous studies on pedestrian detection systems using far-infrared (FIR) light cameras (thermal cameras) are being conducted [7][8][9][10]. However, pedestrian detection remains difficult because the difference between the pedestrian and the non-pedestrian areas decreases when solar radiation raises the air temperature to the body temperature level. In order to address such issues, researchers have been exploring methods that use both visible light and FIR camera images. These include a method of selecting between the visible-light and thermal-infrared images under dynamic environments, as presented in [11], and methods of detecting pedestrians by combining these two images [12][13][14].
However, these methods may increase the processing time and computational complexity as they have to take into account both visible light and FIR camera images, and run the convolutional neural network (CNN) twice [13]. In order to overcome these limitations, our research suggests a method that is able to detect pedestrians under varying conditions. The proposed method is more reliable than a single camera-based method, reduces the complexity of the algorithm, and requires less processing time compared to the methods using both visible light and FIR camera images. This is because our method adaptively selects one candidate between two pedestrian candidates derived from visible light and FIR camera images based on a fuzzy inference system (FIS). To enhance the detection accuracy and processing speed, only the selected candidate is verified by the CNN.
The scenario where our system can be applied is the pedestrian detection by intelligent surveillance cameras in outdoor environments. Therefore, all the experimental datasets were collected considering this environment as shown in Section 4.1. The detected position of pedestrians by our method at various times and in different environments can be used as basic information for face recognition, behavior recognition, and abnormal pedestrian case detection, which are necessary for crime and terror prevention, and the detection of emergency situations where a person suddenly falls down on the street and does not move. The following Section 2 looks extensively into various pedestrian detection scheme studies.

Related Works
The pedestrian detection studies that are available to date can be divided into two groups: (a) single camera-based methods (infrared or visible-light cameras) [6,10][15][16][17][18][19][20][21][22], and (b) multiple camera-based methods [11][12][13][22][23][24]. The former group includes the following methods: (i) the adaptive boosting (AdaBoost) cascade-based method, which is widely used as the representative facial detection scheme [25,26], (ii) the histogram of oriented gradient-support vector machine (HOG-SVM) method [18], (iii) the integral HOG method [19], whose processing speed was reported to be significantly faster than the existing HOG, (iv) a neural network-based method using the receptive field approach [27] for pedestrian detection [20], and (v) methods based on background generation with FIR cameras [21]. However, these single camera-based methods share a common constraint: their detection performance degrades when their surroundings vary. For instance, a visible light camera-based method barely detects pedestrians during dark nights, and is affected by varying shadows and lighting. Similarly, an FIR camera-based method cannot detect pedestrians when bright sunshine increases the ground temperature up to the body temperature level.
To address these issues, studies on CNN-based pedestrian detection are being conducted. John et al. used an FIR camera to study how to detect pedestrians based on adaptive fuzzy c-means clustering and a CNN [10]. Considering the daytime and the nighttime conditions, the researchers suggested a more resilient algorithm. This work, however, did not include experiments under conditions where the background air temperature was similar to that of the pedestrians. In the study of pedestrian detection with a CNN [6], the authors showed that the large margin CNN method outperformed the SVM method in pedestrian detection using a visible light camera. However, this study did not include experiments on images under varying environmental factors, such as varying lighting and shadows. Such CNN-based pedestrian detection methods showed better performance than the previously studied methods, but they still failed to overcome the limitations associated with varying environmental conditions, such as varying lighting and shadows, and the cases where the background had the same temperature as the pedestrians.
To address the above limitations, multiple camera-based detection methods have also been studied. In a study involving multi-cue pedestrian detection and moving vehicle tracking [23], the authors proposed a stereo visible light camera-based pedestrian detection method that employs shape and texture information. Bertozzi et al. proposed a pedestrian detection method based on tetra-vision using visible light and FIR camera images [24]. It used a vehicle's headlights and a combination of visible light and FIR camera images for pedestrian detection purposes. This method was validated for nighttime conditions, but it took a longer time to process. Another study on a multi-spectral pedestrian detection method [22] using both visible light and near-infrared (NIR) light camera images was conducted using HOG-SVM. In contrast, Serrano-Cuerda et al. conducted a study on pedestrian detection systems under a more diverse environmental setting than the aforementioned studies [11]. As the detection performance of the cameras appeared vulnerable to the weather and environmental conditions, the study used confidence measures (based on the mean lighting and standard deviation information) to select the more appropriate images from visible light and FIR camera images.
Lee et al. combined visible-light and FIR camera-produced pedestrian data based on difference images, and suggested a method for detecting pedestrians [12]. However, the methods discussed in [11] and [12] may have lower performance, as no final verification step was provided in those publications. In addition, Wagner et al. suggested two methods in their study [13]. The first was an early fusion CNN method, which fused the visible light and FIR images before feeding them to the CNN as inputs. The second, called the late fusion CNN-based method, trained the pedestrian and the background domains separately (each from visible light and FIR images), and fused the features collected from the fully connected layers. Of the two, the latter showed better performance. However, this method may increase the processing time and computational complexity as it has to take into account both visible light and FIR camera images, and run the CNN twice.
In order to overcome these limitations, this paper suggests a method that is able to detect pedestrians under varying conditions. It is novel in the following three ways compared to the previously published works:
- The proposed method is more reliable than a single camera-based method, reduces the complexity of the algorithm, and requires less processing time compared to the methods using both visible light and FIR camera images. This is because our method adaptively selects one candidate between two pedestrian candidates derived from visible light and FIR camera images based on a fuzzy inference system (FIS).
- The two input features of the FIS vary depending on whether the input candidate image is a pedestrian or a non-pedestrian (background). Therefore, to remove such uncertainty, this study applies Gaussian fitting to the distribution of the gradient-based features of the input candidate images, and applies the weights resulting from the fitted Gaussian distributions to the FIS output. This enables a more accurate and adaptive selection by the FIS regardless of whether the image is of the pedestrian or the non-pedestrian type.
- It increases the accuracy of pedestrian detection by verifying the FIS-selected pedestrian candidate through the CNN. In addition, we have made our database and trained CNN model available to other researchers so that performances can be compared.
Table 1 shows a comparison of the proposed and the previously researched pedestrian detection methods, including their respective advantages and disadvantages. The remainder of this paper consists of the following sections: Section 3 presents the details of the concepts behind the proposed system. The experimental results and various performance comparisons (among the existing methods) are presented in Section 4. Finally, Section 5 provides our conclusions.
Table 1. Comparison of pedestrian detection methods, with their advantages and disadvantages.

HOG-SVM [18,22], integral HOG [19], neural network based on receptive fields [20], and background generation [21]
- Advantages: More resilient in simple conditions. Faster processing speed than multiple camera-based algorithms.
CNN-based method [6,10]
- Advantage: More accurate than the past single camera-based methods.
Stereo visible light cameras: Shape and texture information [23]
- Advantage: Better pedestrian detection, as it is able to utilize more information than the single camera-based method.
- Disadvantage: Longer time to process, as it has to process both camera images.
Visible light & NIR cameras: HOG-SVM [22]
Visible light & FIR cameras: Tetra-vision-based HOG-SVM [24]
- Advantage: Better night vision pedestrian detection inside the car.
- Disadvantages: Does not work without vehicle headlights. A large number of calculations is required, as it needs to process two camera images.
Visible light & FIR cameras: Camera selection [11]
- Advantage: Better performance under various conditions.
- Disadvantage: Detection capability is affected, as it has no final verification process for the detected pedestrian area.
Visible light & FIR cameras: Difference image-based fusion [12]
Visible light & FIR cameras: Late fusion CNN-based method [13]
- Advantage: Higher CNN-based detection accuracy.
- Disadvantage: Processing time and algorithm complexity increase, as the method processes input from two camera images and runs the CNN twice.

Figure 1 describes the overall procedure of the proposed system. The system receives the data from both visible light and FIR light images through dual cameras (step (1) and Figure 2a). It detects the candidate based on background subtraction and noise reduction by using difference images (Figure 2b) between the background image and the input images [12]. Here, the mean value of the candidate within the difference image obtained from the visible light image is "feature 1", and that obtained from the FIR light image is "feature 2". In general, the mean value of the difference image increases as the difference between the pedestrian and the background increases, which raises the likelihood that the candidate is a correct pedestrian. However, as shown in Figure 2c, the output candidate exists not only in the red box (pedestrian candidate) but also in the yellow box (non-pedestrian candidate).
As mentioned earlier, the pedestrian candidate usually has a high mean value in the difference image while the non-pedestrian candidate has a low mean value, as shown in Figure 2b. Nevertheless, because not all regions inside the pedestrian candidate show a high mean value in the difference image of Figure 2b, a low threshold value for image binarization should be used to correctly detect the whole region inside the pedestrian candidate, which causes the incorrect detection of a non-pedestrian candidate as a pedestrian one, as shown in Figure 2c. Because it is difficult to correctly discriminate between the pedestrian and non-pedestrian candidates, the FIS is designed using the mean value of the gradient magnitude of the pedestrian or non-pedestrian candidate in the difference images as "feature 3". The system adaptively selects the more appropriate candidate to be verified by the CNN between the two boxes of Figure 2c, after adding "feature 3" as weights and using the FIS with "feature 1" and "feature 2" as inputs (see step (3) of Figure 1). Then, it uses the selected pedestrian or non-pedestrian candidate (Figure 2d) as the input to the pre-trained CNN, which ultimately classifies it as a pedestrian or non-pedestrian case (see step (4) of Figure 1 and Figure 2e).
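The three candidate features described above can be illustrated with a minimal sketch. It assumes 8-bit grayscale images stored as nested lists, a candidate box given as (x, y, width, height), scaling to [0, 1] by dividing by 255, and forward differences for the gradient. The function names and all of these choices are illustrative assumptions, not the implementation used in this research.

```python
import math

def difference_image(frame, background):
    """Absolute per-pixel difference between an input frame and the background."""
    return [[abs(f - b) for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(frame, background)]

def mean_in_box(image, box):
    """Mean pixel value inside box = (x, y, w, h), scaled to [0, 1] (assumed scaling)."""
    x, y, w, h = box
    values = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return sum(values) / (len(values) * 255.0)

def mean_gradient_in_box(image, box):
    """Mean gradient magnitude inside the box (forward differences; needs w, h >= 2)."""
    x, y, w, h = box
    total = 0.0
    for r in range(y, y + h - 1):
        for c in range(x, x + w - 1):
            gx = image[r][c + 1] - image[r][c]
            gy = image[r + 1][c] - image[r][c]
            total += math.hypot(gx, gy)
    return total / ((w - 1) * (h - 1) * 255.0)

# Toy 6x6 example: a bright 2x2 object on an empty background.
background = [[0] * 6 for _ in range(6)]
frame = [row[:] for row in background]
for r in (2, 3):
    for c in (2, 3):
        frame[r][c] = 255

diff = difference_image(frame, background)
feature_tight = mean_in_box(diff, (2, 2, 2, 2))   # box exactly on the object
feature_loose = mean_in_box(diff, (0, 0, 6, 6))   # box covering the whole image
gradient_loose = mean_gradient_in_box(diff, (0, 0, 6, 6))
```

In this toy case the tight box yields a larger mean difference than the loose box, which mirrors the observation that a larger pedestrian-background difference gives a larger "feature 1" or "feature 2". Applying `mean_in_box` to the visible light and FIR difference images separately would give the two FIS inputs, and `mean_gradient_in_box` corresponds to "feature 3".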

Adaptive Selection by FIS
The FIS in this paper is designed to adaptively select, between the two pedestrian candidates derived from visible light and FIR camera images, the one deemed most appropriate for the pedestrian detection process. Table 2 presents the fuzzy rule table designed in this research for the FIS. This research uses two input features, each taking the values "Low" and "High", and three output values, "Low", "Medium", and "High". The two features are "feature 1", the mean value of the candidate obtained from the visible light image, and "feature 2", the mean value obtained from the FIR light image. These features are used because, in general, the bigger the difference between the pedestrian and the background, the bigger the mean value in the difference image, meaning that the candidate is more likely to be a correct pedestrian.
For instance, as listed in Table 2a, when "feature 1" and "feature 2" are "Low" (a lower mean value) and "High" (a higher mean value), respectively, the difference between the pedestrian and the background in the FIR light image is larger than that in the visible light image. Therefore, the output value becomes "High", meaning that the candidate of the FIR light image is selected. The opposite case implies that the difference in the visible light image is larger than that in the FIR light image, so the output value becomes "Low", meaning that the candidate of the visible light image is selected. If "feature 1" and "feature 2" are both "Low" or both "High", it is difficult to determine which of the two candidates (from the visible light or the FIR light image) is more desirable, so the output is given a "Medium" value.
However, as shown in Figure 2c, the selected candidate is present not only in the pedestrian candidate (the red box) but also in the non-pedestrian candidate (the yellow box). Although the pedestrian candidate has a high mean value in the difference image as mentioned before, the non-pedestrian candidate has a low mean value, as shown in Figure 2b. Considering that, this study designs the rule table for non-pedestrian features, shown in Table 2b, to have the opposite features from Table 2a.
In general, when the FIS uses two inputs, it employs the IF-THEN rule [28], and the output is produced by an AND or OR calculation depending on the relationship between the FIS inputs. This research selected the AND calculation among the IF-THEN rules because the FIS makes its adaptive selection while considering "feature 1" and "feature 2" together. Figure 3 describes the linear membership function used in this research, which is widely used in FIS because its calculation is very fast and its algorithm is less complex than that of a non-linear membership function [29][30][31]. As mentioned, the input images have pedestrian and non-pedestrian categories, and two fuzzy rule tables (see Table 2) were designed to reflect the differences in their features. Accordingly, two input membership functions were used: one for the pedestrian and the other for the non-pedestrian. In order to determine the shape of the linear input membership functions more accurately, this study obtained the data distributions of "feature 1" and "feature 2" (see Figure 3a,b) by using part of the training data of the CNN (to be illustrated in Section 3.3). Based on these distributions, the "Low" and "High" linear input membership functions are designed separately for the pedestrian and the non-pedestrian cases. Also, as shown in Figure 3c, the output membership functions were designed for three outputs, "Low", "Medium", and "High". Figure 3c is not related to the data of Figure 3a,b. In a conventional fuzzy inference system, the output membership function is usually designed heuristically. Therefore, we use three linear membership functions, which have been widely used in fuzzy inference systems.
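A linear input membership function of the kind described above can be sketched as follows. The breakpoints are hypothetical placeholders: in this research they are derived from the data distributions of Figure 3a,b, which are not reproduced here, and the function names are illustrative only.

```python
def linear_low(x, start, end):
    """'Low' membership: 1 up to `start`, falling linearly to 0 at `end`."""
    if x <= start:
        return 1.0
    if x >= end:
        return 0.0
    return (end - x) / (end - start)

def linear_high(x, start, end):
    """'High' membership: 0 up to `start`, rising linearly to 1 at `end`."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)

# Hypothetical breakpoints (assumption): both functions span the range [0, 1].
LOW_RANGE = (0.0, 1.0)
HIGH_RANGE = (0.0, 1.0)

def fuzzify(x, low_range=LOW_RANGE, high_range=HIGH_RANGE):
    """Return the 'Low' (L) and 'High' (H) membership degrees of a feature value."""
    return {"L": linear_low(x, *low_range), "H": linear_high(x, *high_range)}

memberships = fuzzify(0.25)
```

Because the "Low" and "High" functions take separate breakpoints, a pedestrian and a non-pedestrian version can be obtained simply by passing different ranges to `fuzzify`.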
The "feature 1" (f1) and "feature 2" (f2) in this research can each be "Low" or "High", as shown in Table 2. Therefore, their outputs become $(G_{f1}^{L}(f1), G_{f1}^{H}(f1))$ and $(G_{f2}^{L}(f2), G_{f2}^{H}(f2))$ according to the input membership functions ($G_{f1}^{L}(\cdot)$, $G_{f1}^{H}(\cdot)$, $G_{f2}^{L}(\cdot)$, and $G_{f2}^{H}(\cdot)$) of Figure 3a,b. Four pairs of combinations are obtained from these: $(G_{f1}^{L}(f1), G_{f2}^{L}(f2))$, $(G_{f1}^{L}(f1), G_{f2}^{H}(f2))$, $(G_{f1}^{H}(f1), G_{f2}^{L}(f2))$, and $(G_{f1}^{H}(f1), G_{f2}^{H}(f2))$. Applying the Max and Min rules [29] together with the fuzzy rule table of Table 2 yields four inference values from the four pairs of combinations.
For instance, when f1 = 0.7 and f2 = 0.5 as shown in Figure 4, the output values of the input membership functions become (0.24(L), 0.75(H)) for f1 and (0.68(L), 0.32(H)) for f2, and the four combinations are (0.24(L), 0.68(L)), (0.24(L), 0.32(H)), (0.75(H), 0.68(L)), and (0.75(H), 0.32(H)). For (0.24(L), 0.68(L)), applying the Min rule and the fuzzy rule of Table 2b (IF "Low" and "Low", THEN "Medium") gives an inference value of 0.24(M). For (0.75(H), 0.68(L)), applying the Max rule and the fuzzy rule of Table 2a (IF "High" and "Low", THEN "Low") gives an inference value of 0.75(L). Likewise, the inference values resulting from the four combinations are listed in Tables 3 and 4.

Table 3. An example of the inference value produced by Min and Max rules with the fuzzy rule table of Table 2a.

Table 4. An example of the inference value produced by Min and Max rules with the fuzzy rule table of Table 2b.

The smallest of maximum (SOM) and largest of maximum (LOM) methods determine the final output values using the smallest and largest values, respectively, that are obtained by maximum inference. The middle of maximum (MOM) method uses the average of the smallest and largest values as the final output value. Figure 5a is an example of a defuzzification process based on the inference values by the Max rule of Table 3 (0.32(H), 0.68(M), 0.75(L), and 0.75(M)). Only 0.75(L) and 0.75(M) are used in this figure because they are the maximum inference values. Therefore, as shown in Figure 5a, two output values (0.13 and 0.62) are produced by the SOM and LOM methods, and their average, 0.375 (= (0.13 + 0.62)/2), is obtained by the MOM method.
Bisector and centroid methods determine the FIS output value by using all the inference values. The centroid method determines the FIS output value as the geometric center of the area (the purple area of Figure 5a) defined by all the inference values. The bisector method identifies the FIS output value based on the line that divides the defined area into two regions of equal size. Figure 5b is an example of a defuzzification process based on the inference values by the Min rule of Table 4.
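The Min and Max inference steps in the worked example can be checked with a short sketch. It encodes only the rules of Table 2a as they are described in the text (Low/High gives High, High/Low gives Low, equal labels give Medium) and uses the membership values quoted for f1 = 0.7 and f2 = 0.5. The function and variable names are illustrative assumptions.

```python
from itertools import product

# Fuzzy rules of Table 2a as described in the text:
# (feature 1 label, feature 2 label) -> output label.
RULES_TABLE_2A = {
    ("L", "L"): "M",
    ("L", "H"): "H",
    ("H", "L"): "L",
    ("H", "H"): "M",
}

def inference_values(memberships_f1, memberships_f2, rules, combine):
    """Combine each pair of membership degrees with `combine` (min or max)
    and attach the output label given by the rule table."""
    results = []
    for (label1, value1), (label2, value2) in product(
            memberships_f1.items(), memberships_f2.items()):
        results.append((combine(value1, value2), rules[(label1, label2)]))
    return results

# Membership degrees quoted in the text for f1 = 0.7 and f2 = 0.5.
f1_memberships = {"L": 0.24, "H": 0.75}
f2_memberships = {"L": 0.68, "H": 0.32}

max_rule = inference_values(f1_memberships, f2_memberships, RULES_TABLE_2A, max)
min_rule = inference_values(f1_memberships, f2_memberships, RULES_TABLE_2A, min)

# Maximum inference: keep only the entries that share the largest value,
# which are the ones the SOM, LOM, and MOM methods operate on.
peak = max(value for value, _ in max_rule)
strongest = [(value, label) for value, label in max_rule if value == peak]
```

With the Max rule this yields 0.68(M), 0.32(H), 0.75(L), and 0.75(M), and `strongest` keeps 0.75(L) and 0.75(M), matching the values used for Figure 5a. The defuzzified outputs themselves (0.13 and 0.62) depend on the output membership functions of Figure 3c and are therefore not recomputed here.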
As seen in Figure 2c, the produced candidate area exists not only in the red box (pedestrian candidate) but also in the yellow box (non-pedestrian candidate). As mentioned earlier, the pedestrian candidate has a higher mean value in the difference image while the non-pedestrian candidate has a low mean value, as in Figure 2b. At this stage, it is not known whether the produced candidate area belongs to the pedestrian or the non-pedestrian category. Therefore, in order to design the FIS accordingly, this study used the mean value of the gradient magnitude in the difference image within the produced candidate as "feature 3". By reflecting "feature 3" as a weight on the FIS output value shown in Figure 5, this work makes an adaptive selection between the two candidates (the yellow and red boxes of Figure 2c), which results in one appropriate candidate for verification by the CNN. Figure 6 describes the two distributions of "feature 3", produced by Gaussian fitting from the pedestrian and the non-pedestrian data used in Figure 3a,b. Similar to the difference image of Figure 2b, the gradient magnitude of the pedestrian candidate is generally higher than that of the non-pedestrian candidate. Therefore, the pedestrian distribution lies to the right of the non-pedestrian distribution, as shown in Figure 6.
In this study, the FIS output value for the pedestrian, shown in Figure 5, is defined as $o_{p}$, and the FIS output value for the non-pedestrian is defined as $o_{n-p}$. The probabilities of the candidate being a pedestrian and a non-pedestrian, obtained from Figure 6, are defined as $p_{p}$ and $p_{n-p}$, respectively. This leads to the final output value ($o_{FIS}$) given by Equation (1). Finally, as given in Equation (2), the system adaptively selects the one candidate that is more appropriate for the CNN-based classification of pedestrian and non-pedestrian. This selection is made between the two (pedestrian) candidates in the visible light and FIR images. The optimal threshold of Equation (2) is experimentally determined based on the pedestrian and non-pedestrian data used in Figure 3a,b.
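The weighting and selection steps can be sketched as follows. Since Equations (1) and (2) are not reproduced in this text, the sketch assumes that the two FIS outputs are combined as a normalized weighted average with the Gaussian probabilities as weights, and that an output at or above the threshold selects the FIR candidate (consistent with a "High" output meaning the FIR candidate). The Gaussian parameters, the threshold of 0.5, and all names are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Value of a fitted Gaussian distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def weighted_fis_output(o_p, o_np, p_p, p_np):
    """Assumed form: weighted average of the pedestrian and non-pedestrian
    FIS outputs, weighted by their probabilities."""
    return (o_p * p_p + o_np * p_np) / (p_p + p_np)

def select_candidate(o_fis, threshold):
    """Assumed rule: a high output selects the FIR candidate, otherwise visible light."""
    return "FIR" if o_fis >= threshold else "visible"

# Hypothetical fitted parameters for "feature 3" (not taken from Figure 6).
PEDESTRIAN_FIT = (0.6, 0.15)       # (mean, standard deviation)
NON_PEDESTRIAN_FIT = (0.3, 0.15)

feature3 = 0.6
p_p = gaussian_pdf(feature3, *PEDESTRIAN_FIT)
p_np = gaussian_pdf(feature3, *NON_PEDESTRIAN_FIT)

# Hypothetical FIS outputs from the pedestrian and non-pedestrian rule tables.
o_fis = weighted_fis_output(o_p=0.7, o_np=0.3, p_p=p_p, p_np=p_np)
choice = select_candidate(o_fis, threshold=0.5)
```

Here "feature 3" sits at the center of the hypothetical pedestrian distribution, so the pedestrian output dominates the combined value and the FIR candidate is selected.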

Classification of Pedestrian and Non-Pedestrian by CNN
This research uses a CNN to classify the candidate chosen by Equation (2) as belonging to either the pedestrian or the non-pedestrian (background) category. As shown in Figure 2d, the candidate can be obtained from the visible light image or the FIR image. Therefore, a candidate from the visible light image is input to the CNN trained with the visible light image training set, whereas a candidate from the FIR image is input to the CNN trained with the FIR image training set. Both structures are identical and are illustrated in Table 5 and Figure 7. As seen in this table and figure, the CNN in this research includes five convolutional layers and three fully connected layers [35]. Input images are the pedestrian and the non-pedestrian candidate images. As each input candidate image has a different size, this paper considers the width-to-height ratio of a general pedestrian, and resizes each image to 119 pixels (width) × 183 pixels (height) × 3 (channels) through bilinear interpolation.
Several previous studies, including AlexNet [36] and others [37,38], used square input images with the same width and height. However, the typical pedestrian area, which this study aims to find, is taller than it is wide. Therefore, when the size is normalized to a square, the image is stretched excessively along its width relative to its height, which distorts the pedestrian area and makes it difficult to extract features accurately. Alternatively, when a square CNN input is selected without stretching along the width, the background area (especially to the left and right of the pedestrian) is heavily reflected in the output, yielding inaccurate features. Considering this, this study uses pedestrian or non-pedestrian images with a normalized size of 119-by-183 pixels (width-by-height) as the CNN input. This size normalization also compensates for changes in object size that depend on the object's location relative to the camera. In addition, this study normalized the brightness of the input image by the zero-center method discussed in [39]. The 119-by-183 pixel (width-by-height) input used in this method is much smaller than the 227-by-227 pixel input used in AlexNet [36]. Therefore, the number of filters in each convolutional layer and the number of nodes in the fully connected layers can be significantly reduced compared with those of AlexNet. Also, AlexNet was designed to classify 1000 classes, whereas this research only needs to distinguish two classes, pedestrian and non-pedestrian areas, which reduces the training time [35].
In the 1st convolutional layer, 96 filters with a size of 11 × 11 × 3 are applied at a stride of 2 × 2 pixels in the horizontal and vertical directions. The size of the resulting feature map is 55 × 87 × 96, where 55 and 87 are the output width and height, respectively. These values follow from: output width (or height) = (input width (or height) − filter width (or height) + 2 × padding)/stride + 1 [40]. For instance, in Table 5, the input height, filter height, padding, and stride are 183, 11, 0, and 2, respectively, so the output height becomes 87 (= (183 − 11 + 2 × 0)/2 + 1). Unlike the previous studies [41,42], this research uses a relatively large filter in the 1st convolutional layer because the input image is by nature very dark with a high level of noise. The enlarged filter suppresses features that would otherwise be extracted wrongly due to the noise. A rectified linear unit (ReLU) layer is then applied, as given by Equation (3) [43][44][45]:
$$y = \max(0, x) \qquad (3)$$
where x and y are the input and output values, respectively. This function lessens the vanishing gradient problem [46] and gives a faster processing speed than other non-linear activation functions [35]. The local response normalization layer is placed after the ReLU layer, as described in Table 5, and is computed as follows:
$$b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(p + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}} \qquad (4)$$
In Equation (4), $b^{i}_{x,y}$ is the value obtained by normalization [36]. In this research, we used 1, 0.0001, and 0.75 for the values of p, α, and β, respectively. $a^{i}_{x,y}$ is the neuron activity computed by applying the ith kernel at the location (x, y), and the sum runs over the n adjacent kernel maps at the identical spatial position [36]. In this study, n was set to 5. N is the total number of filters in the layer.
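The ReLU and local response normalization steps can be illustrated with a minimal pure-Python sketch. It assumes the AlexNet form of the normalization cited as [36], with p = 1, α = 0.0001, β = 0.75, and n = 5 as stated in the text. The function names and the toy activation values are hypothetical, and some frameworks additionally divide α by n, which is not done here.

```python
def relu(x):
    """Equation (3): keep positive inputs, clamp the rest to zero."""
    return max(0.0, x)


def local_response_norm(a, p=1.0, alpha=1e-4, beta=0.75, n=5):
    """Normalize the activations of all kernels at one spatial position.

    `a` holds a^i_{x,y} for i = 0 .. N-1 at a fixed (x, y). The window
    bounds follow the AlexNet definition; alpha is ASSUMED not to be
    divided by n.
    """
    total_kernels = len(a)
    half = n // 2
    out = []
    for i in range(total_kernels):
        lo = max(0, i - half)
        hi = min(total_kernels - 1, i + half)
        energy = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (p + alpha * energy) ** beta)
    return out


# Hypothetical responses of six kernels at one pixel.
activations = [relu(v) for v in [-2.0, 0.5, 3.0, 10.0, -1.0, 4.0]]
normalized = local_response_norm(activations)
```

With these parameter values the denominator stays close to 1 for small activations, so the normalization mainly damps a kernel whose neighbours respond strongly at the same position.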
To make the CNN structure resilient to image translation and local noise, the feature map obtained from the local response normalization layer passes through the max pooling layer given in Table 5. The max pooling layer outputs the maximum value within each defined mask range, which is similar to subsampling. After the max pooling layer, 96 feature maps with a size of 27 × 43 pixels are produced, as shown in Table 5 and Figure 7.
To refine the features of the 1st convolutional layer, the 2nd convolutional layer is placed after it, as given in Table 5 and Figure 7. It has 128 filters with a size of 5 × 5 × 96, a stride of 1 × 1 pixels, and a padding of 2 × 2 pixels (both in the horizontal and vertical directions). As with the 1st convolutional layer, after the ReLU, cross-channel normalization, and max pooling layers, 128 feature maps with a size of 13 × 21 pixels are obtained, as shown in Figure 7 and Table 5. The first two layers are used to extract low-level image features, such as blobs, texture features, or edges.
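The feature-map sizes quoted for the first two convolutional stages can be checked with a short sketch of the output-size formula. The convolution settings are taken from the text. The pooling window of 3 × 3 with a stride of 2 is an assumption (Table 5 is not reproduced here), chosen because it matches the stated 27 × 43 and 13 × 21 sizes. The function names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """(input - filter + 2 * padding) / stride + 1, with integer division."""
    return (size - kernel + 2 * padding) // stride + 1


def trace(width, height):
    """Follow (width, height) through conv1, pool1, conv2 and pool2."""
    stages = [
        ("conv1", 11, 2, 0),  # stated in the text
        ("pool1", 3, 2, 0),   # ASSUMED 3 x 3 window, stride 2
        ("conv2", 5, 1, 2),   # stated in the text
        ("pool2", 3, 2, 0),   # ASSUMED 3 x 3 window, stride 2
    ]
    sizes = {}
    for name, kernel, stride, padding in stages:
        width = out_size(width, kernel, stride, padding)
        height = out_size(height, kernel, stride, padding)
        sizes[name] = (width, height)
    return sizes


sizes = trace(119, 183)
for name, (w, h) in sizes.items():
    print(f"{name}: {w} x {h}")
# conv1: 55 x 87
# pool1: 27 x 43
# conv2: 27 x 43
# pool2: 13 x 21
```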
Then, three additional convolutional layers are used for high-level feature extraction, as given in Figure 7 and Table 5. In detail, the 3rd convolutional layer adopts 256 filters with a size of 3 × 3 × 128, the 4th convolutional layer has 256 filters with a size of 3 × 3 × 256, and the 5th convolutional layer uses 128 filters with a size of 3 × 3 × 256.
Through these five convolutional layers, 128 feature maps with a size of 6 × 10 pixels are finally obtained, which are fed to three additional fully connected layers with 4096, 1024, and 2 neurons, respectively. This research finally classifies two classes, pedestrian areas and non-pedestrian areas, through the CNN. Therefore, the last (3rd) fully connected layer (called the "output layer") of Figure 7 and Table 5 has only two nodes. The 3rd fully connected layer uses the Softmax function, as given by Equation (5) [44]:
$$\sigma(\mathbf{s})_j = \frac{e^{s_j}}{\sum_{n=1}^{K} e^{s_n}} \qquad (5)$$
Given that the array of the output neurons is s, with K elements (here K = 2), the probability of belonging to the jth class is obtained by dividing the exponential of the jth element by the sum of the exponentials of all the elements.
As illustrated in the previous studies [36,47], a CNN-based recognition system can suffer from over-fitting, which causes low recognition accuracy on testing data even though the accuracy on the training data is high. To address this problem, this research employs data augmentation and dropout [36,47], which reduce the effects of over-fitting. More details about the outcome of the data augmentation are presented in Section 4.1. For the dropout method, we adopt a dropout probability of 50% to disconnect several neurons between the previous layer and the next layer in the fully connected network [35,36,47].

Experimental Data and Training
Table 6 and Figure 8 show the sample images from the database (DVLFPD-DB1) used in this study. This database was built independently by our lab and is available, together with our trained CNN model, through [48] so that other researchers can perform comparisons. In total, there are four sub-databases, and the total number of frames of visible light images and FIR images is 4080 each.
To obtain the images, this study used a dual camera system [12] consisting of a Tau640 FIR camera (19 mm, FLIR, Wilsonville, OR, USA) [49], and a C600 visible light web-camera (Logitech, Lausanne, Switzerland) [50]. In order to record the filming conditions, a WH-1091 wireless weather station (Chanju Tech., Paju-si, Gyeonggi-do, Korea) was used [51].
This research conducted the CNN training and testing as a four-fold cross validation using the four sub-databases shown in Figure 8. In addition, it applied data augmentation to mitigate over-fitting during CNN training. For data augmentation, image translation and cropping were used, based on previous research [36]. In other words, the study obtained four additional augmented candidate images from each single original candidate image listed in Table 5. This was achieved by translating the box that contains the pedestrian or non-pedestrian candidate by five pixels (up, down, right, and left) and cropping the image at each shifted location. The augmented data were used only for the CNN training. For testing, the non-augmented original candidate images were used.
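The translation-and-cropping augmentation can be sketched as follows. This is a minimal illustration, not the authors' code. The box format (left, top, width, height), the function names, and the toy image are assumptions, and clamping at the image border is omitted.

```python
def augment_boxes(box, shift=5):
    """Return the four boxes obtained by moving `box` up, down, left, right.

    `box` is (left, top, width, height) with the y axis pointing down.
    """
    left, top, width, height = box
    return {
        "up": (left, top - shift, width, height),
        "down": (left, top + shift, width, height),
        "left": (left - shift, top, width, height),
        "right": (left + shift, top, width, height),
    }


def crop(image, box):
    """Cut `box` out of `image`, given as a list of rows of pixel values."""
    left, top, width, height = box
    return [row[left:left + width] for row in image[top:top + height]]


# Toy 40 x 40 image whose pixel value encodes its own (row, column).
image = [[(r, c) for c in range(40)] for r in range(40)]
original = (10, 10, 12, 18)  # hypothetical candidate box
crops = {name: crop(image, b) for name, b in augment_boxes(original).items()}
```

Each original candidate then contributes five training samples: the original crop and the four shifted crops.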
The experimental conditions in this research were as follows: all the tests were conducted on a desktop computer with an Intel® Core™ i7-3770K CPU @ 3.50 GHz (four CPUs), 16 GB of main memory, and a GeForce GTX 1070 (1,920 CUDA cores) graphics card (NVIDIA, Santa Clara, CA, USA) with 8 GB of memory [52]. The CNN training and testing algorithms were implemented with Windows Caffe (version 1) [53].
This study used the stochastic gradient descent (SGD) method for the CNN training [54]. The SGD method finds the optimal weights, which minimize the difference between the desired and the calculated outputs, based on derivatives [35].
Unlike the gradient descent (GD) method, the SGD method defines the number of iterations per epoch as the training set size divided by the mini-batch size. One epoch is complete when that number of iterations has been run, and training is conducted for a preset number of epochs. The CNN training parameters are as follows: base_lr = 0.01, lr_policy = step, minibatchsize = 128, stepsize = 1013 (5 epochs), max_iter = 4054 (20 epochs), momentum = 0.9, gamma = 0.1, weight_decay = 0.0001, regularization_type = L2. Detailed explanations of these parameters can be found in [53]. Figure 9 shows the loss and the training accuracy of the CNN training process against the number of iterations. In each of the four folds, the loss converges toward 0 and the training accuracy reaches 100% as the number of iterations increases. At this point, the CNN is considered to be fully trained.
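To make the effect of lr_policy = step concrete, the following sketch evaluates the learning rate at a few iterations using the stated base_lr, gamma, and stepsize. It assumes the step policy as documented for Caffe, base_lr × gamma^floor(iter / stepsize). The function name is hypothetical.

```python
def step_lr(iteration, base_lr=0.01, gamma=0.1, stepsize=1013):
    """Learning rate under the 'step' policy: decay by gamma every stepsize."""
    return base_lr * gamma ** (iteration // stepsize)


MAX_ITER = 4054                      # 20 epochs, as stated in the text
iters_per_epoch = MAX_ITER / 20      # about 203 mini-batches of 128 images

for iteration in (0, 1012, 1013, 2026, 3039, 4053):
    print(iteration, step_lr(iteration))
```

Under these settings the learning rate drops by a factor of 10 roughly every 5 epochs, starting from 0.01.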
Figure 10 shows an example of the 96 filters of 11 × 11 × 3 (as shown in Table 5) in the 1st convolutional layer, as obtained through training. For visibility, the filters are enlarged five times by bilinear interpolation. In this study, the experiments used three types of databases: (a) the original DVLFPD-DB1, (b) the degraded DVLFPD-DB1 (see Section 4.2), which adds Gaussian noise and Gaussian blurring to the original database, and (c) an open database (see Section 4.2), the Ohio State University (OSU) color-thermal database [55]. Accordingly, Figure 10 presents the 96 filters obtained from CNN training with each of these three databases. As shown in Table 7 of Section 4.1, the Bisector method has the highest performance among the various defuzzification methods; therefore, Figure 10 shows the filters obtained when using the Bisector method. Comparing Figure 10a,b, the filters suited to edge detection are more distinctive in Figure 10a than in Figure 10b. That is because the edge strength in the degraded DVLFPD-DB1 is reduced by image blurring compared to that in the original DVLFPD-DB1.
In addition, comparing the filters of Figure 10a-c, the left four filters of Figure 10c, from the OSU color-thermal database, are simpler than those of Figure 10a,b. Moreover, the right four filters of Figure 10c do not show directional characteristics, unlike those of Figure 10a,b. That is because the pedestrian and non-pedestrian candidates in the OSU color-thermal database are smaller than those in the original DVLFPD-DB1 and the degraded DVLFPD-DB1, as shown in Figures 8, 11 and 12. Therefore, more local features are extracted from the OSU color-thermal database through CNN training to discriminate the pedestrian and non-pedestrian candidates than from the original DVLFPD-DB1 and the degraded DVLFPD-DB1.

Testing of the Proposed Method
As the first test, the classification accuracy of each FIS defuzzification method was measured; the results are presented in Table 7. This study defines the pedestrian and the non-pedestrian candidates as positive and negative data, respectively, in order to test the performance. Classification outcomes are accordingly defined as true negatives (TN), true positives (TP), false negatives (FN), and false positives (FP). TN is the case where the background (non-pedestrian) candidate is correctly recognized as the background region, whereas TP is the case where the pedestrian candidate is correctly recognized as the pedestrian region. FN is the case where the pedestrian candidate is incorrectly recognized as the background region, whereas FP is the case where the background (non-pedestrian) candidate is incorrectly recognized as the pedestrian region. Based on these, two error rates, the false negative rate (FNR) and the false positive rate (FPR), are defined, along with two accuracies, the true positive rate (TPR) and the true negative rate (TNR). TPR and TNR are calculated as 100 − FNR (%) and 100 − FPR (%), respectively. Table 7 shows the TPR, TNR, FNR, and FPR derived from the confusion matrix. For instance, for the LOM method in Table 7, TPR, TNR, FNR, and FPR are 99.74%, 99.35%, 0.26%, and 0.65%, respectively. Table 7 presents the average of the four testing accuracies produced by the four-fold cross validation. The test showed that the bisector method has a higher classification accuracy than the other methods. Based on this, this study evaluated the testing performance by using the bisector method-based FIS.
The second test compared the classification accuracies of the HOG-SVM-based method [18,22], the CNN and single camera-based method (visible light or FIR camera) [6,10], and the late fusion CNN-based method [13], which are widely used in previously reported pedestrian detection studies.
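The four rates defined above follow directly from the confusion-matrix counts. The sketch below uses hypothetical counts, not values from Table 7, and a hypothetical function name.

```python
def rates(tp, tn, fp, fn):
    """Return TPR, TNR, FNR and FPR in percent from confusion-matrix counts."""
    tpr = 100.0 * tp / (tp + fn)   # pedestrians correctly accepted
    tnr = 100.0 * tn / (tn + fp)   # backgrounds correctly rejected
    return {"TPR": tpr, "TNR": tnr, "FNR": 100.0 - tpr, "FPR": 100.0 - tnr}


# Hypothetical counts for 1000 pedestrian and 1000 background candidates.
r = rates(tp=995, tn=990, fp=10, fn=5)
```

For these counts TPR is 99.5% and TNR is 99.0%, so FNR and FPR are 0.5% and 1.0%, in line with TPR = 100 − FNR and TNR = 100 − FPR.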
For fair comparisons, the same augmented data as in the previous studies [6,10,13,18,22] were used for our method. In addition, the same testing data were used for our method and the previous methods. Table 8 shows the average of the four testing accuracies produced by the four-fold cross validation. As described in Table 8, the proposed method is far more accurate than the previously studied methods.
Also, for performance comparisons, this research used precision, recall, accuracy, and F1 score, as given in Table 9. With TP, TN, FP, and FN, we used the following four criteria for accuracy measurements [56]:
$$\text{Precision} = \frac{\#TP}{\#TP + \#FP}$$
$$\text{Recall} = \frac{\#TP}{\#TP + \#FN}$$
$$\text{Accuracy} = \frac{\#TP + \#TN}{\#TP + \#TN + \#FP + \#FN}$$
$$\text{F1 score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where #TP, #TN, #FP, and #FN are the numbers of TP, TN, FP, and FN, respectively. Precision, recall, accuracy, and F1 score range from 0 (%) to 100 (%), which represent the lowest and highest accuracies, respectively. Table 9 shows the average of the four testing accuracies produced by the four-fold cross validation. As described in Table 9, the proposed method is significantly more accurate than the previous methods.
As the third experiment, this research artificially created a degraded dataset by adding Gaussian noise (sigma of 0.03) and Gaussian blurring (sigma of 0.5) to the original dataset, in order to account for more environmental variables and to evaluate the accuracy under them. Such factors can occur in actual intelligent surveillance camera environments and degrade performance. Therefore, to evaluate performance under such poor conditions, this study created the degraded dataset shown in Figure 11. Tables 10 and 11 show the average of the four testing accuracies obtained by the four-fold cross validation. As shown in Tables 10 and 11, even with the degraded dataset, the proposed method had better classification accuracy than the other methods.
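The degradation used for the third experiment (Gaussian noise with a sigma of 0.03 and Gaussian blurring with a sigma of 0.5) can be sketched in pure Python as below. This is an illustrative sketch only. It assumes that pixel values lie in [0, 1], that noise is added before blurring, that the blur kernel is truncated at three sigma, and that borders are replicated; none of these details are stated in the text. All function names are hypothetical.

```python
import math
import random


def gaussian_kernel(sigma, radius=None):
    """1-D Gaussian weights normalized to sum to 1."""
    if radius is None:
        radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(k * k) / (2 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(image, kernel):
    """Convolve every row with `kernel`, replicating the border pixels."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        last = len(row) - 1
        out.append([
            sum(w * row[min(max(c + k - radius, 0), last)]
                for k, w in enumerate(kernel))
            for c in range(len(row))
        ])
    return out


def transpose(image):
    return [list(col) for col in zip(*image)]


def degrade(image, noise_sigma=0.03, blur_sigma=0.5, seed=0):
    """Add Gaussian noise, then apply a separable Gaussian blur."""
    rng = random.Random(seed)
    noisy = [[p + rng.gauss(0.0, noise_sigma) for p in row] for row in image]
    kernel = gaussian_kernel(blur_sigma)
    blurred = transpose(blur_rows(transpose(blur_rows(noisy, kernel)), kernel))
    # Clamp back to the ASSUMED [0, 1] intensity range.
    return [[min(1.0, max(0.0, p)) for p in row] for row in blurred]


# Toy 8 x 8 image with a vertical edge between columns 3 and 4.
image = [[0.0] * 4 + [1.0] * 4 for _ in range(8)]
degraded = degrade(image)
```

Blurring spreads the sharp edge across neighbouring columns, which is consistent with the reduced edge strength reported for the degraded database.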
The fourth experiment is based on an open database (the OSU color-thermal database) [55] so that other researchers can make fair comparisons. As shown in Figure 12, the OSU color-thermal database consists of images captured by an FIR camera and a visible light camera in a fixed outdoor setting under various environmental factors. Tables 12 and 13 show the average of the four testing accuracies obtained by the four-fold cross validation. As Tables 12 and 13 present, the proposed method shows a higher accuracy even with the OSU color-thermal database.
Figure 13 shows the receiver operating characteristic (ROC) curves, based on TPR and FPR, of the proposed method and the other methods for the three types of databases. The figure presents the average of the four testing results obtained by the four-fold cross validation. As explained before, FNR (100 − TPR (%)) has a trade-off relationship with FPR: depending on the classification threshold, a larger FNR corresponds to a smaller FPR, and vice versa. The equal error rate (EER) is the error rate (FNR or FPR) at the point where FNR equals FPR. As shown in Figure 13, the accuracy of the proposed method is significantly higher than that of the previous methods.
Figure 14 shows examples of correct classification. Although the candidates were obtained in various environments of noise, blurring, size, and illumination, all the TP and TN cases are correctly recognized. Figure 15 shows examples of incorrect classification. In Figure 15a-c, the left and right images show the FP and FN cases, respectively. The FP errors happen when the shape of the background is similar to a pedestrian (Figure 15a), when a lot of noise is included (Figure 15b), and when the shape of a shadow is similar to that of a pedestrian (Figure 15c). The FN errors occur when part of the pedestrian is occluded in the candidate box (Figure 15a), when a lot of noise is included (Figure 15b), and when a large background area is included in the detected pedestrian box (Figure 15a,c).
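The equal error rate mentioned in the ROC discussion can be estimated from classifier scores by sweeping the decision threshold and locating the point where FNR and FPR are closest. The sketch below is a simple illustration with hypothetical scores, not values from Figure 13. It assumes that a higher score means "pedestrian", and the function names are hypothetical.

```python
def error_rates(pos_scores, neg_scores, threshold):
    """Return (FNR, FPR) in percent; score >= threshold means pedestrian."""
    fn = sum(1 for s in pos_scores if s < threshold)
    fp = sum(1 for s in neg_scores if s >= threshold)
    return 100.0 * fn / len(pos_scores), 100.0 * fp / len(neg_scores)


def equal_error_rate(pos_scores, neg_scores):
    """Try every observed score as threshold; keep the smallest |FNR - FPR|."""
    best = None
    for t in sorted(set(pos_scores) | set(neg_scores)):
        fnr, fpr = error_rates(pos_scores, neg_scores, t)
        gap = abs(fnr - fpr)
        if best is None or gap < best[0]:
            best = (gap, t, (fnr + fpr) / 2.0)
    _, threshold, eer = best
    return threshold, eer


# Hypothetical CNN output scores for five pedestrians and five backgrounds.
pedestrian = [0.95, 0.90, 0.85, 0.70, 0.40]
background = [0.05, 0.10, 0.20, 0.35, 0.60]
threshold, eer = equal_error_rate(pedestrian, background)
```

Raising the threshold in this sketch increases FNR and decreases FPR, which is the trade-off described in the text; for the toy scores the two rates meet at 20% with a threshold of 0.60.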

Conclusions
This paper proposed a method that uses the FIS to adaptively select the more appropriate of the two pedestrian candidates from visible light and FIR camera images, and then verifies that candidate with the CNN. To test the accuracy of the algorithm under various conditions, the study used not only the independently designed DVLFPD-DB1 but also the degraded DVLFPD-DB1, which combines the original DVLFPD-DB1 with Gaussian blurring and noise. In addition, the OSU color-thermal database, an open database, was used to compare the accuracy of the proposed method with that of the others.
CNNs have been widely used for their performance in various fields. However, using a CNN requires intensive training with a large amount of training data. In many applications, collecting a large amount of training data is difficult, so a subsequent data augmentation process is performed. To lessen this disadvantage of CNN-based methods, we have made our trained CNN model, together with our collected DVLFPD-DB1 and its version degraded by Gaussian blurring and noise, publicly available to other researchers for the purpose of performing comparisons. In future work, the proposed method can form the basis for studying crime recognition and face detection of criminals. Further, there are plans to conduct research on sensing emergency situations in vehicular environments by detecting various subjects through the front camera of the vehicle, in order to apply the proposed method to a driver assistance system.
