CNN-Based Crosswalk Pedestrian Situation Recognition System Using Mask-R-CNN and CDA

Researchers are studying CNN (convolutional neural networks) in various ways for image classification. Sometimes, two or more objects in an image must be classified into different situations according to their location. We developed a new learning method that colors the objects extracted from images so that the relationship between objects of different colors can be distinguished. This method can be applied to specific situations, such as pedestrians in a crosswalk. This paper presents a method for learning pedestrian situations with CNN using Mask R-CNN (Region-based CNN) and the CDA (Crosswalk Detection Algorithm). With this method, we classified the location of pedestrians into two situations: safety and danger. We organized the preprocessing and learning of images into three stages. In Stage 1, we used Mask R-CNN to detect pedestrians. In Stage 2, we detected crosswalks with the CDA and colored the detected objects. In Stage 3, we combined the crosswalk and pedestrian objects into one image and trained CNN on that image. We trained ResNet50 and Xception using images produced by the proposed method and evaluated the accuracy of the results. In the experiments, ResNet50 exhibited 96.7% accuracy and Xception showed 98.7% accuracy. We then created images that simplified the situation into two colored boxes representing the crosswalk and the pedestrian. We confirmed that ResNet50 trained with these box images could classify the same test images used in the previous experiment with 96% accuracy. This result indicates that the proposed system is suitable for classifying safe and dangerous pedestrian situations by accurately distinguishing the positions of the two objects.


Introduction
Researchers initially used CNN [1] to classify objects in images. Later, they found ways to differentiate and detect each object in an image. In addition, they studied object relationships and situation recognition between detected objects. Our research aimed to perform situation recognition between objects by detecting pedestrians and crosswalks in images. Through deep learning training, our system can distinguish whether a pedestrian in a crosswalk is safe from the driver's point of view.
Many researchers have studied object detection and object relationships. Hu et al. [2] devised an object relationship module that uses original and geometric weights to capture the dependence between objects for object detection [3]. Redmon et al. [4,5] used YOLO (You Only Look Once) for object detection. Yatskar et al. [6] suggested extracting objects and labeling visual semantic roles in the image using VGG (Visual Geometry Group) for situation recognition. Dai et al. [7] developed a situation recognition system that combined an object detection stage using Fast R-CNN with a mutual recognition stage consisting of a pair filtering system for subject features and DR-Net. Li et al. [8] proposed a GNN (Graph Neural Network) that predicted the relationship between objects in images by analyzing the objects on a graph. Furthermore, Shi et al. [9] recently suggested a new gait recognition system based on a deep learning network and multimodal inertial sensors.
Díaz-Cely et al. [10], Bianco et al. [11], and Geirhos et al. [12] proposed that color was an essential factor in CNN learning. Therefore, we devised a new method that colors the objects extracted from an image so that the relationship between two objects of different colors can be distinguished. After training on the colored objects, we demonstrated that the method accurately classified the relationship between them. We applied the method to classify the location of a pedestrian relative to a crosswalk. We trained CNN on images in which the crosswalks and pedestrians were extracted in different colors. We classified the pedestrian as safe when their feet were inside the crosswalk and as in danger when their feet were outside. In the experiments, the trained CNNs classified the test images accurately.
For the safety of pedestrians, it is critical to determine whether a pedestrian is walking inside or outside the crosswalk area. According to the National Highway Traffic Safety Administration of the U.S.A. [13], many of the fatal traffic accidents involve pedestrians at crosswalks. We detected pedestrians on the crosswalk using Mask R-CNN [14]. We then developed and applied the CDA to detect crosswalks. We colored the extracted pedestrians in red and the crosswalks in black. Among CNNs [15][16][17][18][19], we trained Xception [18] and ResNet50 [19] using the colored image data. In the experiments, the trained CNNs classified the test data as safe or dangerous with 96-98% accuracy.
The contributions of this paper are as follows:
Firstly, we proposed a CNN-based crosswalk-pedestrian situation recognition system that detects crosswalks and pedestrians in images and determines whether people are safe or in danger, depending on their location. We present further details of the method in Section 3.
Secondly, we developed the CDA and applied it in the system to detect crosswalks in images. We provide further details of the algorithm in Section 3.2.
Thirdly, we created simple colored box shapes and used them for CNN learning. The experiment showed little difference in accuracy between training with actual field photo images and training with simplified boxes. We therefore expect that training datasets for CNNs can be prepared at a low cost in time and money. We present further details about the simple box shapes in Section 4.2.

Related Works
The first object detection network using deep learning was the R-CNN [20], announced in 2014. Object detection has since developed into networks such as Fast R-CNN [21] and Faster R-CNN [22]. Mask R-CNN is a network that adds a fully convolutional network (FCN) to Faster R-CNN. It consists of two stages. The first is the region proposal network (RPN), which extracts the object's location. The second stage predicts the binary mask, box offset, and class in parallel for each region of interest (RoI). Because Mask R-CNN applies the FCN to the RoIs extracted by the RPN, it can predict an M × M mask without losing information. Faster R-CNN cannot split instances on a pixel basis, whereas Mask R-CNN can. Furthermore, Mask R-CNN is capable of efficient computation, surpassing the performance of existing state-of-the-art (SOTA) networks.
Larson et al. [23] proposed and evaluated a dynamic passive pedestrian detection (DPPD) system for crosswalks using optical and thermal sensors. Their sensor-based image analysis system determined the pedestrian's location in real time. With thermal sensors, it located pedestrians with an average accuracy of 89% and a standard deviation of 10%. With optical sensors, it achieved an average accuracy of 82% and a standard deviation of 8%. Their system can also detect atypical pedestrians, that is, unstructured pedestrians such as those pulling strollers or using umbrellas. Thus, Larson et al. enabled further system improvement by creating and evaluating two new systems.
In 2020, Zhang et al. [24] proposed a system for judging whether pedestrians obey traffic laws. They used an LSTM (Long Short-Term Memory) neural network to predict pedestrian behavior. In their paper, the situation at the crosswalk was judged by predicting the pedestrian's behavior when a red light was displayed. They used nine characteristics, such as gender, walking direction, and group behavior. Their system applied deep learning to predict pedestrians' unexpected behavior on crosswalks and exhibited 91.6% accuracy. Their system helps prevent collisions between pedestrians and vehicles on crosswalks.
Other works that detect pedestrians include studies by Prioletti et al. [25], Hariyono et al. [26], Hariyono and Jo [27], Keller and Gavrila [28], and Keller et al. [29]. Prioletti et al.'s system distinguished whether pedestrians were on the road using a cascade classifier and the histogram of oriented gradients (HOG). Hariyono and Jo developed an algorithm that extracted edges by Hough line detection and color-based extraction, recognized crosswalks, and detected whether there were pedestrians in crosswalk areas using three classifiers. However, they did not apply deep learning in their system. Furthermore, Dow et al. [30] applied YOLO to detect pedestrians, and Zhang et al. [31] used YOLOv5 for crosswalk detection. Malbog [32] suggested pedestrian crosswalk detection using Mask R-CNN but did not investigate the relationship between pedestrians and crosswalks. Thus far, research has focused on simple image analysis or on predicting pedestrians' positions within the crosswalk using devices such as thermal and optical sensors.
Additionally, detection using YOLO has the disadvantage of not representing the locations of pedestrians and crosswalks accurately at the pixel level. Thus, we tested a new method using Mask R-CNN and the CDA. Our trained CNN enabled us to classify danger and safety based on the locations of pedestrians and crosswalks.
Table 1 compares previous studies on crosswalk pedestrian situation recognition with our study on five items and shows actual detected sample images. The five items are the accuracy of detecting a pedestrian and a crosswalk, the use of deep learning, whether crosswalks are detected, whether the view is that of the car driver, and the method of detecting the crosswalk. Our system showed the highest accuracy and satisfied all other items, showing superior results to the systems presented in previous research. When we tried to detect the crosswalk with YOLO, we obtained only a simple box shape rather than the exact crosswalk shape. Additionally, Mask R-CNN could not detect crosswalks satisfactorily when the stripes were invisible because of the road condition. So, we developed our CDA, which can draw the crosswalk shape even if some parts of the zebra stripes are erased.
Table 1 (surviving rows only).
(study label missing): Detect only a pedestrian; O; X; X; X
Etc. [25][26][27][28][29]: Detect only a pedestrian; X; X; X; X

Proposed Method
Figure 1 depicts the overall process of our proposed method. Our system consists of three stages. Stage 1 uses Mask R-CNN to detect pedestrians in the original images and extract them in red. Stage 2 uses the CDA to detect crosswalks in the original images and extract them in black. In Stage 3, we train CNN using training images that combine the images created in Stages 1 and 2. We labeled a situation as safety (in) when a person walked inside a crosswalk and as danger (out) when the person walked outside. Subsequently, we evaluated the trained CNN with test images different from the training images. We created the test images through Stages 1, 2, and 3 in the same way as the training images.

Stage 1: Pedestrian Detection Using Mask R-CNN
Mask R-CNN is a segmentation model that localizes objects at the pixel level. It uses RoIAlign to distinguish objects pixel by pixel in an image. The RoIAlign technique prevents the loss of object location information and yields accurate feature maps. Figure 2 describes the Mask R-CNN framework for instance segmentation. Bakr et al. [33] showed that Mask R-CNN could distinguish an object's shadow from the object with 98.09% accuracy.
Figure 3 presents the process of Stage 1, which uses Mask R-CNN to detect pedestrians in the original images. We extracted the masks of the pedestrians detected by Mask R-CNN in red. We used Mask R-CNN because it can extract pedestrian objects from the images on a pixel basis. Additionally, the selected pedestrian pixels were colored uniformly in red for CNN learning.

Stage 2: Crosswalk Detection Using the Crosswalk Detection Algorithm (CDA)
Figure 4 depicts the process of Stage 2, which uses the CDA to detect crosswalks in the original images. A crosswalk is usually painted white and drawn on the road as several white rectangles. So, the images are first converted to black and white. A white area larger than a specific number of pixels is treated as part of the crosswalk. The CDA then draws polylines that connect all crosswalk parts. Finally, we extract the crosswalk area detected by the CDA in black. The crosswalk area is colored uniformly for CNN learning.
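The thresholding and size filtering described above can be illustrated with a minimal sketch. It is not our implementation of the CDA: the function names, the brightness level of 200, the use of 4-connectivity, and the tiny example image are assumptions made for illustration, and plain Python lists stand in for real images.

```python
from collections import deque


def threshold(gray, level=200):
    """Binarize a grayscale image given as rows of 0-255 values.

    The level of 200 is an assumed value, not one taken from the paper.
    """
    return [[1 if px >= level else 0 for px in row] for row in gray]


def white_regions(binary, min_area):
    """Return 4-connected white regions that contain more than min_area pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 0 or seen[y][x]:
                continue
            # Breadth-first flood fill collects one connected white region.
            queue = deque([(x, y)])
            seen[y][x] = True
            region = []
            while queue:
                cx, cy = queue.popleft()
                region.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            # Small white blobs are treated as noise and dropped.
            if len(region) > min_area:
                regions.append(region)
    return regions


# Hypothetical 3 x 5 image: one 2 x 2 white block and one isolated bright pixel.
gray = [
    [255, 255, 0, 0, 0],
    [255, 255, 0, 0, 250],
    [0, 0, 0, 0, 0],
]
binary = threshold(gray)
stripes = white_regions(binary, min_area=3)
```

In this example only the 2 x 2 block survives the size filter; the isolated bright pixel is discarded as noise.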
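Algorithm 1 presents the CDA, which proceeds in five steps: (1) thresholding; (2) morphological opening and closing; (3) finding, filtering, and combining contours; (4) obtaining a convex hull, sorting the contour points, and drawing polylines; and (5) filling the inside of the polylines with black.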

Experiments
We performed experiments to test whether the proposed method recognizes the situation of pedestrians on the crosswalk. Table 2 presents the experimental details. The dataset consisted of two classes, safety (inside) and danger (outside), covering 510 cases of pedestrian situations on the crosswalk. There were 510 original images and 510 corresponding processed images. Of the 510 images, we used 360 as training images and 150 as test images. For Experiment II, we created a total of 600 box images. We used the 600 box images as training data and the 150 processed test images of Experiment I as test data. We trained ResNet50 and Xception and tested them with our datasets in the Google Colaboratory online environment. The learning rate was 0.001 and the number of training epochs was 100. Figure 5 depicts the images of the datasets used. Figure 5a presents the original images, and Figure 5b shows the processed images created through the proposed method.
Figure 6a depicts the case of Xception trained with the original images, which yielded a test accuracy of 68.8%. Figure 6b displays the case of ResNet50 trained with the original images, which yielded a test accuracy of 70%. However, training the CNNs with processed images resulted in a significant improvement in accuracy. Figure 6c presents the case of Xception trained with the processed images, which yielded an accuracy of 98.7%. Figure 6d displays the case of ResNet50 trained with the processed images, which yielded an accuracy of 96.7%. These results reveal that CNN can judge pedestrian situations when trained using our proposed method.

Experiment II
We refer to the images of Figure 5c used in this experiment as box images. Figure 7 presents the overall process of the proposed system using box images. The box images represent the scene using only color and box shapes: the red box denotes a pedestrian and the black box indicates a crosswalk. The experimental environment was the same as in Experiment I.
Figure 8 shows the result of training the CNNs using the data of Figure 5c.
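Because a box image contains only two rectangles, its safety or danger label follows directly from their relative position. The sketch below illustrates one way to derive such a label. It is an illustration under stated assumptions rather than the procedure we used to build the dataset: the `Box` type, the function names, the example coordinates, and the choice of the bottom-center of the pedestrian box as the position of the feet are all hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box in image coordinates (y grows downward)."""
    left: int
    top: int
    right: int
    bottom: int


def feet_point(pedestrian):
    """Approximate the feet as the bottom-center of the pedestrian box (assumption)."""
    return ((pedestrian.left + pedestrian.right) / 2, pedestrian.bottom)


def label_situation(pedestrian, crosswalk):
    """Return 'safety' if the feet lie inside the crosswalk box, else 'danger'."""
    x, y = feet_point(pedestrian)
    inside = (crosswalk.left <= x <= crosswalk.right
              and crosswalk.top <= y <= crosswalk.bottom)
    return "safety" if inside else "danger"


# Hypothetical coordinates for a black crosswalk box and two red pedestrian boxes.
crosswalk = Box(left=20, top=60, right=180, bottom=100)
on_crosswalk = Box(left=90, top=30, right=110, bottom=80)
beside_crosswalk = Box(left=185, top=30, right=205, bottom=80)

labels = [label_situation(on_crosswalk, crosswalk),
          label_situation(beside_crosswalk, crosswalk)]
```

Note that the upper body of a pedestrian may extend above the crosswalk box; only the feet position decides the label, which matches the rule that a pedestrian is safe when their feet are inside the crosswalk.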

Conclusions
In this paper, we processed images with Mask R-CNN and the CDA, a self-developed algorithm, and increased the accuracy of classifying whether pedestrians were in safe or dangerous situations. We trained and tested CNNs using the preprocessed images and achieved 98.7% accuracy for Xception and 96.7% for ResNet50. Therefore, our proposed system is suitable for classifying pedestrian situations. Furthermore, we trained CNNs with box images, which achieved 94.5% accuracy for Xception and 96% for ResNet50 on the same test data. Because box images are easy to prepare as training data, we expect that CNNs can be trained in this way at a low cost in time and money.

Figure 3. Detecting a pedestrian using Mask R-CNN.

Algorithm 1. The CDA.
(1) Thresholding: make the white crosswalk visible by making it clear what is white and what is not.
(2) Morphology opening and closing operation: reduce noise by erosion and dilation in areas other than crosswalks, and make the crosswalk area clear.
(3) Find contours, filter good contours, and combine good contours: obtain the crosswalk contours, extracting multiple contours from one image. If a contour is larger than a specific pixel area (e.g., 170 pixels), it is judged to be part of the crosswalk. Combine the values of the good contours array.
(4) Obtain a convex hull, sort the points of contours, and draw polylines: the convex hull function combines the good contours with the original image so that the small rectangles become one large rectangle. Sort the points of the combined contours by x-coordinate (in case of a tie, sort by y-coordinate). Draw polylines in red.
(5) Fill the inside of the polylines with black.
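Step (4) sorts the contour points by x-coordinate and breaks ties by y-coordinate, which is the ordering that the monotone-chain convex hull construction starts from. The following minimal sketch shows that construction on the corner points of two stripes. It is offered as an illustration only: whether the CDA builds the hull this way is an assumption, and the function names and coordinates are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull; collinear boundary points are dropped."""
    # Tuples sort by x first and by y on ties, as in step (4).
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop points that would make a clockwise or straight turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


# Corner points of two hypothetical crosswalk stripes, one above the other.
stripe_corners = [
    (0, 0), (4, 0), (4, 1), (0, 1),
    (0, 3), (4, 3), (4, 4), (0, 4),
]
hull = convex_hull(stripe_corners)
```

The hull of the two separate stripes is a single rectangle that also covers the gap between them, which is why a crosswalk can still be outlined when some stripes are partly erased.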

Figure 3
Figure3presents the process of Stage 1, which uses Mask R-CNN to detect pedestrians in original images.We extracted the masks of pedestrians detected by Mask R-CNN in red.We used Mask R-CNN because it can extract pedestrian objects on a pixel basis from the images.Additionally, selected pedestrian pixels were colored in red uniformly to learn CNN.


3.3. Stage 3: Training Using CNN
Stage 3 combines the images created using Mask R-CNN and the CDA. Where the crosswalk image overlaps with the pedestrian image, the crosswalk must not overwrite the pedestrian. Subsequently, the CNN is trained on images of the pedestrian's safe situation (inside) and the pedestrian's dangerous situation (outside). Finally, we confirmed the performance by testing the trained CNN on test images that differ from the training images.
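One simple way to honor the overlap rule is to paint the crosswalk first and the pedestrian last. The sketch below assumes boolean masks for both objects, a black background, and a blue crosswalk color; none of these details are specified in this section, and the name `combine_layers` is hypothetical.

```python
import numpy as np

PEDESTRIAN_COLOR = (255, 0, 0)  # red, as in Stage 1
CROSSWALK_COLOR = (0, 0, 255)   # assumed color, not given in this section


def combine_layers(shape, crosswalk_mask, pedestrian_mask):
    """Paint the crosswalk first and the pedestrian last, so pedestrian pixels win on overlap."""
    canvas = np.zeros((*shape, 3), dtype=np.uint8)  # assumed black background
    canvas[crosswalk_mask] = CROSSWALK_COLOR
    canvas[pedestrian_mask] = PEDESTRIAN_COLOR
    return canvas


# Toy 6x6 scene: the crosswalk fills the bottom rows, the pedestrian straddles its edge.
crosswalk = np.zeros((6, 6), dtype=bool)
crosswalk[3:6, :] = True
pedestrian = np.zeros((6, 6), dtype=bool)
pedestrian[2:5, 2:3] = True
combined = combine_layers((6, 6), crosswalk, pedestrian)
```

Because the pedestrian layer is written last, the overlapping pixels keep the pedestrian color and the crosswalk never covers the pedestrian.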
Figure 5 depicts the images of the datasets used. Figure 5a presents the original images, and Figure 5b shows the processed images created through the proposed method. The images in Figure 5a,b are the training and test images of Experiment I. Figure 5c displays the training images of Experiment II. We trained ResNet50 and Xception on the training images for 100 epochs in a Google Colaboratory environment. The source of our dataset is available on GitHub [34].

Figure 5. Three types of datasets of the experiments; (a) Original Images and (b) Processed Images applied for Experiment I, and (c) Box Images applied for Experiment II.

Figure 6 presents the results of training CNNs using the data of Figure 5a,b. Figure 6a depicts the case of Xception trained with the original images, which yielded a test accuracy of 68.8%. Figure 6b displays the case of ResNet50 trained with the original images, which yielded a test accuracy of 70%. However, training CNNs with processed images resulted in a significant improvement in accuracy. Figure 6c presents the case of Xception trained with the processed images, which yielded an accuracy of 98.7%. Figure 6d displays the case of ResNet50 trained with the processed images, which yielded an accuracy of 96.7%. These results reveal that a CNN can judge pedestrian situations when trained using our proposed method.
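The size of the improvement can be read off the reported figures with a few lines of arithmetic. The snippet below only restates the accuracies given above; the variable names are illustrative.

```python
# Test accuracies (%) reported for Experiment I.
original = {"Xception": 68.8, "ResNet50": 70.0}
processed = {"Xception": 98.7, "ResNet50": 96.7}

# Gain in percentage points when training on processed instead of original images.
gains = {model: round(processed[model] - original[model], 1) for model in original}
for model, gain in gains.items():
    print(f"{model}: +{gain} percentage points")
```

Preprocessing therefore adds roughly 30 percentage points for Xception and roughly 27 for ResNet50.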


Figure 6. Results of Experiment I: (a) training and testing with original images using Xception; (b) training and testing with original images using ResNet50; (c) training and testing with processed images using Xception; (d) training and testing with processed images using ResNet50.
Figure 8a depicts the case of Xception trained with box images, which yielded a test accuracy of 94.5%. Figure 8b displays the case of ResNet50 trained with box images, which yielded a test accuracy of 96%. Table 3 presents the accuracy (%) of Experiments I and II tested by ResNet50 and Xception. These results reveal that training CNNs with images created by coloring simple shapes is nearly as accurate as the method used in Experiment I. With ResNet50, the difference between the two experiments was only 0.7 percentage points (96.7% versus 96%). If an AI can be trained with box images for a specific purpose, as in our case, training datasets can be prepared with less time and at lower economic cost.
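To show why box images are cheap to prepare, the sketch below draws one with two filled rectangles in NumPy. The image size, box coordinates, crosswalk color, and the name `make_box_image` are assumptions for illustration and do not describe how the authors generated their dataset.

```python
import numpy as np


def make_box_image(size, crosswalk_box, pedestrian_box):
    """Draw two filled rectangles, each given as (top, left, bottom, right), end-exclusive."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)  # assumed black background
    top, left, bottom, right = crosswalk_box
    canvas[top:bottom, left:right] = (0, 0, 255)  # assumed crosswalk color
    top, left, bottom, right = pedestrian_box
    canvas[top:bottom, left:right] = (255, 0, 0)  # red pedestrian box, drawn on top
    return canvas


# Hypothetical layouts: pedestrian box inside the crosswalk box, and outside it.
inside_example = make_box_image(64, (40, 0, 64, 64), (44, 28, 60, 36))
outside_example = make_box_image(64, (40, 0, 64, 64), (5, 28, 21, 36))
```

Varying the box coordinates yields many labeled examples of the inside and outside situations without any photography or annotation.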

Figure 7. Overall Process: CNN-based crosswalk-pedestrian situation recognition system using box images.

Figure 8. Results of Experiment II: (a) train and test with box images using Xception; (b) train and test with box images using ResNet50.


Table 1. Comparison of our system to others; Accuracy and CDA.
Columns: Systems; Accuracy of Detecting a Pedestrian & a Crosswalk; Use of Deep Learning; Detect Crosswalk; The Sight of the Car Driver; Method of Detecting Crosswalk; Samples of Detecting Crosswalk.
Dow et al. [30]; Detect only a pedestrian; O; box shape; X; YOLO.

Larson et al. enabled further system improvement by creating and evaluating two new systems.

Algorithm 1: The Crosswalk Detection Algorithm (CDA)
Input: Original Image ($\beta_1, \ldots, \beta_w$), where $w$ is the number of pixels in the crosswalk image.

Table 2. Experimental details for Training CNN; ResNet50 and Xception.