Detection of Parking Slots Based on Mask R-CNN

Abstract: Obtaining information on parking slots is a prerequisite for the development of automatic parking systems, which are an essential part of automated driving. In this paper, we propose a parking-slot-marking detection approach based on deep learning. The detection process generates a mask of the marking-points with the Mask R-CNN algorithm, then extracts parking guidelines and parallel lines on the mask using line segment detection (LSD) to determine the candidate parking slots. The experimental results show that the proposed method works well under complex illumination and on around-view images from different sources, with a precision of 94.5% and a recall of 92.7%. The results also indicate that it can be applied to diverse slot types, including vertical, parallel and slanted slots, which makes it superior to previous methods.


Introduction
Parking is an essential task in driving, since every vehicle has to park at some point during or at the end of a trip. However, completing this task conveniently and safely is not trivial. As a result, empirical research on intelligent, safe and convenient parking systems abounds in the literature [1,2]. One such system that has drawn much attention is the automatic parking system (APS), whose major component is the identification of parking slots, which affects path planning and motion control.
Three main methods have been employed to detect parking slots: the infrastructure-based approach, the free-space-based approach and the parking-slot-marking-based approach. The infrastructure-based approach [3][4][5] usually requires intelligent infrastructure to be installed in the parking slots so that parking-slot information can be obtained in real time. Its drawback is the high cost of commercialization. The free-space-based approach [6][7][8][9][10][11][12][13][14][15][16][17][18][19] has been widely studied and typically uses different types of range sensors, such as short-range radars [6,7], lidars [8][9][10][11] and ultrasonic radars [12][13][14][15], to detect parking slots. Compared with ultrasonic radars, short-range radars and lidars operate at higher frequencies, so they can detect parking slots at longer distances and provide more accurate edge information, especially when driving at high speed. However, these two sensors are expensive. Ultrasonic radar is the most widely used because it is cheap and easy to install, but it usually needs a complex algorithm to improve the detection accuracy of parking slots. The common disadvantage of these sensors is that they cannot detect free spaces when there is no adjacent vehicle, and their accuracy depends on the positions of adjacent vehicles.
Compared with the free-space-based approach, the parking-slot-marking-based approach provides more accurate information on parking slots. It uses around-view monitoring (AVM) systems composed of multiple fisheye cameras, which provide a wider field of vision and do not depend on the presence of adjacent vehicles, and it detects the parking slots with visual algorithms.
Research on AVM systems mainly focuses on camera calibration [16][17][18] and image stitching [19][20][21]. Zhang's calibration [16] maintains calibration accuracy while keeping the calibration experiment simple, and it has become one of the main methods for camera calibration in machine vision systems. Mainstream image stitching methods usually select image features (such as SURF [19] or SIFT [20]) for automatic registration. These methods are insensitive to geometric transformations of the image and highly robust, but they are computation-intensive and time-consuming, so they cannot satisfy the real-time requirements of an AVM system.
Most current visual algorithms are based on two visual features, corners and lines, detected by low-level vision algorithms (such as the FAST detector [22], the Harris detector [23][24][25], the Hough transform [26], the Radon transform [27] and RANSAC [28][29][30]). These algorithms are sensitive to lighting, which makes it difficult to maintain robustness. To solve this problem, this paper proposes a method to detect parking slots based on deep learning. The mask of marking-points is generated by Mask R-CNN [31]; the line segment detection (LSD) algorithm [32] is then applied to the mask and the interference lines are filtered out, so that the guidelines and parallel lines can be found to finalize the candidate parking slots.
Our contributions in this paper are summarized as follows:
(1) A method for detecting parking slots based on Mask R-CNN is proposed. Specifically, we employ ResNet101 [33] and feature pyramid networks (FPN) [34] to extract and combine the image features of marking-points, which gives more robust detection under varied illumination conditions than traditional detectors and detectors trained by machine learning.
(2) The proposed method can detect parking slots with different tilt angles and accurately separate the parking guidelines from the adjacent lane lines, which is superior to previous methods.
(3) Previous learning-based methods were trained on a single type of image. We produce and collect different types of AVM images for training. The proposed method accurately detects the marking-points in AVM images with different stitching effects and gives more robust detection results.
The rest of this paper is organized as follows. Section 2 introduces related research, Section 3 explains the generation of around-view images, Section 4 describes the proposed scheme and Section 5 presents the experimental results. Finally, Section 6 concludes the paper.

Traditional Algorithms
Traditional feature-based algorithms can be categorized into two types: slot-corner-detection-based methods [22][23][24][25] and line-detection-based methods [26][27][28][29][30]. For instance, Chen et al. [22] used the FAST detector to recognize corners and determine parking slots. Jae-kyu et al. [23][24][25] used corner features to identify slot patterns and parking slots. They used the Harris detector to extract the lowest-level features (corners) in a hierarchical tree structure. Although their results indicated decent performance in parking-slot identification, it was hard to obtain the exact position and heading angle of the corner features. The success rate also depended largely on the robustness of the corner detector.
Line-detection-based algorithms, on the other hand, include the Hough transform [26], the Radon transform [27] and RANSAC [28][29][30]. Jung et al. [26] detected parking slots by finding parallel line pairs using a specialized filter and the Hough transform. Wang et al. [27] utilized a similar method that locates parallel line pairs using the Radon transform. Jae-kyu et al. [28] used RANSAC to detect separating lines in AVM images; parking slots were generated by pairing the separating lines based on geometric constraints. Line detection determines the location and orientation of parking slots quickly. However, it is susceptible to interference from external objects around the parking slots, such as lane lines, curbs and buildings. Therefore, complex filters or classification methods need to be designed.

Machine Learning and Deep Learning
The rapid development of machine learning in image processing has also reached parking-slot detection. For instance, Zhang et al. [35] proposed a parking-slot detection method in which multiple detectors are trained on a dataset of marking-points with an AdaBoost cascade classifier, and the parking slots are then inferred with six Gaussian templates. Li et al. [36] first used the LSD algorithm to detect parallel line pairs, then determined the T-shaped or L-shaped marking-points of parking slots with a trained detector. Compared with the approach of Zhang et al. [35], only one detector needs to be trained. Although these two methods learn image features and thereby greatly improve the accuracy and stability of marking-point detection, their ability to extract image features limited them to parallel and vertical parking slots. They extracted the marking-points by manually selecting low-level image features (such as Haar and LBP) for training, which leaves a gap relative to deep learning.
As a preliminary attempt at using deep learning for parking-slot detection, Zinelli et al. [37] roughly detected the four vertices of parking slots on the AVM image using Faster R-CNN. Jang et al. [38] proposed a method using semantic segmentation and vertical grid encoding to detect parking slots and static obstacles based on the AVM system. Many deep-learning algorithms for parking-slot detection use static images taken from a fixed camera [39][40][41]. These approaches are typically used to detect vacant slots for parking-lot management and are not applicable to dynamic environments, as they rely on strong a priori knowledge of the slot locations. Our study seeks to fill this gap by developing a parking-slot detection approach that uses Mask R-CNN to detect parking slots in dynamic environments and under various illumination conditions.

Generation of Around-View Images
The around-view parking-assist system provides drivers with real-time visual information on the environment surrounding the vehicle. It usually consists of multiple fisheye cameras installed around the vehicle, each facing a different direction, so that the images taken by any two adjacent fisheye cameras have overlapping areas.
Four fisheye cameras are mounted on the experimental vehicle, each with a horizontal field of view of 180 degrees. The four fisheye images and the camera parameters are used as input to synthesize a 360-degree panoramic image around the vehicle, as shown in Figure 1b.
Appl. Sci. 2020, 10, x FOR PEER REVIEW
The generation of the AVM image is the process of mapping from the bird's-eye view coordinate system to the original fisheye image. The key step is to generate the mapping table $T_{B\to F}$, then to take a pixel point F out of the original fisheye image and fill it into the corresponding position B in the bird's-eye view image; this is done quickly and efficiently with OpenCL [42]. Figure 2 illustrates the mapping process and the related coordinate transformations of the bird's-eye view image.
Figure 2. Reverse mapping process of the bird's-eye image.
We divide the bird's-eye view into four areas, each corresponding to a mapping table $T_{B\to F}$, as shown in Figure 2. While traversing the bird's-eye view image, the coordinates of each pixel are mapped to the world coordinate system by the transformation matrix $P_{B\to W}$ and then projected from the world coordinate system to the undistorted image coordinate system by the homography matrix $P_{W\to U}$. Finally, the lookup table $T_{U\to F}$ maps a point on the undistorted image to a position on the original input fisheye image.
The mapping table $T_{B\to F}$ is obtained as follows. We create a bird's-eye view image with a width of $u$ pixels and a height of $v$ pixels, which displays a ground range $W$ mm wide and $H$ mm long in the world coordinate system, as shown in Figure 3. From this we get the transformation matrix $P_{B\to W}$.
The mapping table $T_{U\to F}$ is determined from the distortion coefficients and the parameters of the fisheye camera, which are obtained by Zhang's calibration method. Accordingly, we correct the original fisheye images to undistorted images based on $T_{U\to F}$. To obtain the homography matrix $P_{W\to U}$, we arrange a checkerboard around the experimental vehicle to obtain the feature points $X_w^i$ in the world coordinate system. Next, we manually select the corresponding feature points $X_u^i$ in the undistorted image. At least four feature points are required, as shown in Figure 4. Since $X_u^i = P_{W\to U} X_w^i$, the matrix $P_{W\to U}$ can be estimated by the least-squares method.
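The two matrices in this section can be illustrated with a short NumPy sketch. It is not the authors' implementation: the form of the pixel-to-world matrix (uniform scaling with the world origin at the image centre) is an assumption, because the source does not reproduce that matrix, and the function names `pixel_to_world_matrix`, `estimate_homography` and `apply_homography` are hypothetical. The homography is estimated with the direct linear transform (DLT), one common least-squares formulation.

```python
import numpy as np

def pixel_to_world_matrix(u, v, W, H):
    """Assumed form of P_{B->W}: scale a (u x v)-pixel image to a
    (W x H)-mm ground range and put the world origin at the image centre.
    The paper's exact matrix may use a different origin or axis direction."""
    sx, sy = W / u, H / v
    return np.array([[sx, 0.0, -W / 2.0],
                     [0.0, sy, -H / 2.0],
                     [0.0, 0.0, 1.0]])

def estimate_homography(world_pts, image_pts):
    """Least-squares (DLT) estimate of P_{W->U} from at least four
    world/undistorted-image point correspondences."""
    world_pts = np.asarray(world_pts, dtype=float)
    image_pts = np.asarray(image_pts, dtype=float)
    if len(world_pts) < 4 or len(world_pts) != len(image_pts):
        raise ValueError("need at least four matched point pairs")
    rows = []
    for (X, Y), (x, y) in zip(world_pts, image_pts):
        # Each correspondence contributes two linear constraints on h.
        rows.append([-X, -Y, -1.0, 0.0, 0.0, 0.0, x * X, x * Y, x])
        rows.append([0.0, 0.0, 0.0, -X, -Y, -1.0, y * X, y * Y, y])
    # The right singular vector of the smallest singular value minimises ||A h||.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    P = vt[-1].reshape(3, 3)
    return P / P[2, 2]

def apply_homography(P, pts):
    """Map 2D points through a 3x3 homography."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ P.T
    return homog[:, :2] / homog[:, 2:3]
```

With more than four checkerboard points, the same call averages out small errors in the manually selected image points, which is the reason for using a least-squares estimate rather than an exact four-point solution.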

Method for Detecting Parking Slots Based on Mask R-CNN
Mask R-CNN is a multi-task deep neural network that builds on Faster R-CNN and combines the ideas of FPN and fully convolutional networks (FCN) [31]. It not only detects the marking-points effectively, but also generates a high-quality segmentation mask for them.
Marking-points are defined as the cross-points of parking lines, which show a T or L shape, as marked by the red rectangles in Figure 5. The main steps of the proposed scheme are as follows. The first step is image preprocessing, which includes selecting the region of interest (ROI) and contrast enhancement. Next, the mask of marking-points is generated by Mask R-CNN and the guidelines and parallel lines are found on the mask using the LSD algorithm. Finally, the candidate parking slots are determined. Figure 6 shows the overall flow chart of the proposed scheme.

When the driver is driving towards the parking slots, marking-points appear on both sides of the vehicle, as shown in the blue rectangles in Figure 7. We define the ROI instead of detecting marking-points on the entire image, which reduces the computing resources required. In addition, the further a region is from the center of the AVM image, the more severe its distortion. For vertical parking slots in particular, some marking-points are blurred, as shown in the red rectangles in Figure 7. Blurred marking-points increase the difficulty of sample annotation. Therefore, it is difficult to detect all the markings based on the entire image.

Production of the Training Set
Since the orientation of a marking-point is arbitrary, we rotate each positive image patch by a set of angles $\{2\pi r_k\}_{k=1}^{K}$ to obtain a better training effect, where $r_k$ is a random number uniformly distributed over $(-1, 1)$ and $K$ is the number of rotations per sample. That is, if $N$ positive samples are labeled, we obtain $NK$ positive samples for training. Next, we enhance the contrast of the samples with the Laplace operator, which effectively highlights the edge information of marking-points when the light is blocked, as shown in Figure 9. The orientation of a marking-point is divided into two categories: R for orientations in $(-\pi/2, \pi/2)$ and L for orientations in the complementary range $(\pi/2, -\pi/2)$, as shown in Figure 10. This step helps to determine the orientation of the parking slot after the LSD algorithm has detected the parallel lines. Please refer to Section 4.3 for details.
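The rotation sampling and the two-way orientation labeling can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the angle is wrapped into $[-\pi, \pi)$ before labeling, and the boundary angles $\pm\pi/2$ are assigned to L, which is an assumption the source does not settle.

```python
import numpy as np

def rotation_angles(K, rng):
    """Draw K rotation angles 2*pi*r_k with r_k uniform on (-1, 1)."""
    r = rng.uniform(-1.0, 1.0, size=K)
    return 2.0 * np.pi * r

def orientation_label(theta):
    """Return 'R' for angles in (-pi/2, pi/2) after wrapping, otherwise 'L'.
    Assumption: the boundary angles +-pi/2 are labeled 'L'."""
    wrapped = (theta + np.pi) % (2.0 * np.pi) - np.pi  # wrap into [-pi, pi)
    return "R" if -np.pi / 2 < wrapped < np.pi / 2 else "L"

def augmented_count(N, K):
    """N labeled positive samples, each rotated K times, give N*K samples."""
    return N * K
```

For example, `rotation_angles(8, np.random.default_rng(0))` yields eight angles, and each rotated patch can be relabeled with `orientation_label` applied to its new orientation.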

Build Mask R-CNN Training Model
Mask R-CNN is a two-stage framework. In the first stage, the backbone network of Mask R-CNN (ResNet101 and FPN) extracts and combines the feature maps of the AVM image. The region proposal network (RPN) proposes anchors for marking-points, and the anchors with high scores are then selected by filtering and correction. In the second stage, the category and position of each anchor are predicted and the corresponding mask of marking-points is generated. The overall network structure is shown in Figure 13.

The two stages are described in detail as follows. ResNet101 was selected to extract the features of marking-points. We took four feature maps at different scales ($C_1$, $C_2$, $C_3$, $C_4$) output by the residual blocks. These feature maps were used to build the feature pyramid of the FPN and obtain the new features ($P_2$, $P_3$, $P_4$, $P_5$, $P_6$). The feature combination process is given by Equation (2), which is evaluated for $i = 5, 4, 3, 2$ with $U_6 = 0$. In it, sum represents the element-wise addition of aligned feature maps, conv represents the convolution operation, upsample represents the upsampling operation that doubles the length and width of the feature map, and pooling represents the max pooling operation with a stride of 2.
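For orientation, the operators sum, conv, upsample and pooling named above match the standard FPN top-down pathway. One form consistent with those definitions is sketched below. It is an assumption rather than the paper's own Equation (2): the symbol $T_i$ (the merged map at level $i$) is introduced here for illustration, and the backbone outputs are indexed $C_2$ to $C_5$ as in the original FPN formulation so that they line up with $i = 5, 4, 3, 2$.

```latex
\begin{aligned}
T_i &= \mathrm{sum}\bigl(U_{i+1},\ \mathrm{conv}(C_i)\bigr), \\
U_i &= \mathrm{upsample}(T_i), \\
P_i &= \mathrm{conv}(T_i), \qquad i = 5, 4, 3, 2, \quad U_6 = 0, \\
P_6 &= \mathrm{pooling}(P_5).
\end{aligned}
```

Read top-down, the coarsest level has nothing above it ($U_6 = 0$), each finer level adds the upsampled map from the level above to its own laterally convolved backbone feature, and $P_6$ is obtained by max pooling $P_5$ with a stride of 2.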

Anchors of marking-points are proposed by the region proposal network (RPN). A sliding window is moved over the five feature maps ($P_2$, $P_3$, $P_4$, $P_5$, $P_6$). After the sliding operation, a $2k$-dimensional classification output and a $4k$-dimensional position output are regressed to describe the correction values of $k$ anchors. The correction values of each anchor include $\Delta x$, $\Delta y$, $\Delta h$, $\Delta w$ and $P$, where $P$ is the confidence of foreground and background. Equation (3) is the correction formula for anchors:

$$x' = x + w\,\Delta x, \qquad y' = y + h\,\Delta y, \qquad w' = w\,e^{\Delta w}, \qquad h' = h\,e^{\Delta h} \tag{3}$$

where $(x, y)$ is the coordinate of the anchor center point, $w$ and $h$ are the width and height of the anchor, respectively, and $x'$, $y'$, $w'$, $h'$ are the corresponding corrected values. The correction process is shown in Figure 14a. After a large number of anchors have been corrected, non-maximum suppression (NMS) [31] is used to retain the anchors with high foreground scores and pass them to the next stage, as shown in Figure 14b.
To map the ROIs of differently sized anchors to a feature map of fixed size, ROI Align selects four regular sampling positions in each ROI bin, computes the exact value at each position by bilinear interpolation and aggregates the results. Finally, the uniform ROIs are classified, the positions of the anchors are regressed and the mask of each marking-point is generated by the FCN, as shown in Figure 15a,b. Figure 16 shows the AVM image with marking-points detected by Mask R-CNN.
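The anchor correction and the subsequent suppression step can be sketched in NumPy. The decoding below uses the standard Faster R-CNN parameterization (centre shifts scaled by anchor size, exponential size scaling); the function names and the IoU threshold of 0.5 are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def decode_anchors(anchors, deltas):
    """anchors: (n, 4) rows of (x, y, w, h) with (x, y) the centre;
    deltas: (n, 4) rows of (dx, dy, dw, dh). Returns corrected boxes."""
    anchors = np.asarray(anchors, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    x, y, w, h = anchors.T
    dx, dy, dw, dh = deltas.T
    return np.stack([x + w * dx, y + h * dy,
                     w * np.exp(dw), h * np.exp(dh)], axis=1)

def to_corners(boxes):
    """Convert (x, y, w, h) centre format to (x1, y1, x2, y2)."""
    x, y, w, h = np.asarray(boxes, dtype=float).T
    return np.stack([x - w / 2, y - h / 2, x + w / 2, y + h / 2], axis=1)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns kept indices, best first."""
    corners = to_corners(boxes)
    areas = (corners[:, 2] - corners[:, 0]) * (corners[:, 3] - corners[:, 1])
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(corners[i, 0], corners[rest, 0])
        y1 = np.maximum(corners[i, 1], corners[rest, 1])
        x2 = np.minimum(corners[i, 2], corners[rest, 2])
        y2 = np.minimum(corners[i, 3], corners[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]
    return keep
```

In a typical use, `decode_anchors` is applied to all RPN outputs first and `nms` is then run on the corrected boxes with their foreground scores, so that only the strongest non-overlapping proposals reach the second stage.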

Parking-Slot Inference Based on Marking-Points
Once the marking-points have been detected (a parking slot normally corresponds to two marking-points), as shown in Figure 17b, the following steps infer the parking slots. First, we apply erosion and dilation operations to the mask of marking-points, as shown in Figure 17c. Then, lines are detected on the mask with the LSD algorithm, as shown in Figure 17d. Line segments whose length does not reach a certain threshold are removed. Similar lines are merged into one line by averaging their slopes and intercepts, as shown in Figure 17e, which gives the guidelines and parallel lines. Each parallel line can point in two opposite directions. The orientation of the parking slot is identified from the classification result of the marking-point, left (L) or right (R). Finally, the "depth" of the parking slots is determined by prior knowledge.
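The length filtering and line merging steps can be sketched as follows. The thresholds (`min_length`, `slope_tol`, `intercept_tol`) and the function names are hypothetical, since the paper does not state its values, and the slope-intercept form assumes non-vertical segments; a production version would need a representation that also handles vertical lines.

```python
import numpy as np

def segment_length(seg):
    """Euclidean length of a segment (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = seg
    return float(np.hypot(x2 - x1, y2 - y1))

def slope_intercept(seg):
    """Slope k and intercept b of y = k*x + b (non-vertical segments only)."""
    x1, y1, x2, y2 = seg
    k = (y2 - y1) / (x2 - x1)
    return k, y1 - k * x1

def merge_similar_lines(segments, min_length=20.0,
                        slope_tol=0.1, intercept_tol=10.0):
    """Drop short segments, then merge lines whose slope and intercept are
    both close to a group's running mean by averaging them."""
    lines = [slope_intercept(s) for s in segments
             if segment_length(s) >= min_length]
    groups = []
    for k, b in lines:
        for group in groups:
            gk, gb = np.mean(group, axis=0)
            if abs(k - gk) <= slope_tol and abs(b - gb) <= intercept_tol:
                group.append((k, b))
                break
        else:
            groups.append([(k, b)])  # no similar group found, start a new one
    return [tuple(float(v) for v in np.mean(g, axis=0)) for g in groups]
```

The two edges of one painted line are usually detected as two nearly identical segments, so averaging them yields a single centre line per marking.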


Training Platform and Selection of Pre-Training Model
The proposed method is written in Python 3.7, and training and testing are completed under graphics processing unit (GPU) acceleration. TensorFlow is used as the framework, and the computer is configured with a GTX 1070 Ti graphics card with 8 GB of memory. Training the Mask R-CNN model requires a lot of time and dozens of training samples to reach a higher level of accuracy. Therefore, to improve the detection accuracy of the model, we use transfer learning and take the weights obtained from the official COCO 2014 dataset [44] as the pre-training model of the marking-point detection algorithm.

Evaluating the Performance of Marking-Point Detection
We collected a total of 4500 AVM images. Following other common datasets [45,46], we split them into training, validation and test sets at a ratio of 6:1:3. The numbers of images for training, validation and testing are therefore 2700, 450 and 1350, respectively.
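The split sizes follow directly from the ratio. A minimal check, with a hypothetical helper name, is:

```python
def split_counts(total, ratios=(6, 1, 3)):
    """Number of images per subset for a given total and integer ratio."""
    unit = total / sum(ratios)
    return tuple(int(round(unit * r)) for r in ratios)

train, val, test = split_counts(4500)
```

For 4500 images, one ratio unit is 450 images, which gives 2700, 450 and 1350 images for training, validation and testing.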
The mAP is the mean average precision [31], as shown in Figure 18.When the iterations are less than 20,000, the mAP of the four methods is 0.82 lower; as the iterations increase, the overall mAP of Mask R-CNN and Faster R-CNN is higher than that of SSD [47] and Yolo v1 [48].The main reason is that there are differences in the mechanism of generating initial anchors between the above methods.
Mask R-CNN and Faster R-CNN use the RPN network to generate scores of anchors on feature image, while Yolo v1 and SSD divide a certain number of grids on the original image to directly regress, which has obvious advantages and disadvantages.The detection effect of small-sized marking-points

Training Platform and Selection of Pre-Training Model
The proposed method is written in Python 3.7, training and testing are completed under baiGraphic Processing Unit (GPU) acceleration.Tensorflow is used as the basis for construction, the computer is configured as a GTX1070TI with an 8 GB memory graphics card, considering that the training of the Mask R-CNN model requires a lot of time and dozens of training samples for higher level of accuracy.To improve the detection accuracy of the model, we use the transfer learning method to take the weight model obtained from the official coco2014 dataset [44] as the pre-training model of the marking-point detection algorithm.

Evaluating the Performance of Marking-Point Detection
We collected a total of 4500 AVM images. For the split into training, validation and test sets, we follow other common datasets [45,46] and use a ratio of 6:1:3. In detail, the numbers of images for training, validation and testing are 2700, 450 and 1350, respectively.
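The 6:1:3 split can be sketched as follows. This is a minimal illustration only: the paper does not state how images were assigned to each subset, so the random shuffle, the fixed seed and the function name `split_dataset` are assumptions.

```python
import random


def split_dataset(items, ratios=(6, 1, 3), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    # Assumption: a seeded random shuffle; the paper does not describe the assignment.
    random.Random(seed).shuffle(items)
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Image indices stand in for the 4500 AVM images.
train, val, test = split_dataset(range(4500))
print(len(train), len(val), len(test))  # 2700 450 1350
```

With 4500 images, the three subsets have 2700, 450 and 1350 images, matching the numbers reported above.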
Figure 18 shows the mean average precision (mAP) [31] of the four methods. When the number of iterations is less than 20,000, the mAP of all four methods is below 0.82; as the iterations increase, the overall mAP of Mask R-CNN and Faster R-CNN becomes higher than that of SSD [47] and YOLO v1 [48]. The main reason is that the methods differ in the mechanism used to generate the initial anchors. Mask R-CNN and Faster R-CNN use the RPN to score anchors on the feature map, while YOLO v1 and SSD divide the original image into a fixed number of grid cells and regress directly, which has obvious advantages and disadvantages: the detection of small-sized marking-points is poor, but the detection speed of SSD and YOLO v1 reaches 18 fps and 22 fps, respectively, as shown in Table 1.
The mAP of SSD is maintained at about 0.82. Compared with YOLO v1, SSD divides the original image into grids at multiple scales for regression, which performs better in detecting small marking-points; hence, the mAP of SSD is slightly better than that of YOLO v1. Mask R-CNN uses ResNet-101 instead of the VGG-16 used in Faster R-CNN to extract the features of marking-points, which improves its ability to extract and recognize image features. Therefore, the mAP of Mask R-CNN is always higher than that of Faster R-CNN during the entire training. However, due to the addition of the FCN branch, which is used to generate the mask of marking-points, the detection speed of Mask R-CNN (2 fps) is lower than that of Faster R-CNN (7 fps), as shown in Table 1.
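For readers unfamiliar with the metric, the following sketch shows one common way of computing average precision for a single class and averaging it over classes. The exact evaluation protocol of [31] is not reproduced in this section, so the all-point interpolation used here, the function names and the toy inputs are assumptions rather than the procedure behind Figure 18.

```python
def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one class (all-point interpolation)."""
    # Rank detections by decreasing confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum the rectangles under the resulting step curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)


# Hypothetical toy example: three detections, two ground-truth marking-points.
ap = average_precision([0.9, 0.8, 0.7], [True, False, True], num_ground_truth=2)
print(round(ap, 4))
```

Whether a detection counts as a true positive (for example, via an IoU threshold) is decided before this step and is not shown here.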

Evaluating the Performance of Parking-Slot Detection
To facilitate the study of vision-based parking-slot detection, Zhang et al. [16] established a dataset of AVM images [49] and made it publicly available to the research community. Previous approaches used a single type of AVM image for training and testing, but the quality of the fisheye image stitching algorithm affects the detection of marking-points. As shown in Figure 19, the clarity and stitching effect of Figure 19a are significantly better than those of Figure 19b. To test the generalizability of the proposed approach, we divided the training samples into three groups: groups 1 and 2 each had 4000 AVM images, from our experimental vehicle and the public dataset [49], respectively, and group 3 was formed from 2000 AVM images randomly selected from each of groups 1 and 2. The initial learning rate was set to 0.0001 and the number of training epochs to 300; the corresponding Mask R-CNN models M1, M2 and M3 were obtained by training on the three groups, respectively.
Table 2 shows the slot-detection performance of the different methods. The test dataset consists of 256 AVM images from the experimental vehicle and the public dataset [49], which include various types of parking slots under complex illumination. Precision, recall, accuracy and F1 scores in Formulas (5)-(8) are used as performance measures. Compared with the traditional corner detector and line detection, the three Mask R-CNN models perform better; the accuracy, precision, recall and F1 score of M3 are 93.4%, 94.5%, 92.7% and 93.5%, respectively. The main reason is that Mask R-CNN has a strong ability to extract the image features of marking-points, especially in underground parking slots or at night, where the light is dim or reflected light is present. The accuracy of the method in Zhang et al. [16] was lower than that of M3, at only 88.8%, because parking slots are classified as free space if the contrast between the slot marking and the ground is not pronounced. In addition, that method can detect only two parking types: perpendicular and parallel parking. Faster R-CNN cannot detect the four vertices of a parking slot when the slot is not completely visible in the AVM image, which is often the case, and lane lines are often mistaken for parking lines; therefore, its accuracy is only 82.2%. Corner features cannot always provide precise target positions and heading angles, and it is difficult for a corner feature to remain robust. Meanwhile, the line-based method provides an accurate direction and has fewer constraints than corner-based methods. Therefore, the accuracy and F1 score of the line-based method are 7.3% and 4.3% higher than those of the corner-based method.
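Formulas (5)-(8) are not reproduced in this section. The sketch below assumes they are the standard definitions of precision, recall, accuracy and F1 score in terms of true/false positives and negatives; the function names and the example counts are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def detection_metrics(tp, fp, tn, fn):
    """Standard classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # correct detections among all detections
    recall = tp / (tp + fn)                     # correct detections among all true slots
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct decisions among all decisions
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1_score(precision, recall),
    }


# Hypothetical counts, not taken from Table 2.
print(detection_metrics(tp=8, fp=2, tn=6, fn=4))

# F1 recomputed from the precision and recall reported for M3.
print(round(f1_score(0.945, 0.927), 3))
```

Under these definitions, the reported precision (94.5%) and recall (92.7%) of M3 give an F1 score of about 0.936, which agrees with the reported 93.5% up to rounding of the inputs.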
All training images of M2 contain obvious splice marks that affect the feature extraction of marking-points, as shown in Figure 19b. The model is prone to false detections against a background of interlaced light and shade, which means that the number of false positives in M2 was higher than that in M1 and M3 during the test. Therefore, the recall of M2 is slightly lower than that of M1 and M3. Since M1 and M2 use only a single source of AVM images for training, the precision of M1 and M2 is 5.2% and 3.3% lower than that of M3, respectively, indicating that increasing the diversity of training samples can improve the generalization ability of Mask R-CNN. It also shows that the proposed method can detect parking slots in different around-view parking-assist systems.
AVM images with parking slots detected by Mask R-CNN are shown in Figure 21. Figure 21a,b shows slanted parking slots with different tilt angles, and Figure 21c,d shows vertical parking slots; the mask of marking-points accurately separates the guideline from the adjacent lane line. Figure 21e,f shows outdoor parking slots at night, and Figure 21g-i shows indoor parking slots. From the above test samples, we conclude that the proposed method has a strong capability for accurately detecting different types of parking slots under various conditions.

Conclusions and Future Research
To detect parking slots, many previous studies extracted primitive features of the parking-slot markings, for instance with corner detectors and line-based methods. Although these methods detect slot markings moderately well in bright light, they are rather ineffective in the presence of shadows or faint lines and corners. The method in Zhang et al. [16] increases the precision and recall to 86.9% and 87.3%, but it cannot detect slanted parking slots and needs to train multiple marking-point detectors.
Compared with Faster R-CNN, the proposed method adds the FCN branch to generate the mask of marking-points, which determines the direction of the parking slots and accurately separates the guideline from the adjacent lane line. Moreover, the proposed method was evaluated in various parking conditions and has higher precision (94.5%) and recall (92.7%).
We applied the deep-learning-based Mask R-CNN to AVM images, for which a large number of AVM images were generated and collected. The experimental results showed that the trained model had strong adaptability and could accurately find the marking-points in AVM images with different stitching effects and separate the guideline from the adjacent lane line. This suggests that expanding the training set could allow the trained model to recognize the vast majority of parking-slot types. The main advantage of this method over the current state of the art for parking-slot detection is its

As shown in Figure 4, P_{W→U} can be estimated based on the least-squares method.
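A least-squares estimate of this kind can be sketched as follows, assuming that P_{W→U} is a 3×3 planar homography from ground-plane world coordinates to the undistorted image and that its last entry is fixed to 1. These assumptions, the function names and the example points are illustrative and are not taken from the paper. The normal equations are solved with a small Gaussian elimination so that the example needs no external libraries.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def estimate_homography(world_pts, image_pts):
    """Least-squares homography (h33 = 1) from >= 4 point correspondences."""
    rows, rhs = [], []
    for (X, Y), (u, v) in zip(world_pts, image_pts):
        # u * (h31*X + h32*Y + 1) = h11*X + h12*Y + h13, and likewise for v.
        rows.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y])
        rhs.append(u)
        rows.append([0, 0, 0, X, Y, 1, -v * X, -v * Y])
        rhs.append(v)
    # Normal equations (A^T A) h = A^T b of the overdetermined system A h = b.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(8)] for i in range(8)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(8)]
    h = solve_linear(ata, atb)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def project(H, X, Y):
    """Map a world point (X, Y) to image coordinates (u, v) with homography H."""
    w = H[2][0] * X + H[2][1] * Y + H[2][2]
    u = (H[0][0] * X + H[0][1] * Y + H[0][2]) / w
    v = (H[1][0] * X + H[1][1] * Y + H[1][2]) / w
    return u, v


# Hypothetical feature points: world coordinates and their image positions.
H_true = [[2.0, 0.1, 5.0], [0.2, 1.5, -3.0], [0.001, 0.002, 1.0]]
world = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 2), (3, 8)]
image = [project(H_true, X, Y) for X, Y in world]
H_est = estimate_homography(world, image)
print(project(H_est, 4, 4), project(H_true, 4, 4))
```

With more than four correspondences the system is overdetermined, which is where the least-squares formulation helps to average out errors in the selected feature points.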

Figure 3. Transformation relationship between the bird's-eye view coordinate system and world coordinate system.

Figure 4. Calibration transformation from the world coordinate system to the undistorted image. (a) Original fisheye image; the green circles are the feature points selected in the world coordinate system. (b) Undistorted image; the red circles are the corresponding feature points in the undistorted image.

Figure 6. Overall flow chart of the proposed scheme.

Blurred marking-points increase the difficulty of sample annotation. Therefore, it is difficult to detect all the markings based on the entire image. The specific ROI parameters are shown in Figure 8, where L and W represent the length and width of the ROI region, and L1 and W1 represent the length and width of the vehicle. The geometric constraints are L = L1 + 2 × L2 and W = W1 + 2 × W2.
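The geometric constraints can be written as a small helper. L2 and W2 are not defined in the text above; the sketch assumes they are the margins kept around the vehicle along its length and width, and the numeric values are hypothetical.

```python
def roi_size(vehicle_length, vehicle_width, margin_length, margin_width):
    """Return (L, W) with L = L1 + 2 * L2 and W = W1 + 2 * W2."""
    roi_length = vehicle_length + 2 * margin_length
    roi_width = vehicle_width + 2 * margin_width
    return roi_length, roi_width


# Hypothetical dimensions in centimetres: a 480 x 180 vehicle with
# margins of 200 (along the length) and 300 (along the width).
print(roi_size(480, 180, 200, 300))  # (880, 780)
```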

Figure 8. Definition of the region of interest (ROI).

Appl. Sci. 2020, 10, x FOR PEER REVIEW 8 of 18

The annotation information is saved as a JSON file, as shown in Figure 11. The generated JSON files storing annotation information are converted into the training images for Mask R-CNN, as shown in Figure 12.
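The conversion from a JSON annotation to a training mask can be illustrated as follows. The annotation tool and its schema are not specified in this section, so the field names used here (`imageHeight`, `imageWidth`, `shapes`, `points`) are a hypothetical polygon-style layout, and the pure-Python rasterizer is only a sketch of the idea.

```python
import json


def point_in_polygon(x, y, polygon):
    """Ray-casting test: is point (x, y) inside the polygon [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge crosses the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def annotation_to_mask(annotation_json):
    """Rasterize every annotated polygon into a binary mask (list of rows)."""
    ann = json.loads(annotation_json)
    height, width = ann["imageHeight"], ann["imageWidth"]
    mask = [[0] * width for _ in range(height)]
    for shape in ann["shapes"]:
        polygon = shape["points"]
        for row in range(height):
            for col in range(width):
                # Test the pixel centre against the polygon.
                if point_in_polygon(col + 0.5, row + 0.5, polygon):
                    mask[row][col] = 1
    return mask


# Hypothetical annotation of one marking-point on a 6 x 6 image.
example = json.dumps({
    "imageHeight": 6,
    "imageWidth": 6,
    "shapes": [{"label": "marking_point",
                "points": [[1, 1], [4, 1], [4, 4], [1, 4]]}],
})
for row in annotation_to_mask(example):
    print(row)
```

A production pipeline would normally rely on an image library for rasterization; the explicit loop is used here only to keep the example self-contained.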

Figure 11. The annotation of the AVM image.

Figure 12. Typical training samples. (a,d) Parallel and vertical parking slots, respectively; (b,e) segmentation of marking-points; (c,f) gray images with the mask of marking-points.

Figure 14. Anchors generation and filtering. (a) The dashed lines indicate the top 100 anchors with the highest foreground score recommended by RPN, and the solid lines represent the corresponding anchors corrected by the regression information; (b) after non-maximum suppression (NMS) filtering.

To map ROIs from anchors of different sizes to a feature map of fixed size, ROI Align selects four regularly spaced sampling positions in each ROI bin, uses bilinear interpolation to calculate the exact value at each position and then aggregates the results. Finally, the uniform ROIs are classified, the positions of the anchors are regressed and the mask of each marking-point is generated by the FCN, as shown in Figure 15a,b. Figure 16 shows the AVM image with marking-points detected by Mask R-CNN.
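The ROI Align step described above can be sketched on a single-channel feature map. This is a simplified illustration, not the implementation used in the paper: the 2 × 2 output size, the 2 × 2 sampling grid per bin and the use of averaging to aggregate the four samples are assumptions.

```python
import math


def bilinear_sample(feature, x, y):
    """Bilinearly interpolate a 2D feature map (list of rows) at continuous (x, y)."""
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size=2, samples=2):
    """Pool an ROI (x_min, y_min, x_max, y_max) into an out_size x out_size grid."""
    x_min, y_min, x_max, y_max = roi
    bin_w = (x_max - x_min) / out_size
    bin_h = (y_max - y_min) / out_size
    output = []
    for by in range(out_size):
        out_row = []
        for bx in range(out_size):
            values = []
            # samples x samples regularly spaced points inside the bin (4 by default).
            for sy in range(samples):
                for sx in range(samples):
                    px = x_min + (bx + (sx + 0.5) / samples) * bin_w
                    py = y_min + (by + (sy + 0.5) / samples) * bin_h
                    values.append(bilinear_sample(feature, px, py))
            out_row.append(sum(values) / len(values))  # assumed aggregation: average
        output.append(out_row)
    return output


# Toy feature map whose value is x + 10 * y, so results are easy to check.
feature_map = [[x + 10 * y for x in range(4)] for y in range(4)]
print(roi_align(feature_map, (0.0, 0.0, 2.0, 2.0)))
```

Because no coordinate is rounded to the nearest cell, the pooled values vary smoothly with the ROI position, which is the property that makes ROI Align suitable for pixel-level masks.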

Figure 17. Detection process of parking guidelines and parallel lines. (a) AVM image; (b) mask of marking-points; (c) after corroding and dilating; (d) line detection; (e) the guideline and parallel line.
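Step (c) of Figure 17, corroding followed by dilating, corresponds to a morphological opening of the binary mask. The sketch below illustrates it on a small mask; the 3 × 3 square structuring element and the treatment of the image border as background are assumptions, and the subsequent LSD line detection is not shown.

```python
NEIGHBOURHOOD = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def erode(mask):
    """A pixel stays set only if its whole 3 x 3 neighbourhood is set."""
    height, width = len(mask), len(mask[0])
    return [[int(all(0 <= y + dy < height and 0 <= x + dx < width
                     and mask[y + dy][x + dx]
                     for dy, dx in NEIGHBOURHOOD))
             for x in range(width)]
            for y in range(height)]


def dilate(mask):
    """A pixel becomes set if any pixel in its 3 x 3 neighbourhood is set."""
    height, width = len(mask), len(mask[0])
    return [[int(any(0 <= y + dy < height and 0 <= x + dx < width
                     and mask[y + dy][x + dx]
                     for dy, dx in NEIGHBOURHOOD))
             for x in range(width)]
            for y in range(height)]


def opening(mask):
    """Erosion followed by dilation: removes specks smaller than the element."""
    return dilate(erode(mask))


# Toy mask: a 3 x 3 marking-point blob plus one isolated noise pixel.
noisy = [[0] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        noisy[r][c] = 1
noisy[0][6] = 1
for row in opening(noisy):
    print(row)
```

In the toy example the isolated pixel is removed while the 3 × 3 blob is restored to its original extent.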

Figure 19. AVM images with different stitching effects. (a) From our experimental vehicle; (b) from the public dataset [49].

This paper trains the Mask R-CNN model by reducing the loss function between the predicted value and the real label. The loss function of Mask R-CNN is defined as follows:

$$\mathrm{loss} = L_{cls} + L_{box} + L_{mask} \tag{4}$$

where $L_{cls}$ indicates the classification error, $L_{box}$ is the bounding-box regression error and $L_{mask}$ indicates the mask error. Figure 20 shows the loss during the entire training. It can be seen that the three models converge to 0.05 when they are trained to the 300th epoch, indicating that the proposed method for detecting marking-points based on Mask R-CNN has a certain robustness.

Figure 20. Loss of Mask R-CNN models with different training datasets.

Figure 21. AVM images with parking slots detected by Mask R-CNN. (a-f) From the experimental vehicle; (g-i) from the public dataset [49].
The generation of the AVM image is the process of mapping from the bird's-eye view coordinate system to the original fisheye image. The key step is to generate the mapping table T_{B→F}, and then to take a pixel point F out of the original fisheye image and fill it into the corresponding position B in the bird's-eye view image, which is done quickly and efficiently based on OpenCL [42]. Figure 2 explains the mapping process and the related coordinate transformations of the bird's-eye view image.
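The role of the mapping table can be sketched in a few lines. The paper performs this step with OpenCL; the Python version below only illustrates the lookup logic, and the function names, the nearest-pixel lookup and the toy mapping (a horizontal flip standing in for the real bird's-eye-to-fisheye transformation) are assumptions.

```python
def build_mapping_table(height, width, bird_to_fisheye):
    """Precompute, for every bird's-eye pixel B, the fisheye pixel F to read."""
    return [[bird_to_fisheye(row, col) for col in range(width)]
            for row in range(height)]


def remap(fisheye, table, fill=0):
    """Fill each bird's-eye pixel from the fisheye image via the mapping table."""
    f_height, f_width = len(fisheye), len(fisheye[0])
    output = []
    for table_row in table:
        out_row = []
        for f_row, f_col in table_row:
            if 0 <= f_row < f_height and 0 <= f_col < f_width:
                out_row.append(fisheye[f_row][f_col])
            else:
                out_row.append(fill)  # the source pixel falls outside the fisheye image
        output.append(out_row)
    return output


# Toy example: a 2 x 3 "fisheye" image and a stand-in mapping (horizontal flip).
fisheye_image = [[1, 2, 3],
                 [4, 5, 6]]
table = build_mapping_table(2, 3, lambda row, col: (row, 2 - col))
print(remap(fisheye_image, table))  # [[3, 2, 1], [6, 5, 4]]
```

Since the table depends only on the camera calibration, it can be computed once and reused for every frame, which is what makes the per-frame mapping cheap.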

Table 1. Marking-points detection speed among different methods.

Table 2. Slot-detection performance among different methods.