Convolutional Neural Network for Crowd Counting on Metro Platforms

Abstract: Owing to the increased use of urban rail transit, the flow of passengers on metro platforms tends to increase sharply during peak periods. Monitoring passenger flow in such areas is important for security-related reasons. In this paper, to solve the problem of passenger flow detection on metro platforms, we propose a CNN (convolutional neural network)-based network called the MP (metro platform)-CNN to accurately count people on metro platforms. The proposed method is composed of three major components: a convolutional front end is used to extract image features, a multiscale feature extraction module is used to enhance multiscale features, and transposed convolution is used for upsampling to generate a high-quality density map. Existing crowd-counting datasets do not adequately cover all of the challenging situations considered in this study. Therefore, we collected images from surveillance videos of a metro platform to form a dataset containing 627 images with 9243 annotated heads. Extensive experiments showed that our method performed well on the self-built dataset, with the lowest estimation error among the compared methods. Moreover, the proposed method is competitive with other methods on four standard crowd-counting datasets.


Introduction
Owing to the rapid development of urban rail transit, the lines of operation are expanding, passenger flow continues to increase [1], and rail operators face daunting safety-related challenges in this context. Crowd density in metro stations increases sharply in peak periods of travel. As a large crowd gathers at metro stations and passenger flows increase, the risk of stampedes increases. Therefore, it is important to analyze passenger flow by monitoring videos of the metro platform, analyzing their content, and identifying abnormalities using computer vision and artificial intelligence [2,3]. According to information on real-time passenger flows and crowd densities in different areas, people on a platform can be guided to avoid stampedes, improving the security and efficiency of metro stations.
A considerable amount of research has been conducted on analyzing the flow of passengers through metro stations based on surveillance videos [2,4,5]. In [2], passenger flow in a given target area was detected using the background difference method, but this method cannot be used to count the number of passengers on the metro platform: background difference is more suitable for detecting continuously moving objects, whereas passengers waiting on the metro platform are mostly stationary. The authors of [4] proposed a strategy to detect passenger flow on a metro platform based on the bodies of the passengers. This method performs well in sparse scenes, but the metro platform is highly crowded at times, as shown in Figure 1. In such cases, images captured by the camera feature significant occlusions that cause this method to miss some targets and incorrectly identify others. In [5], the authors proposed a crowd monitoring approach for metro platforms that uses an improved mixture of Gaussians background model to segment the crowd and then counts people by linear regression. This method regards the crowd as a whole and uses the regression relationship between features of the image and the crowd to count the passengers. It can handle occlusion and count people in a large crowd. However, its accuracy is low owing to the limited information about individuals provided by crowd-level features. Crowd counting methods aim to estimate the number of people in surveillance videos or a single image. They can be used in a variety of scenarios, such as political assemblies, sports events, and concerts, to ensure public safety by monitoring crowd density.
Currently available crowd counting methods fall into detection-based [4,6-8], regression-based [5,9,10], and convolutional neural network (CNN)-based [11-16] approaches. As CNN-based methods use the human head as the target of detection, the error caused by occlusion is reduced. Therefore, CNN-based crowd counting methods are more suitable for use on metro platforms.
The presence of screen doors, elevators, and other small facilities on metro platforms, as well as changes in lighting, can cause severe occlusion and reflection problems in surveillance videos captured by monitoring cameras. These problems seriously affect the accuracy of crowd counting on metro platforms. To address them, we propose a convolutional neural network for crowd counting called the MP-CNN. The proposed architecture uses VGG-16 [17] as the front-end network for feature extraction. The VGG is known to have excellent feature extraction capability and strong transfer learning ability on classification tasks. It also has a flexible architecture, which makes it easy to connect to the back-end network and generate a density map. Inspired by the work in [18], we also introduce a multiscale feature extraction module to enhance the multiscale feature extraction capability and expand the receptive field of the network. This can improve feature extraction in remote areas of a long and narrow metro platform. We then use a set of transposed convolutions for upsampling, instead of bilinear interpolation, to restore the feature map to its original size and generate a high-quality density map.
We also developed a dataset for this study, which we call the Metro Platform dataset. It contains 627 images with a total of 9243 annotated people. The data were collected from a surveillance video camera on a metro platform and cover peak and normal periods on weekdays and weekends. Owing to the long and narrow shape of the platform and the angle of the surveillance camera, the degrees of crowding and occlusion vary across the images. Representative images of the proposed dataset are shown in Figure 1.
The main contributions of our work are as follows:
First, for the sake of public safety and to help avoid stampede accidents, we propose a convolutional neural network called the MP-CNN for accurate crowd counting on metro platforms.
Second, the proposed method, with its multiscale feature extraction module, adapts better than other methods to environments with severe occlusion and reflection.
Third, we developed the Metro Platform dataset, whose images were gathered from a video stream of a metro station. The dataset contains scenes with different degrees of congestion for analysis in the field of intelligent transportation.
The results of experiments on four benchmarks show that our method can compete with state-of-the-art crowd counting methods.

Related Work
In recent years, a growing number of studies have considered the problem of crowd counting and proposed algorithms to deal with this task. They can be broadly categorized into traditional methods and CNN-based methods.

Traditional Approach
Early work on crowd counting focused on detection-based methods [6,19-22]. Some of them treated the crowd as a group of individually detected pedestrians and counted people by a simple process of detection and summation. Others tackled crowd counting as an object detection problem and used the body, or parts of it, to locate people in images of crowds in order to count them. However, in scenes of dense crowds, these detection-based methods were limited by serious occlusion and background clutter. To handle images of highly congested scenes, regression-based approaches [9,10,23,24] were proposed. They learn a mapping from features of the image either to density maps or directly to the number of objects. Using a similar approach, Idrees et al. [24] proposed a method that fuses features extracted by Fourier analysis, head detection, and the scale-invariant feature transform in local patches. These regression-based methods can predict the global number of people in a crowd but ignore the spatial information in images. A comprehensive survey of these early studies can be found in [25].

CNN-Based Approach
Various CNN-based methods have been proposed and have achieved remarkable success in crowd counting tasks. A majority of them address the large scale variations found in images of crowds. The authors of [26,27] have summarized the previously proposed CNN-based methods for crowd counting. To cope with the large scale variation in scenes of crowds, Zhang et al. [12] proposed a simple and effective multicolumn structure that extracts features with different kernel sizes. Similarly, in [28], Onoro and Sastre proposed a multiscale model, hydra-CNN, to extract image features at different scales. Cao et al. [13] proposed an encoder-decoder network called SANet that employs scale aggregation modules as an encoder. This method can improve representation capability and the diversity of feature scales. Recently, Wang et al. [15] designed a network called SFCN to encode spatial contextual information based on the VGG-16 [17] and ResNet-101 [29]. The problem of scale variation can also be addressed by techniques such as dilated kernels [14], multiscale pooling layers [30], multiple decoding paths [31], and multiscale bottom-up and top-down feature fusion [32].
The above studies show that CNN-based solutions can outperform traditional methods of crowd counting. We thus propose a CNN-based network, with pooling layers and dilated convolution [14], to solve the crowd counting problem on metro platforms.

Proposed Method
In this section, we first introduce the architecture of the proposed convolutional neural network for crowd counting on metro platforms (MP-CNN), as shown in Figure 2. We then discuss the multiscale feature extraction module (MFEM) and the method for generating ground truth. Finally, we describe details of the training of the proposed method.

Architecture
We use the first 13 layers of the VGG-16 [17], which use only 3 × 3 convolution kernels, as the front-end network for feature extraction. We chose the VGG as the front end for two reasons. On the one hand, it has excellent feature extraction capability and a strong transfer learning ability for classification tasks; on the other hand, the VGG has a flexible architecture, which makes it easy to connect to the back-end network to generate a density map. After a series of convolution layers and pooling layers in the front-end network, the size of the output feature map is 1/8 of the original input. If we continued to stack more convolution layers and pooling layers, the size of the output feature map would be further reduced, and it would become difficult to generate a high-quality density map. Therefore, after processing at the front end, we introduced the MFEM, which can extract deeper information while maintaining the resolution of the output density map. The dilated convolution shown in Figure 3b is used in this module. Dilated convolutional layers are known to significantly improve predictive accuracy on semantic segmentation tasks [33,34].
Because of the downsampling of the image in the feature extraction process, the resolution of the output feature map is reduced, and it loses considerable detail. To obtain a high-resolution density map, we use a set of transposed convolutions to upsample the feature map after the MFEM has been applied. A transposed convolution is not a completely inverse process of a normal convolution but a special convolution. The image size is first expanded by padding the image with zeros according to a certain ratio. The convolution kernel is then rotated, and forward convolution is performed, as shown in Figure 3c. Unlike previous methods, we chose a learnable transposed convolution instead of a bilinear interpolation algorithm for upsampling. Transposed convolution differs from bilinear interpolation in that it has learnable parameters, which means that it can learn more feature information than bilinear interpolation. The transposed convolution layers are used to restore the spatial resolution of the image. Each transposed convolution layer doubles the size of the feature map, undoing the reduction of one max-pooling layer in the front end. Three transposed convolution layers are used in the network to generate a high-resolution density map of the same size as the input image. This provides detailed spatial information to facilitate feature learning while training the model.
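The size bookkeeping described above can be checked with a few lines of arithmetic. The following is a minimal sketch rather than the authors' implementation: it assumes three 2 × 2 max-pooling layers in the front end (implied by the 1/8 output size) and, hypothetically, transposed convolutions with kernel size 4, stride 2, and padding 1, a common configuration that exactly doubles the spatial size. The text does not state the kernel settings actually used.

```python
def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def front_end_size(size, num_pools=3):
    """Each stride-2 max-pooling layer halves the size (floor division)."""
    for _ in range(num_pools):
        size //= 2
    return size


def back_end_size(size, num_layers=3, kernel=4, stride=2, padding=1):
    """Apply the transposed-convolution layers; kernel settings are assumed."""
    for _ in range(num_layers):
        size = conv_transpose_out(size, kernel, stride, padding)
    return size


# Image resolution reported for the Metro Platform dataset.
height, width = 576, 768
h8, w8 = front_end_size(height), front_end_size(width)
print(h8, w8)                                # 72 96
print(back_end_size(h8), back_end_size(w8))  # 576 768
```

With these assumed settings, an input whose sides are divisible by 8 is restored to exactly its original size; other sizes would need an output padding term or cropping.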

Multiscale Feature Extraction Module
Owing to the complex distribution of passengers waiting on a metro platform, the perspective of the camera, and other problems, the head size of passengers in the captured images varies. In addition, reflections from screen doors on the platform, elevators, and other small facilities cause complex changes in background information. These problems pose daunting challenges to the crowd counting task on the metro platform. Previously proposed methods, such as the L2SM [35] and S-DCNet [36], have focused on fusing feature maps from different CNN layers to acquire multiscale information through a feature pyramid network structure. In this paper, we introduce a multiscale feature extraction module to solve this problem. This is the first time we have applied this method to the crowd counting task of a metro platform. The proposed MFEM improves multiscale feature extraction to enhance the information in each layer of the feature map.
As shown in Figure 4, the MFEM first compresses the channels of the feature map via a 1 × 1 convolution and then processes the compressed feature map by dilated convolutions with dilation rates of 1, 2, 3, and 4 to handle the multiscale features and variations in head sizes in the images. The size of the fixed Gaussian kernel in this paper is set to 15, so each annotated head occupies 15 × 15 pixels in the generated density map; padding the image with zeros does not affect the counting result. Dilated convolution expands the receptive field of the convolution kernel while keeping the number of parameters unchanged, which helps keep the computation fast. A diagrammatic sketch of a dilated convolution with a dilation rate of 3 is shown in Figure 3b. The extracted multiscale feature maps are fused by a concatenation operation and a 3 × 3 convolution; the processed feature maps have the same size as the input feature maps.
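To illustrate how dilation enlarges the receptive field without adding weights, the sketch below implements an unpadded 2D dilated convolution in plain Python. It is an illustration under stated assumptions, not the MFEM itself: the function names, the zero-based indexing, and the toy input are hypothetical, and the MFEM in the paper additionally uses padding so that the feature map size is preserved.

```python
def effective_kernel_size(kernel, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel + (kernel - 1) * (dilation - 1)


def dilated_conv2d(x, f, d=1):
    """Unpadded 2D dilated convolution of x with filter f at dilation rate d.

    x and f are lists of rows. Each output value samples x on a grid whose
    points are d pixels apart, so the filter keeps the same number of weights.
    """
    k_h, k_w = len(f), len(f[0])
    out_h = len(x) - effective_kernel_size(k_h, d) + 1
    out_w = len(x[0]) - effective_kernel_size(k_w, d) + 1
    return [
        [
            sum(x[l + d * i][w + d * j] * f[i][j]
                for i in range(k_h) for j in range(k_w))
            for w in range(out_w)
        ]
        for l in range(out_h)
    ]


# A 3 x 3 filter covers 3, 5, 7, and 9 pixels per side at rates 1 to 4.
for rate in (1, 2, 3, 4):
    print(rate, effective_kernel_size(3, rate))

image = [[r * 7 + c for c in range(7)] for r in range(7)]  # toy 7 x 7 input
ones = [[1] * 3 for _ in range(3)]
print(dilated_conv2d(image, ones, d=1)[0][0])  # 72: sum of the top-left 3 x 3 block
print(dilated_conv2d(image, ones, d=3))        # [[216]]: nine samples spread over 7 x 7
```

With d = 1 the function reduces to an ordinary unpadded convolution, and with d = 3 the same nine weights span a 7 × 7 window.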

The key component of this design is the dilated convolution layer. A dilated convolution can be defined as follows:
$$Y(l, w) = \sum_{i=1}^{L} \sum_{j=1}^{W} x(l + d \times i,\, w + d \times j)\, f(i, j)$$
where Y(l, w) is the output of the dilated convolution from input x(l, w), filter f(i, j) has length L and width W, and parameter d represents the rate of dilation. When d = 1, a dilated convolution turns into a normal convolution.

Ground Truth Generation
In research on crowd counting, the dataset used is typically composed of original images and annotation files. Annotations for images of crowds consist of points at the center of each passenger's head, which record the two-dimensional (2D) coordinates of each head and the total number of heads. These discrete coordinate points must be converted into a density map in order to predict passenger density.
The ground-truth density map is generated by convolving each delta function $\delta(x - x_i)$ with a normalized Gaussian kernel $G_\sigma$:
$$F(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma}(x)$$
where x represents each pixel in a given image, $x_i$ is the ith annotated point, and N is the number of annotated points. The integral of the density map is equal to the number of people in the image. Instead of using geometry-adaptive kernels, as in [12], we use a fixed Gaussian kernel to generate the ground-truth density maps; the spread parameter σ of the Gaussian kernel is set to 15.
The sum of all pixel values gives the number of people in the crowd in the input image. P denotes the number of passengers and is defined as follows:
$$P = \sum_{l=1}^{L} \sum_{w=1}^{W} Z_{l,w}$$
where L represents the length of the density map, W represents its width, and $Z_{l,w}$ is the pixel at (l, w) in the generated density map.
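The two steps above, building a density map with a fixed Gaussian kernel and recovering the count by summing its pixels, can be sketched as follows. This is a minimal illustration under assumptions: the 15 × 15 kernel size follows the text, whereas the standard deviation of 4.0, the renormalization of kernels clipped at the image border, and the head coordinates are hypothetical choices made only for this example.

```python
import math


def gaussian_kernel(size, sigma):
    """Normalized size x size Gaussian kernel (entries sum to 1)."""
    half = size // 2
    raw = [
        [math.exp(-((r - half) ** 2 + (c - half) ** 2) / (2.0 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def density_map(height, width, heads, size=15, sigma=4.0):
    """Place one normalized kernel at every annotated head (row, col)."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for row, col in heads:
        # Keep only the kernel entries that fall inside the image, then
        # renormalize so that each head still contributes exactly 1
        # (an assumption of this sketch, not stated in the text).
        cells = [
            (row + r - half, col + c - half, kernel[r][c])
            for r in range(size) for c in range(size)
            if 0 <= row + r - half < height and 0 <= col + c - half < width
        ]
        mass = sum(v for _, _, v in cells)
        for y, x, v in cells:
            dmap[y][x] += v / mass
    return dmap


def count_people(dmap):
    """P: the sum of all pixel values of the density map."""
    return sum(map(sum, dmap))


heads = [(10, 12), (11, 14), (0, 0), (39, 59)]  # hypothetical annotations
dmap = density_map(40, 60, heads)
print(round(count_people(dmap), 6))  # 4.0
```

Because every head adds a unit mass, overlapping heads raise the local density without changing the total, which is what allows the count to be read off as a sum.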

Training Details
We trained the proposed MP-CNN in an end-to-end manner. The front end was initialized with the weights of the VGG network pretrained on ImageNet. We performed our experiments on an NVIDIA Quadro P4000 GPU with a batch size of 1. An Adam optimizer [37] with a low learning rate of 1 × 10^−5 was used to train the model; all models were trained for 500 epochs. The Euclidean distance was used to measure estimation error at the pixel level, as in [12,14,28]. The loss function was defined as follows:
$$L(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| F(X_i; \theta) - F_i \right\|_2^2$$
In the above equation, θ denotes the set of parameters in the proposed MP-CNN, N is the number of training images, $X_i$ represents the ith input image, and $F_i$ denotes the ground-truth density map of image $X_i$. $F(X_i; \theta)$ stands for the estimated density map generated by the MP-CNN, parameterized with θ, for sample $X_i$, and $L(\theta)$ is the loss between the estimated density maps and the ground-truth density maps. Our method was implemented in the PyTorch [38] framework.
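As a concrete reading of the pixel-level Euclidean loss, the sketch below computes it for lists of density maps in plain Python. It assumes the 1/(2N) normalization commonly used in the cited works; the function name and the toy maps are hypothetical and serve only to show the arithmetic.

```python
def euclidean_loss(estimated, ground_truth):
    """L(theta) = 1 / (2N) * sum_i ||F(X_i; theta) - F_i||_2^2.

    Both arguments are lists of N density maps, each given as a list of rows.
    """
    n = len(estimated)
    total = 0.0
    for est_map, gt_map in zip(estimated, ground_truth):
        for est_row, gt_row in zip(est_map, gt_map):
            total += sum((e - g) ** 2 for e, g in zip(est_row, gt_row))
    return total / (2 * n)


# Two toy 2 x 2 density maps (hypothetical values).
est = [[[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, 0.0]]]
gt = [[[0.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 0.0]]]
print(euclidean_loss(est, gt))  # 0.125
```

Note that the first toy estimate has the correct total count of 1 but still incurs a loss, because the error is measured per pixel rather than per image.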

Experiments
In this section, we first introduce the datasets and evaluation metrics used. The experiments conducted on the Metro Platform dataset are then detailed. They verify that the proposed method can be used for counting passengers on a metro platform. We then compare our method with state-of-the-art methods on four standard datasets to prove its generalization capability. Finally, we report ablation studies to prove the effectiveness of the proposed MFEM used in our method.

Datasets
We evaluated our method on four publicly available crowd counting benchmark datasets as well as the dataset collected for this paper (Metro Platform): ShanghaiTech [12] Part A and Part B, UCF-QNRF [39], and UCF-CC-50 [24].
ShanghaiTech. The ShanghaiTech dataset was introduced in [12] and contains 1198 images with 330,165 annotated people. Each image in this dataset has a different perspective. The dataset consists of two parts: Part A with 482 images and Part B with 716 images. The crowd density varies significantly between Part A and Part B, which makes accurate estimation more challenging. Images in Part A were randomly collected from the internet, and Part B contains images captured from street views. We used the training and testing splits provided by the authors, which gave 300 images for training and 182 images for testing in Part A and 400 images for training and 316 images for testing in Part B.
UCF-QNRF. UCF-QNRF, reported in [39] in 2018, is one of the largest and most diverse datasets in the domain of crowd counting. It contains 1535 images featuring 1,251,642 people, with the centers of their heads annotated, including 1201 images in the training set and 334 images in the test set. It covers a wide variety of scenes, with diverse viewpoints, densities, and lighting conditions. The resolution is higher than in the ShanghaiTech dataset. These properties make the dataset more realistic and also make counting the people in each image more difficult.
UCF-CC-50. The UCF-CC-50 dataset [24] contains 50 annotated images of extremely dense crowds. The images were collected mainly from concerts, protests, and marathons, with different crowd densities and perspectives. There is a large variation in crowd numbers, ranging from 94 to 4543. The limited number of images makes it a challenging dataset for deep learning methods. We followed the standard protocol in [24] and used fivefold cross-validation to evaluate the performance of the proposed method on this dataset.
Metro Platform. Crowd counting is important, but the available counting datasets are not specifically designed for metro transportation. Therefore, we collected and labeled a dataset that is specific to metro platforms in order to count the waiting passengers in such areas. The images were captured from a video from a camera at a certain perspective on a metro platform, including the peak and normal periods on weekdays and weekends. The Metro Platform dataset consists of 627 images and 9243 annotations; the resolution of the images is 576 × 768. For the evaluation, we used 465 images from the dataset as the training set and 162 as the testing set. A comparison between the Metro Platform dataset and the other datasets used is shown in Table 1.
Table 1. Comparison between the Metro Platform dataset and the other datasets used in this study: Num is the number of images, Total is the total number of labeled people, Ave is the average crowd count, and Max is the maximal crowd count.

Evaluation Metrics
In accordance with previous studies [12-14], we used mean absolute error (MAE) and mean squared error (MSE) as metrics to evaluate the accuracy of the methods in terms of counting members of a crowd:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| Z_i - \hat{Z}_i \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( Z_i - \hat{Z}_i \right)^2}$$
In the above equations, N is the number of test samples; $Z_i$ and $\hat{Z}_i$ are the estimated and ground-truth crowd numbers corresponding to the ith sample, each given by integrating the corresponding density map. Roughly speaking, the MAE indicates the accuracy of the predicted result and the MSE measures its robustness. As the MSE is sensitive to outliers, its value will be large when the model performs poorly on a few samples.
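The difference between the two metrics is easiest to see on a small example. The sketch below assumes the convention customary in the crowd counting literature, in which the reported MSE is the square root of the mean squared count error; the sample counts are hypothetical.

```python
import math


def mae(estimated, ground_truth):
    """Mean absolute error between estimated and ground-truth counts."""
    return sum(abs(z - g) for z, g in zip(estimated, ground_truth)) / len(estimated)


def mse(estimated, ground_truth):
    """Square root of the mean squared count error."""
    squared = sum((z - g) ** 2 for z, g in zip(estimated, ground_truth))
    return math.sqrt(squared / len(estimated))


# Hypothetical counts: three good estimates and one outlier (51 vs. 41).
estimated = [12.0, 30.0, 7.0, 51.0]
ground_truth = [10.0, 30.0, 8.0, 41.0]
print(mae(estimated, ground_truth))  # 3.25
print(mse(estimated, ground_truth))  # about 5.12, dominated by the outlier
```

The single poor estimate raises the MSE well above the MAE, which is why the MSE is read as a measure of robustness.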

Experiments on the Metro Platform Dataset
The Metro Platform dataset was designed specifically for metro platforms. Due to the angle of the camera, the characteristics of the crowd close to the camera are clear, while those at a long distance from it are blurred, as shown in Figure 1. In addition, the background in the image is complex and accounts for a large part of the image. The screen door of the metro platform also produces significant reflection. The position of the crowd changes from image to image, and so does the adverse background caused by the reflection. These problems pose significant challenges for the counting task. To cope with the changing background, we used the model trained on a dense crowd dataset, ShanghaiTech Part A, as the pretrained model in the experiments. The results of the comparison are shown in Table 2. Figure 5 shows the density maps obtained using the different methods. We adopted memory access cost (MAC) to evaluate computational complexity; the MAC values of the different methods, measured in the same experimental environment, are also shown in Table 2. Our method achieves the highest counting accuracy on metro platform scenes, but the network is also more complex. In future work, we will try to make the network more lightweight. The proposed method was superior to the other methods in the metro platform scenario as it counted the passengers in crowds more accurately and generated a higher-quality density map. The distribution of passengers on the metro platform can be obtained from the generated density map, and metro staff can disperse the crowd in a crowded area according to the actual situation so as to avoid safety accidents caused by overcrowding.
In the future, we will further explore the influence of occlusion and reflection on the counting task on metro platforms. We will try to improve the estimation accuracy in two ways. First, we will use hybrid supervised-unsupervised machine learning approaches [40,41] in an attempt to extract more relevant features. Second, we will preprocess the monitoring video to crop or mask the screen door of the metro platform, which causes serious reflection.

Comparisons with State of the Art
The proposed method performed strongly on all benchmarks. The results of quantitative comparisons with the state-of-the-art methods on four datasets are presented in Tables 3 and 4. A visual comparison is also provided in Figure 6.
ShanghaiTech. We compared the proposed method with multiple classic methods on the ShanghaiTech Part A and Part B datasets and found that it yielded a significant improvement in performance. In Part A, our method was superior to the MCNN by 39.2% in terms of the MAE and 34.9% in terms of the MSE, and to CSRNet by 1.8% and 2.1%, respectively. In Part B, our method was superior to the MCNN by 62.5% in terms of the MAE and 64.6% in terms of the MSE, and to CSRNet by 6.6% and 8.8%, respectively.
UCF-QNRF. The proposed method also improved on existing methods on this large and diverse dataset. For instance, RANet [42] achieved a score of 111 in terms of the MAE and 190 in terms of the MSE, whereas our method improved these results by 3.3% in terms of the MAE and 4.3% in terms of the MSE.
UCF-CC-50. We also conducted experiments on the UCF-CC-50 dataset. The crowd numbers in the images vary from 94 to 4543. According to the standard protocol in [24], the dataset was randomly divided into five subsets, and we used five-fold cross-validation to evaluate our method. Even with a small number of training images, our network converged well on this dataset. Compared with RANet [42], it was better by 5.5% in terms of the MAE and 4.3% in terms of the MSE.
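The percentages quoted in these comparisons are relative reductions in error with respect to a baseline. The sketch below shows the arithmetic, using the RANet scores on UCF-QNRF quoted in the text; the values it prints are implied by that arithmetic only and may differ in rounding from the entries in Tables 3 and 4. The function names are hypothetical.

```python
def relative_improvement(baseline, ours):
    """Percentage by which `ours` lowers an error metric relative to `baseline`."""
    return (baseline - ours) / baseline * 100.0


def implied_error(baseline, improvement_percent):
    """Error value implied by a baseline and a stated relative improvement."""
    return baseline * (1.0 - improvement_percent / 100.0)


# RANet on UCF-QNRF as quoted in the text: MAE 111 and MSE 190.
print(round(implied_error(111, 3.3), 1))  # 107.3
print(round(implied_error(190, 4.3), 1))  # 181.8
```

A relative measure of this kind makes results comparable across datasets whose absolute error levels differ by an order of magnitude.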

Ablation Experiments
In this section, we report the results of ablation studies on the different datasets used to verify the effectiveness of the proposed MFEM. The experimental results show that the MFEM structure in Figure 4 offers the best balance between training speed and estimation accuracy.
To verify the effectiveness of the MFEM, we used the proposed network structure with MFEM and without it in the training process on different datasets. The results showed that the performance of the proposed method improved when the MFEM was introduced, as shown in Table 5. On the ShanghaiTech Part B dataset, the proposed MFEM improved the performance by 21.4% in terms of the MAE and 25.9% in terms of the MSE. On the UCF-CC-50 dataset, the introduction of MFEM improved the performance by 12.7% and 5.6% in terms of the MAE and MSE, respectively. On the Metro Platform dataset proposed in this paper, the MAE and MSE improved by 33.3% and 29%, respectively, with the introduction of MFEM. This shows that the MFEM can improve counting performance in dense and relatively sparse scenes.

Conclusions
In this paper, we proposed a novel method, called the MP-CNN, to count the number of people in crowds on metro platforms. We introduced an MFEM to enhance the multiscale feature extraction capability of the network and to address the problems of diverse occlusion and varying head sizes of passengers in the images. This method is of great significance to public safety on metro platforms; metro staff can guide and disperse passenger flow according to the number of passengers. The effectiveness of the proposed MFEM was verified by comparative experiments. To evaluate the method on metro platforms in particular, we collected and labeled a new dataset, called the Metro Platform dataset, consisting of 627 images with 9243 annotated people. The results of extensive experiments show that our method delivers excellent results on the proposed Metro Platform dataset and can compete with state-of-the-art methods on four major crowd counting benchmarks.
Author Contributions: Resources, conceptualization, J.Z. and Z.W.; software, methodology, formal analysis, validation, data curation, investigation, writing-original draft, writing-review and editing, visualization, J.L.; funding acquisition, supervision, project administration, Z.W. All authors have read and agreed to the published version of the manuscript.