1. Introduction
Person tracking and action-recognition algorithms for streaming video data have recently attracted considerable interest, motivated by a variety of possible applications. There are numerous approaches and achievements, often multistage and complicated, leading to different types of information extracted from data. Their further improvement and evolution require reliable evaluation and benchmarking scenarios. Existing benchmarking methods test the reliability of tracking and action-recognition algorithms on public datasets manually annotated by their authors. PETS [1] is the first well-known dataset created primarily for surveillance applications. The original release consisted of three subsets of benchmark data: the first designed for pedestrian count and density analysis, the second for pedestrian tracking, and the last for analysis of traffic flow and event recognition. For the evaluation of tracking methods, the most popular datasets are those of MOTChallenge [2,3,4,5], which provides datasets with ground truth, MATLAB validation scripts, and the possibility of uploading results so that methods can be ranked with respect to quality and accuracy.
However, manual preparation of annotations for benchmarking data leads to substantial limitations and shortcomings. The first limitation concerns data volume and resolution, which must be kept small because of the limited “throughput” of human operators. The second limitation is insufficient replicability and possible biases introduced by disagreements between annotations of different human experts (operators). These limitations lead to biases and errors in evaluating tracking and action-recognition algorithms [2]. Without sufficient variety and replicability of benchmarking data, evaluations of tracking and action-recognition algorithms are likely to overestimate their efficiency, because repeated testing on the same data lets design parameters overfit. Limitations of existing benchmarking approaches lead many authors to create their own test recordings, a challenging and time-consuming task [6]. To prepare such a dataset, authors have actors perform specially designed scenarios and then, again manually, annotate where and what actions were performed. However, such in-house benchmarks typically suffer from a lack of standardization and/or reproducibility [4,5]. Limitations of manually prepared data were also observed by the authors of this paper during work on the practical implementation of a surveillance project [7].
The problems described above justify serious research towards machine-generated benchmarking systems of large variability, resolution, and volume. Benchmarking methodologies for person tracking and action-recognition algorithms must address the need to support many different tracking [8] and action-recognition [9] environments, including the analysis of surveillance video and urban data understanding. The desired task is to track multiple objects while employing simple and compound action recognition and many other aspects of computer vision. The quality of these analyses can be influenced by variations in weather conditions, lighting levels, occlusions of people, and changes in camera position. Algorithms that have proved their usefulness in such analyses over the last decade must now deal with considerably larger datasets than before, owing to optimization and the rapid growth of computational power [5].
Realistic behavior and motion can be generated by employing the general concept of a crowd simulator, a piece of software that allows a user to determine the movement and behavior of a user-defined number of crowd members, often referred to as actors or agents, in specific, variable circumstances [10,11,12,13,14,15,16]. Crowd simulators can be divided into two main categories: those that represent autonomous crowd behaviors in the most realistic and faithful way, and those that aim to produce visualizations of persons and groups with the highest possible graphical fidelity for human viewers. The primary use of simulators belonging to the first category is to aid in the design of effective evacuation routes from buildings or during mass events. Realistic simulations of crowd behavior allow safety solutions to be tested cost-effectively in a contained environment, further improving their quality. Examples of such simulators are the MASSIS framework [17] and PedSim, available as a C++ library [18]. Agent positions generated by PedSim can be used in a user’s graphical engine of choice to produce visualizations. While a crowd’s behavior can typically be modeled with high accuracy in this type of simulator, the visual side is treated in a purely utilitarian way and is often too minimalistic, overly simplified, or simply lacking for the needs of computer vision. The second category of simulator is used primarily to create visual effects in movies and animations. Simulators of this type allow a scene to be populated with realistic-looking and realistically behaving virtual actors. The most popular professionally used simulators of this type are Golaem Crowd [19], Miarmy (an Autodesk Maya plugin), and the MASSIVE environment [20]. Such software can produce high-quality visualizations, but the crowd agents often lack autonomy and customizability.
Van Toll et al. [21] define a five-level hierarchy for path planning in crowd simulation systems. This hierarchy allows a convenient division of all the tasks required to create realistic motion in crowd simulation systems. At present, many video game engines, including Unity, Unreal Engine, and CryEngine, provide useful, convenient, and sufficient tools for path planning and overall control of agents during route following. The Unity game engine provides its NavMesh Agent [22] for this purpose. Kristinsson [23] describes a system based on Unity 3D that, together with a properly configured animation controller component and suitable animation clips, can employ motion-capture technology to record video clips and high-resolution 3D models, thereby creating a realistic crowd simulation system. As one approach to this problem, Forbus et al. [24] describe a video game entitled “The Sims”, created by Maxis Studios, which gives a player the ability to create and control a virtual family over many generations. Agents, or Sims, can perform various actions involving objects in the game world or one another. Actions performed by agents, or indirectly controlled Sims, occur when a user creates a temporal-interaction object with a proper animation clip or animation offset in order to impart the feel of a real-world situation. To achieve such an effect, every action needs to be manually defined by the user. The Sims comes with a virtual machine, Edith, designed specifically for this purpose. Musee et al. [25] propose a different approach in which a crowd simulator simulates the motion of synthetic pedestrians drawn from samples gathered from real-world video sequences using object tracking techniques. The simulator uses the trajectories of pedestrians gathered from video sequences to simulate pedestrians moving within a simulated environment. In [26], the authors report the creation of an interactive data-driven crowd simulator that combines the high realism of data-driven methods with the interactivity of synthetic techniques.
As previously mentioned, crowd simulators already have numerous applications, including not only gaming software but also city and environment planning, the organization of commercial spaces, and the design and verification of evacuation paths for buildings or terrain. All these applications inspire fast advances in crowd simulators. In this work, a new application of crowd simulators is presented. CrowdSim (crowd simulation system) was designed and employed as a validation tool for video tracking. To the best of the authors' knowledge, CrowdSim is the first system of this type that allows generation of random test data with high variability and controlled complexity.
A graphical system is designed that creates simulated records together with matching annotations. With this system, a potential user can generate a sequence of random images incorporating differing times of day, weather conditions, and scenes for the purpose of tracking evaluation. Published algorithms can thus be tested with respect to the influence of crowd density or the impacts of different light conditions related to weather, both of which are difficult to capture when real-world videos are employed. The proposed solution was tested and validated on different algorithms for tracking multiple objects and these results were compared with those obtained from MOTChallenge. The presented work has the following features:
random generation of images that can be used to evaluate different algorithms for tracking multiple objects, by means of random starting positions of pedestrians, their unpredictable interactions during the simulation process, and a choice of different models;
application of the Unity game engine for crowd simulation;
automatic annotation of object detections and names of actions according to the MOT format;
application of prepared models, generated scenes, and changes in time of day and weather conditions during a simulation.
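The automatic annotation listed among the features can be illustrated with a minimal sketch of a writer for MOT-style ground truth. The column order (frame, id, bb_left, bb_top, bb_width, bb_height, conf, x, y, z) is assumed here to follow the MOTChallenge 2D layout; the `Box` container and the function names are hypothetical and are not taken from CrowdSim. Action names are omitted because their encoding is not specified in this section.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Box:
    """One annotated pedestrian in one frame (hypothetical container)."""
    frame: int      # 1-based frame index
    track_id: int   # identity kept constant along a trajectory
    left: float     # x of the top-left corner, in pixels
    top: float      # y of the top-left corner, in pixels
    width: float
    height: float


def to_mot_rows(boxes):
    """Convert boxes to rows in the assumed MOTChallenge 2D column order.

    The last three columns are world coordinates, set to -1 for 2D data;
    the confidence flag is set to 1 for every ground-truth box.
    """
    ordered = sorted(boxes, key=lambda b: (b.frame, b.track_id))
    return [
        [b.frame, b.track_id, b.left, b.top, b.width, b.height, 1, -1, -1, -1]
        for b in ordered
    ]


def write_mot(boxes, stream):
    """Write the rows as comma-separated text to an open text stream."""
    csv.writer(stream, lineterminator="\n").writerows(to_mot_rows(boxes))
```

Because a simulator knows every agent's position exactly, such rows can be emitted per rendered frame without any manual labeling step.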
4. Discussion
Experience with the automated, crowd-simulation-based approach to video tracking evaluation shows its flexibility and ease of implementation. Evaluation scenarios can be planned very easily, and data are readily presented to video tracking algorithms, leading to reliable scoring and ranking of algorithms. The CrowdSim tool allows comprehensive evaluation and prioritization of any available state-of-the-art video tracking method. Here, six methods developed between 2013 and 2017 were evaluated. That choice was dictated by the availability and functionality of the tested tools. The authors of some of the methods included in the latest MOT have not released the source code [39,40,41,42,43,44,45,46,47,48]. Some of the available methods could be compiled but required additional code modifications or input format adjustments, or generated erroneous results [49,50,51,52].
An important aspect of assessing the presented method was its comparison with existing approaches, mainly with the MOTChallenge database, which is currently the most commonly used. For easy situations, the obtained ordering of video tracking algorithms largely coincides with the MOT ranking. However, scenarios that are currently poorly covered by manually prepared data, such as rainy, foggy, or snowy conditions, can strongly influence evaluations. The IOU method, which achieves top MOT scores, is also the most efficient on data generated by CrowdSim. Increasing the number of pedestrians or adding a rain or snow effect decreases the scores of all compared methods below 90. The CrowdSim tool for benchmarking tracking algorithms generates video streams that are less realistic than real images, e.g., those obtained from monitoring systems. However, in contrast to real-world images, simulated images are much more repeatable and have all parameters under control. Therefore, despite the gap in realism, they allow more robust conclusions to be drawn concerning the quality of tracking algorithms, as has been shown in the experiments. Clearly, by advancing the proposed technology for generating artificial images, future versions can bring simulated scenes closer to reality.
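The internals of the IOU method are not described in this section. It is assumed here that, as its name suggests, it associates detections between frames by the intersection over union of their bounding boxes. A minimal sketch of that overlap measure, for boxes given as (left, top, width, height), follows; the function name is illustrative only.

```python
def iou(box_a, box_b):
    """Intersection over union of two (left, top, width, height) boxes."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh

    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

A measure of this kind depends only on box geometry, which may explain why dense crowds with frequent occlusions are a hard case for it.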
The performed analyses imply that the MOTChallenge ranking cannot be treated as reference data for all situations. The collections of video streams prepared for this ranking do not allow control of important parameters affecting tracking efficiency: light conditions, number of pedestrians, and weather conditions. The data available in MOTChallenge are too sparse to study differences in tracking efficiency as functions of these parameters. The MOTChallenge scorings yielded lower values of the MOTA and MOTP parameters compared with the crowd simulations. For MOTP, the variability in results is much lower than for the MOTA parameter.
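For reference, the two indexes can be computed as in the following sketch. It assumes the usual CLEAR MOT definitions: MOTA as one minus the ratio of all errors (misses, false positives, identity switches) to the number of ground-truth objects, and MOTP as the mean overlap of matched pairs. Function and argument names are illustrative. The values are returned as fractions; multiplying by 100 gives percentages.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multiple object tracking accuracy, summed over all frames.

    Can be negative when the tracker makes more errors than there are
    ground-truth objects.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


def motp(matched_overlaps):
    """Multiple object tracking precision as the mean overlap of matches.

    `matched_overlaps` holds one overlap value (e.g., bounding-box IoU)
    per matched ground-truth/hypothesis pair over the whole sequence.
    """
    if not matched_overlaps:
        raise ValueError("at least one matched pair is required")
    return sum(matched_overlaps) / len(matched_overlaps)
```

Under these definitions MOTA reacts to every missed or spurious detection, whereas MOTP only averages localization quality over pairs that were matched, which is consistent with the lower variability observed for MOTP.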
4.1. Application of the Concept of Crowd Simulations
The proposed system for automatic generation of different scenes ensures that designers of tracking methods cannot overfit their parameters by targeting only limited sets of real videos. The examples show that a crowd simulator can be applied to verify and validate tracking algorithms. The main benefit of such an approach lies in the wider possibilities for comparison and for generating different situations for a scene. CrowdSim enables verification of the impact of crowd density (i.e., the number of pedestrians in a scene) on tracking algorithms. In general, tracking methods were observed to encounter more problems as the number of pedestrians increased. In addition, the influence of three different weather conditions (and the associated lighting) on the quality of tracking was checked, and the main differences were observed under the rain and fog conditions. The obtained results demonstrate the main trends with respect to these parameters, and the idea of crowd simulations in tracking evaluation can be developed further and applied under conditions of increasing sophistication. Although the presented approach was compared with the results from MOTChallenge, the main goal was not to obtain exactly the same parameter values, which would be impossible. MOTChallenge contains videos created under different conditions, which are presented in the rankings in averaged form. Moreover, not all methods appear in each ranking, and only a few working open-source implementations were available (even for the methods that topped the rankings). In method verification, the authors focused primarily on the MOTA and MOTP parameters, which are the most complex. It was possible to identify the methods that yielded repeatable results and the conditions under which algorithms gave different results, and finally to verify the efficiency of the tracking methods included in the analysis.
To reach a compromise between the quality of simulations and the computation time of tracking methods, a frame rate of 10 FPS and a resolution of 800 × 600 were chosen. The main intention was to verify the influence of crowd density and weather conditions, so the number of repetitions is much higher than for single videos in MOT. Current video surveillance systems operate at 10–25 FPS, and many popular benchmarks have similar parameters (KITTI [53]: 10 FPS, PETS09 [1]: 7 FPS). Lower frame rates are still dictated by video storage and data transmission constraints. The IPVM report [54] shows that the average frame rate in video surveillance increased to 15 FPS in 2019 (40% at 11–15 FPS and 28% at 6–10 FPS) from 10 FPS in 2016 (51% at 6–10 FPS and 32% at 11–15 FPS). Furthermore, a group from University College London showed that the minimum frame rate needed to detect events in similar video systems is 8 FPS [55]. The authors of [56] regard 10 FPS as real-time performance. Nevertheless, CrowdSim allows the frame rate and the image resolution to be increased or decreased so that potential users can adjust the images to their needs.
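The dependence of storage and transmission load on frame rate and resolution can be made concrete with a short calculation. The sketch below assumes uncompressed 8-bit RGB frames (3 bytes per pixel), which is an illustrative upper bound and not a property of CrowdSim output; the function name is hypothetical.

```python
def raw_stream_rate(width, height, fps, bytes_per_pixel=3):
    """Bytes per second of an uncompressed video stream.

    The default of 3 bytes per pixel assumes 8-bit RGB frames.
    """
    return width * height * bytes_per_pixel * fps
```

For the settings chosen above, `raw_stream_rate(800, 600, 10)` gives 14,400,000 bytes per second, against 36,000,000 bytes per second at 25 FPS, so the raw data volume scales linearly with the frame rate.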
4.2. Future Works
The proposed solution solves many complex problems but nonetheless requires additional development in the future. Human movement was created with built-in Unity tools and with external tools. Nevertheless, the simulation should be equipped with an additional mechanism capable of making human movement more natural while limiting the possibility of occlusions. For action recognition and more advanced simulation of the movement of individual agents, it is planned to use an additional animation library and to record custom animations in motion-capture systems. Replacing the navigation mesh with planned movement, in which all behavior is planned a priori and simulated by the software, should also be considered. An additional element that would make the simulations more realistic is a background consisting of other moving objects that do not participate in the tracking process. There are also plans to prepare first-person images from a camera that could move with the crowd. A scene should consist of more realistic elements, such as traffic jams in the case of moving cars. The crowd simulations themselves should be more complex, include people of more varied appearance, and generate more problems for tracking algorithms. To improve the realism of scenario elements such as weather conditions, future versions of the simulator should be equipped with different kinds of post-processing methods as well as shaders implemented for that purpose.
5. Conclusions
In this paper, a novel approach to the validation of video tracking algorithms was proposed. The crowd simulation system for video tracking benchmarking was prepared in a game engine environment. Previous approaches used real-world collections of video streams, catalogued and annotated manually. The proposed methodology was inspired by problems that follow from the weaknesses of manual preparation schemes, which are mentioned in several literature references and commonly encountered in the preparation and validation of real-world surveillance systems: limited volume and resolution of data streams, insufficient replicability, and difficulties in planning and implementing desired scenarios.
In the proposed algorithmic solution and its implementation, “CrowdSim”, the resulting images are automatically annotated in the commonly used convention and can be readily applied to the evaluation of tracking algorithms. The output of “CrowdSim” reproduces formats and standards already in use for benchmarking video surveillance systems, in particular those of MOTChallenge. All output indexes for quality of tracking are reliably reproduced. Simulation scenarios can represent a variety of situations and can be controlled by parameters such as time of day, crowd level, weather conditions, and background. Each parameter can take different levels. Studies of the sensitivity of algorithms to different variants and levels of disturbing parameters can therefore easily be conducted.
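A sensitivity study of this kind amounts to enumerating combinations of parameter levels and repeating each combination with a different random seed. The sketch below illustrates the idea; the function, the dictionary keys, and the example levels are hypothetical and do not reflect the CrowdSim configuration interface.

```python
from itertools import product


def scenario_grid(times_of_day, crowd_levels, weathers, repetitions, base_seed=0):
    """Enumerate every combination of parameter levels, with repetitions.

    Each scenario receives its own seed so that a run can be reproduced
    exactly while repetitions of the same combination still differ.
    """
    combos = product(times_of_day, crowd_levels, weathers, range(repetitions))
    scenarios = []
    for index, (time_of_day, crowd, weather, rep) in enumerate(combos):
        scenarios.append({
            "time_of_day": time_of_day,
            "crowd_level": crowd,      # e.g., number of pedestrians
            "weather": weather,
            "repetition": rep,
            "seed": base_seed + index,
        })
    return scenarios
```

For example, two times of day, three crowd levels, three weather conditions, and five repetitions give 2 × 3 × 3 × 5 = 90 scenarios, each of which can be regenerated from its seed.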
To confirm the usefulness of the proposed approach, several tests were conducted in which different tracking algorithms (IOU, TCF, ELP, TBD, HDH, DCT) were applied to the generated data. Significant improvements were demonstrated by comparing the described approach with existing benchmarking systems based on manual annotations. CrowdSim made possible exhaustive comparisons of the studied tracking algorithms with respect to several important indexes: crowd density, weather conditions (rain, snow, fog/steam), lighting conditions, and time of day. Significant differences between video tracking systems in their responses to changes in these indexes were observed, spanning one order of magnitude for the MOTA parameter. In contrast, manually annotated systems are likely to flatten the ranges of quality indexes of compared algorithms due to insufficient variability of parameters.
Several directions of future development of the system are possible: improving the realism of generated images, increasing the variability ranges of parameters, and combining different indexes in video streams. Developing new versions of crowd simulators will contribute to improvements in the efficiency of the constructed surveillance systems. CrowdSim is freely available at the dedicated webpage (https://crowdsim.aei.polsl.pl) along with several simulated benchmark data sets.