Improving Deep Object Detection Algorithms for Game Scenes

Abstract: The advancement and popularity of computer games make game scene analysis one of the most interesting research topics in the computer vision community. Among the various computer vision techniques, we employ object detection algorithms for the analysis, since they can both recognize and localize objects in a scene. However, applying existing object detection algorithms to game scenes does not guarantee the desired performance, since the algorithms are trained using datasets collected from the real world. In order to achieve the desired performance on game scenes, we built a dataset by collecting game scenes and retrained the object detection algorithms pre-trained with datasets from the real world. We selected five object detection algorithms, namely YOLOv3, Faster R-CNN, SSD, FPN and EfficientDet, and eight games from various game genres, including first-person shooting, role-playing, sports, and driving. PascalVOC and MS COCO were employed for the pre-training of the object detection algorithms. We demonstrated the improvement in performance achieved by our strategy in two aspects: recognition and localization. The improvement in recognition performance was measured using mean average precision (mAP) and the improvement in localization using intersection over union (IoU).


Introduction
Computer games have been one of the most popular applications for all generations since the dawn of the computing age. Recent progress in computer hardware and software has produced computer games of high quality. Nowadays, e-sports, i.e., playing or watching competitive computer games, have become some of the most popular sports: professional players compete in highly popular games, such as Starcraft and League of Legends (LoL), while millions of people watch them. Consequently, e-sports have become one of the most popular types of content on various media channels, including YouTube and TikTok. From these trends, analyzing game scenes by recognizing and localizing the objects in them has become an interesting research topic.
Among the many computer vision techniques, object recognition, object detection, localization, and segmentation are candidates for analyzing game scenes. Since analyzing game scenes requires both recognizing and localizing the objects in a scene, we select object detection algorithms for this task. Object detection algorithms can identify thousands of objects and draw bounding boxes around them in real time. At this point, a question arises in relation to applying object detection algorithms to game scenes: "Can object detection algorithms trained on real scenes be applied to game scenes?" Detecting objects in game scenes is not a straightforward problem that can be resolved by simply applying existing object detection algorithms. The recent progress in computing hardware and software techniques has brought diverse, visually pleasing rendering styles to computer games. Some games are rendered in a photorealistic style, while others are rendered in a cartoon style. Furthermore, the various depictions of a game scene, with various colors and tones, present a distinctive game scene style. Some cartoon-based games depict deformed characters and objects according to their original cartoons. Therefore, detecting various objects in diverse games can be challenging.
Existing deep-learning-based object detection algorithms show satisfactory detection performance for images captured from the real world. We selected five of the most widely used deep object detection algorithms: YOLOv3 [1], Faster R-CNN [2], SSD [3], FPN [4] and EfficientDet [5]. We also prepared two frequently used datasets, PascalVOC [6,7] and MS COCO [8], for training the object detection algorithms. We examined these algorithms in recognizing objects in game scenes.
We aimed to improve the object recognition performance of these algorithms by retraining them using game scenes. We prepared eight games spanning various genres, such as first-person shooting, racing, sports, and role-playing. Two of the selected games present cartoon-styled scenes. We excluded games featuring non-real objects: in many fantasy games, for example, dragons, orcs, and other non-existent characters appear, and existing object detection algorithms are not trained to detect them.
We also tested a data augmentation scheme that produces cartoon-styled images for the images in frequently used datasets. Several widely used image abstraction and cartoon-styled rendering algorithms were employed for the augmentation process. We retrained the algorithms using the augmented images and measured their performances.
To prove that the performance of the object detection algorithms was improved using game scene datasets, we performed the comparison for two cases. One case compared PascalVOC against PascalVOC with game scenes, and the other compared MS COCO against MS COCO with game scenes. For each case, the five object detection algorithms were pre-trained with the public dataset and their performance was measured. We then retrained the algorithms with game scenes and measured the performance again. These performances were compared to test our hypothesis that the object detection algorithms trained with both the public dataset and game scenes show better performance than the algorithms trained only with the public dataset.
We compared the pre-trained and retrained algorithms in terms of two metrics: mean average precision (mAP) and intersection over union (IoU). We examined the accuracy of recognizing objects with mAP and the accuracy of localizing objects with IoU. From this comparison, we could determine whether the existing object detection algorithms could be used for game scenes. Furthermore, we could also determine whether the object detection algorithms retrained with game scenes showed a significant difference from the pre-trained object detection algorithms.
The contributions of this study are summarized as follows:
• We built a dataset of game scenes collected from eight games.
• We presented a framework for improving the performance of object detection algorithms on game scenes by retraining them using game scene datasets.
• We tested whether images augmented using image abstraction and stylization schemes can improve the performance of the object detection algorithms on game scenes.
This study is organized as follows. Section 2 briefly explains deep-learning-based object detection algorithms and presents several works on object detection techniques in computer games. We elaborate on how we selected object detection algorithms and games in Section 3. In Section 4, we explain how we trained the algorithms and present the resulting figures. In Section 5, we analyze the results and answer our research questions. Finally, we conclude and suggest future directions in Section 6.

Deep Object Detection Approaches
Object detection, which extracts a bounding box around a target object in a scene, is one of the most popular research topics in computer vision. Many object detection algorithms have been presented since the emergence of the histogram of oriented gradients (HoG) algorithm [9]. Recently, the progress of deep learning techniques has greatly accelerated object detection research. Many recent works, including the you only look once (YOLO) series [1,10,11], the region-based convolutional neural network (R-CNN) series [2,12,13], spatial pyramid pooling (SPP) [14] and the single-shot multibox detector (SSD) [3], have demonstrated impressive results in detecting diverse objects from various scenes. YOLO detects objects by decomposing an image into S × S grid cells. At each cell, the network estimates B bounding boxes, each of which possesses a box confidence score, together with C conditional class probabilities. The class confidence score, which estimates the probability of an object in the cell belonging to a class, is computed by multiplying the box confidence score with the conditional class probability. YOLO is a CNN that estimates these class confidence scores for every cell. Although YOLOv1 [10] has a very fast computational speed, it suffers from a relatively low mAP and a limited number of detectable classes. Redmon et al. later presented YOLOv2, also known as YOLO9000, which detects 9000 object classes with improved precision [11]. They further improved YOLOv2's performance in YOLOv3 [1].
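The grid-based scoring above can be sketched as follows (a toy numpy example with random placeholder values, not the YOLO implementation itself):

```python
import numpy as np

# Toy sketch of YOLO-style class confidence scores. For an S x S grid
# with B boxes per cell and C classes, the class confidence of each box
# is its box confidence (objectness) multiplied by the cell's
# conditional class probabilities.
S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
rng = np.random.default_rng(0)

box_conf = rng.random((S, S, B))        # box confidence per box
class_prob = rng.random((S, S, C))      # P(class | object) per cell
class_prob /= class_prob.sum(axis=-1, keepdims=True)  # normalize per cell

# class_conf[i, j, b, c] = box_conf[i, j, b] * class_prob[i, j, c]
class_conf = box_conf[..., None] * class_prob[..., None, :]
print(class_conf.shape)  # (7, 7, 2, 20)
```

The broadcasted product gives one score per (cell, box, class) triple; thresholding these scores yields the candidate detections.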
R-CNN, another mainstream deep object detection approach, employs a two-pass scheme [12]. The first pass extracts candidate regions using a selective search or a region proposal network. In the second pass, the object in each candidate region is recognized and localized using a convolutional network. Girshick presented Fast R-CNN, which improves computational efficiency [13], and Ren et al. presented Faster R-CNN [2].
The SPP algorithm allows input of arbitrary size for object detection [14]. Instead of cropping or warping input images, which distorts the result, it introduces an SPP layer before the fully connected (FC) layer to fix the size of the feature vectors extracted from the convolution layers. SSD addresses a weakness of YOLO, which neglects objects smaller than the grid cells [3]. The SSD algorithm applies object detection to each feature map extracted through a series of convolutional layers, and the detected information is merged into a final detection result by executing a fast non-maximum suppression.
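The non-maximum suppression step that merges per-feature-map detections can be sketched as follows (a minimal greedy variant for illustration, not the authors' implementation):

```python
import numpy as np

# Greedy non-maximum suppression: repeatedly keep the highest-scoring
# box and discard remaining boxes that overlap it too strongly.
def nms(boxes, scores, iou_thresh=0.5):
    """boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,). Returns kept indices."""
    order = np.argsort(scores)[::-1]        # indices by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the top box with the remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavy overlaps
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2]: the two overlapping boxes collapse to one
```

Production detectors typically run a vectorized variant of this per class, but the logic is the same.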
FPN builds a pyramid structure over the image features by progressively reducing their resolution [4]. After the features are extracted bottom-up, FPN merges them in a top-down pathway: the semantically rich features from the low-resolution levels are upsampled and combined with the features of the high-resolution levels. In this way, the pyramid structure propagates the stronger semantics of the low-resolution features to every level, so FPN extracts features from the input image in a convincing way.
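The top-down merging can be illustrated schematically (a sketch with constant placeholder feature maps; the real FPN applies learned 1 × 1 lateral and 3 × 3 output convolutions, which are omitted here):

```python
import numpy as np

# Schematic FPN-style top-down merge: coarse, semantically rich features
# are upsampled and added to the next finer level, so coarse semantics
# enrich every pyramid level.
def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_merge(pyramid):
    """pyramid: list of (C, H, W) maps ordered fine to coarse."""
    merged = [pyramid[-1]]                  # start at the coarsest level
    for feat in reversed(pyramid[:-1]):
        merged.append(feat + upsample2x(merged[-1]))
    return merged[::-1]                     # back to fine-to-coarse order

c3 = np.ones((8, 16, 16))   # fine level
c4 = np.ones((8, 8, 8))     # middle level
c5 = np.ones((8, 4, 4))     # coarse level
p3, p4, p5 = top_down_merge([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (8, 16, 16) (8, 8, 8) (8, 4, 4)
```

Each output level keeps the resolution of its input level while accumulating information from all coarser levels.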

Object Detection in a Game
Utsumi et al. [15] presented a classical object detection and tracking method for soccer games in the early days. They employed color rarity and local edge properties for their object detection scheme, extracting objects with strong edges from a roughly single-colored background. On real soccer game scenes, their model showed a comparatively high detection rate.
Many researchers have applied the recent progress of deep-learning-based object detection algorithms to individual games.
Chen and Yi [16] presented a deep Q-learning approach for detecting objects in 30 classes from the classic game Super Smash Bros. They proposed a single-frame 19-layer CNN model with five convolution layers, three pooling layers and three FC layers. Their model recorded 80% top-1 and 96% top-3 accuracies.
Sundareson [17] designed a specific data flow for in-game object classification, also aimed at detecting objects in virtual reality (VR). They converted 4K input images into 256 × 256 resolution for efficiency. Their model exhibits very competitive performance for in-game and in-VR object classification using CUDA.
Venkatesh [18] surveyed object tracking and proposed SmashNet, a CNN-based object tracking scheme for games. The model recorded 68.25% classification precision for four characters in fighting games by employing very effective structures. The author also developed KirbuBot, which performs basic commands based on the positions of two tracked characters.
Liu et al. [19] employed Faster R-CNN to implement the vision system of a game robot. They extracted features and a position label mapping for a single object using ResNet100. The movement of objects in a game is tracked using the robot's camera to improve the accuracy and speed of recognition.
Chen et al. [20] attempted to address the multi-object tracking problem, which is crucial in video analysis, surveillance and robot control, using a deep-learning-based tracking method. They applied their method to a football video to demonstrate the performance of their method.
Tolmacheva et al. [21] used a YOLOv2 model to track a puck in an air hockey game. Air hockey is an arcade game played by two players who aim to push a puck into the opponent's goal by moving a small hand-held mallet. Since the puck moves at great velocity, exactly detecting and tracking it is a challenging problem. They collected and prepared datasets from game images to predict the trajectory of the puck. Using a C implementation of YOLOv2, they recorded 80% detection accuracy.
Yao et al. [22] presented a two-stage algorithm to detect and recognize "hero" characters from a video game named Honor of Kings. They applied a template-matching method to detect all heroes in the frames and devised a deep convolutional network to recognize the name of each detected hero. They employed InceptionNet and recorded a 99.6% F1 score with less than 5 ms recognition time.
Spijkerman and van der Harr [23] presented a vehicle recognition scheme for Formula One game scenes using several object detection algorithms, including HoG, a support vector machine and Faster R-CNN. They trained their models using images captured from the F1 2019 video game. The precision and recall scores of their R-CNN-based model, the best among the three, were 97% and 99%, respectively. They applied the trained R-CNN model to real objects and achieved 93% precision.
Kim et al. [24] improved the performance of a safety zone monitoring system using game-engine-based internal traffic control plans (ITCPs). They used a deep-learning-based object detection algorithm to recognize and detect workers and equipment from aerial images. They also monitored unsafe activities of workers by checking four rules. Through this approach, they emphasized the importance of a digital ITCP-based safety monitoring system.
Recently, the YOLO model has been employed with Unity to present a very effective model for object detection in games [25].

Selected Deep Object Detection Algorithms
We found many excellent deep object detection algorithms in the recent literature. Among these algorithms, we selected the most highly cited ones: YOLO [1,10,11], R-CNN [2,12,13], and SSD [3]. Among the various versions of the YOLO algorithm, we selected YOLOv3 [1], the most refined version. For the R-CNN algorithms, we selected Faster R-CNN [2], the cutting-edge version of the R-CNN series. Although these algorithms are highly cited, we also needed to include more recent algorithms; therefore, we additionally selected FPN [4] and EfficientDet [5]. In total, we compared five deep object detection algorithms in our study: YOLOv3, Faster R-CNN, SSD, FPN and EfficientDet. The architectures of these algorithms are compared in Figure 1.

Selected Games
We had three strategies for selecting games in our study. The first strategy was to select games over various game genres. Therefore, we referred to Wikipedia [26] and sampled game genres including action, adventure, role-playing, simulation, strategy, and sports. The second strategy was to exclude games with objects that existing object detection algorithms cannot recognize. Many role-playing games include fantasy creatures such as dragons, wyverns, titans, or orcs, which are not recognized by existing algorithms. We also excluded strategy games, since they include weapons such as tanks, machine guns, and jet fighters that are not recognized. Our third strategy was to sample both photorealistically rendered games and cartoon-rendered games. Although most games are rendered photorealistically, some games employ cartoon-styled rendering for their uniqueness; games whose original story is based on cartoons tend to preserve cartoon-styled rendering. Therefore, we sampled cartoon-rendered games to test how well the selected algorithms can detect cartoon-styled objects.
We selected games from these genres as evenly as possible. For action and adventure, we selected 7 Days to Die [27], Left 4 Dead 2 [28] and Gangstar New Orleans [29]. For simulation, we selected Sims4 [30], Animal Crossing [31], and Doraemon [32]. For sports and racing, we selected Asphalt 8 [33] and FIFA 20 [34]. Among these games, Animal Crossing and Doraemon are rendered in a cartoon style. Figure 2 shows illustrations of the selected games.

Training
We retrained the existing object detection algorithms using two datasets: PascalVOC and game scenes. We sampled 800 game scenes: 100 scenes from each of the 8 games we selected. We augmented the sampled game scenes using various schemes: flipping, rotation, and hue and tone adjustment. By combining these augmentation schemes, we could build more than 10,000 game scenes for retraining the selected algorithms.
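The augmentation schemes above can be sketched at the array level (a minimal illustration; the actual pipeline and its parameters are not specified here, and in practice the bounding-box annotations must be transformed along with each image):

```python
import numpy as np

# Array-level sketch of the augmentation schemes (flip, rotation, hue
# and tone adjustment) applied to an H x W x 3 RGB game-scene frame.
def augment(img):
    """Return a list of augmented variants of an RGB image array."""
    variants = [
        np.flip(img, axis=1),              # horizontal flip
        np.rot90(img, k=1, axes=(0, 1)),   # 90-degree rotation
        np.roll(img, shift=1, axis=2),     # crude hue shift: rotate channels
        np.clip(img * 0.8 + 20, 0, 255),   # tone: lower contrast, lift brightness
    ]
    return [v.astype(img.dtype) for v in variants]

frame = np.random.default_rng(1).integers(0, 256, (64, 96, 3), dtype=np.uint8)
for v in augment(frame):
    print(v.shape)
```

Chaining these transforms (e.g., flip then tone shift) multiplies the number of variants, which is how a few hundred source scenes grow into a training set of thousands.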
We trained and tested the algorithms on a personal computer with an Intel Core i7 CPU and an NVIDIA RTX 2080 GPU. The time required for retraining the algorithms is presented in Table 1.

Results
Result images for eight sampled scenes, comparing the pre-trained and retrained algorithms, are presented in Appendix A. We present our results in three parts: recognition performance measured by mAP, localization performance measured by IoU, and various statistics. We measured mAP, IoU and various statistical values, including average IoU, precision, recall, F1 score and accuracy, for the five object detection algorithms with the two datasets.

Measuring and Comparing Recognition Performance Using mAP
In Table 2, we compare mAP values for the five algorithms between the Pascal VOC dataset and the Pascal VOC dataset with game scenes. We show the same comparison on the MS COCO dataset in Table 3. In Figure 3, we illustrate the comparisons presented in Tables 2 and 3.
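For reference, the average precision underlying mAP can be computed per class as follows (a standard all-points interpolation sketch with made-up detections, not the evaluation code used for Tables 2 and 3):

```python
import numpy as np

# Average precision (AP) for one class: detections are sorted by
# confidence, precision/recall are accumulated, and the precision
# envelope is integrated over recall. mAP is the mean of the per-class APs.
def average_precision(is_tp, n_gt):
    """is_tp: True/False per detection, sorted by descending confidence;
    n_gt: number of ground-truth boxes for the class."""
    is_tp = np.asarray(is_tp, dtype=bool)
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # pad, make precision monotonically decreasing, integrate over recall
    prec = np.concatenate(([0.0], precision, [0.0]))
    rec = np.concatenate(([0.0], recall, [1.0]))
    prec = np.maximum.accumulate(prec[::-1])[::-1]
    return float(np.sum((rec[1:] - rec[:-1]) * prec[1:]))

# three detections (hit, miss, hit) against two ground-truth boxes
ap = average_precision([True, False, True], n_gt=2)
print(round(ap, 3))  # 0.833
```

Whether a detection counts as a true positive is decided by an IoU threshold against the ground-truth boxes (commonly 0.5).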

Measuring and Comparing Localization Performance Using IoU
In Table 4, we compare IoU values of the five algorithms between the Pascal VOC dataset and the Pascal VOC dataset with game scenes. We show the same comparison on the MS COCO dataset in Table 5. In Figure 4, we illustrate the comparisons presented in Tables 4 and 5.
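For reference, the IoU metric behind these tables can be computed as follows (a minimal sketch of the standard definition):

```python
# IoU of a predicted and a ground-truth box, both as [x1, y1, x2, y2]:
# the area of their intersection divided by the area of their union.
def iou(a, b):
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou([0, 0, 10, 10], [5, 5, 15, 15]))  # 25 / 175, about 0.143
```

An IoU of 1 means the predicted box coincides with the ground truth; higher average IoU therefore indicates better localization.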

Measuring and Comparing Various Statistics
In Tables 6 and 7, we present the average IoU, precision, recall, F1 score and accuracy of the five algorithms for the Pascal VOC dataset and the MS COCO dataset. In Figure 5, we illustrate the comparisons presented in Tables 6 and 7.

Figure 5. Mean IoU, precision, recall, F1 score and accuracy compared between the two dataset configurations. In the left column, blue bars denote the values from the models trained using PascalVOC only and red bars those from PascalVOC + game scenes. In the right column, blue bars denote the values from the models trained using MS COCO only and red bars those from MS COCO + game scenes.
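These statistics follow the standard definitions over true/false positive and negative counts, which can be sketched as follows (illustrative counts, not the paper's measurements):

```python
# Standard detection statistics from true/false positive/negative counts.
def detection_stats(tp, fp, fn, tn=0):
    precision = tp / (tp + fp)              # fraction of detections that are correct
    recall = tp / (tp + fn)                 # fraction of ground truth that is found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

p, r, f1, acc = detection_stats(tp=80, fp=10, fn=20)
print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f} accuracy={acc:.2f}")
```

Note that true negatives are ill-defined for detection (any empty region is a "negative"), so accuracy is usually reported from the same counts with tn = 0, as here.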

Analysis
To prove our claim that the object detection algorithms retrained with game scenes show better performance than those trained only with existing datasets such as Pascal VOC and MS COCO, we asked the following research questions (RQs).
RQ1. Does our strategy to retrain existing object detection algorithms with game scenes improve mAP?
RQ2. Does our strategy to retrain existing object detection algorithms with game scenes improve IoU?

Analysis of mAP Improvement
To answer RQ1, we compared and analyzed the mAP values presented in Tables 2 and 3, which compare the object detection algorithms trained only with the existing datasets against those retrained with game scenes. An overall observation reveals that the retrained object detection algorithms show better mAP than the pre-trained algorithms in 61 of all 80 cases. For further analysis, we performed a t-test and measured the effect size using Cohen's d value. Table 8 compares the p values for the five algorithms trained with PascalVOC and retrained with PascalVOC + game scenes. From the p values, we found that the results from three of the five algorithms are distinguished at p < 0.05; the results from EfficientDet are distinguished even at p < 0.01. Table 9 compares the p values for the five algorithms trained with MS COCO and retrained with MS COCO + game scenes. From the p values, we found that the results from four of the five algorithms are distinguished at p < 0.05.

t-Test
From these results, seven of all ten cases exhibit significantly distinguishable results at p < 0.05.
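A paired t-test of this kind can be sketched as follows (the per-image mAP scores below are hypothetical, chosen only to illustrate the computation):

```python
import numpy as np

# Paired t-test on hypothetical per-image mAP scores before and after
# retraining: the t statistic is the mean paired difference divided by
# its standard error.
pre = np.array([0.52, 0.48, 0.61, 0.55, 0.43, 0.58, 0.50, 0.47])
post = np.array([0.60, 0.55, 0.63, 0.62, 0.51, 0.64, 0.58, 0.49])

diff = post - pre
n = diff.size
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
print(f"t = {t_stat:.2f} on {n - 1} degrees of freedom")
# For df = 7, |t| > 2.365 rejects equality at the two-sided p < 0.05 level.
```

In practice a library routine such as scipy.stats.ttest_rel computes the same statistic together with its p value.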

Cohen's d
We also measured the effect size using Cohen's d value for the mAP values and present the results in Tables 10 and 11.
Since four Cohen's d values in Table 10 are greater than 0.8, we can conclude that the effect size of retraining the algorithms using game scenes is large for four algorithms.
The Cohen's d values measured from the MS COCO dataset are presented in Table 11, where four values are greater than 0.8. We can therefore conclude that the effect size of retraining the algorithms using game scenes is large for four algorithms as well.
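Cohen's d can be computed with a pooled standard deviation as follows (a sketch with hypothetical scores; one common convention, as the paper does not specify its exact formula):

```python
import numpy as np

# Cohen's d with a pooled standard deviation: the difference of the two
# group means divided by the pooled sample standard deviation.
def cohens_d(a, b):
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * np.var(a, ddof=1) +
                  (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(b) - np.mean(a)) / np.sqrt(pooled_var)

pre = np.array([0.40, 0.46, 0.52, 0.44, 0.50, 0.48])
post = np.array([0.50, 0.55, 0.58, 0.49, 0.57, 0.56])
d = cohens_d(pre, post)
print(f"d = {d:.2f}")  # d > 0.8 is conventionally a large effect
```

By the usual rule of thumb, d around 0.2 is a small effect, 0.5 medium, and above 0.8 large, which is the threshold used in Tables 10, 11, 14 and 15.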

Analysis on the Improvement of IoU
To answer RQ2, we compared and analyzed the IoU values presented in Tables 4 and 5, which compare the object detection algorithms trained only with the existing datasets against those retrained with game scenes. From these values, we found that the retrained object detection algorithms show better IoU in 68 of all 80 cases. For further analysis, we performed a t-test and measured the effect size using Cohen's d value. Table 12 compares the p values for the five algorithms trained with PascalVOC and with PascalVOC + game scenes. From the p values, we found that the results from all five algorithms are distinguished at p < 0.05. Therefore, our strategy to retrain the algorithms with game scenes shows a significant improvement in localization. Table 13 compares the p values for the five algorithms trained with MS COCO and with MS COCO + game scenes. From the p values, we found that the results from three algorithms are distinguished at p < 0.05.

t-Test
From these results, we have demonstrated that eight of all ten cases show significantly distinguishable results at p < 0.05.

Cohen's d

We also measured the effect size using Cohen's d value for the IoU values and present the results in Tables 14 and 15.
Since four Cohen's d values in Table 14 are greater than 0.8, we can conclude that the effect size of retraining the algorithms using game scenes is large for four algorithms.
The Cohen's d values measured from the MS COCO dataset are presented in Table 15, where three values are greater than 0.8. We can therefore conclude that the effect size of retraining the algorithms using game scenes is large for three algorithms.

In summary, mAP is improved in 61 of 80 cases and IoU in 68 of 80 cases. In the t-tests at p < 0.05, 7 of 10 cases showed a significant improvement for mAP and 8 of 10 cases for IoU. In the effect-size measurements, 8 of 10 cases showed a large effect size for mAP and 7 of 10 for IoU. Therefore, we can answer both research questions affirmatively: the object detection algorithms retrained with game scenes show improved mAP and IoU compared with the algorithms trained only with public datasets such as PascalVOC and MS COCO.

Training with Augmented Dataset
An interesting approach for improving the performance of object detection algorithms on game scenes is to employ augmented images from datasets such as Pascal VOC or MS COCO. In several studies, intentionally transformed images were generated and employed to train pedestrian detection [35,36]. In our approach, stylization schemes are employed to render images in styles resembling game scenes. The stylization schemes we employ include flow-based image abstraction with coherent lines [37], color abstraction using bilateral filters [38] and deep cartoon-styled rendering [39].
In our approach, we augmented 3000 images by applying the three stylization schemes [37][38][39] and retrained the object detection algorithms. Some of the augmented images are shown in Figure 6. In Table 16, we present a comparison between the mAP values from the pre-trained algorithms and the mAP values after retraining with the augmented images. We tested this approach for the Pascal VOC dataset.
Among the eight games used in the experiment, the scenes from Doraemon show styles similar to the augmented images. It is interesting to note that this approach shows somewhat improved results on the scenes from Doraemon. For the other game scenes, we could not recognize an improvement in the results. Figure 7 illustrates the comparison of three approaches: (i) trained with Pascal VOC, (ii) retrained with augmentation and (iii) retrained with game scenes.

Figure 6. Examples of augmented images: (b) is produced by flow-based image abstraction with coherent lines [37], (c) is produced by color abstraction using a bilateral filter [38] and (d) is produced by deep cartoon-styled rendering [39].

Conclusions and Future Work
This study demonstrated that object detection algorithms retrained using game scenes show improved performance compared with the algorithms trained only with public datasets. Pascal VOC and MS COCO, two of the most frequently used datasets, were employed for our study. We tested our approach on five widely used object detection algorithms, YOLOv3, SSD, Faster R-CNN, FPN and EfficientDet, and on eight games from various genres. We estimated mAP for the pre-trained and retrained algorithms to show that object recognition accuracy is improved, and we estimated IoU to show that the accuracy of localizing objects is improved. We also tested data augmentation schemes for our purpose; these showed very limited improvements that depended on the style of the game scenes.
We have two further research directions. One direction is to establish a dataset of game scenes to improve the performance of existing object detection algorithms on game scenes; we aim to include various non-existent characters such as dragons, elves or orcs. Another direction is to modify the structure of the object detection algorithms to optimize them for game scenes.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
In Appendix A, we present eight figures (Figures A1-A8) that sample the results for eight games by five important object detection algorithms: YOLOv3, SSD, Faster R-CNN, FPN and EfficientDet.