Object and Event Detection Pipeline for Rink Hockey Games

Abstract: All types of sports are potential application scenarios for automatic and real-time visual object and event detection. In rink hockey, the popular roller skate variant of team hockey, it is of great interest to automatically track player movements, positions, and sticks, and also to make other judgments, such as locating the ball. In this work, we present a real-time pipeline consisting of an object detection model specifically designed for rink hockey games, followed by a knowledge-based event detection module. Even in the presence of occlusions and fast movements, our deep learning object detection model effectively identifies and tracks important visual elements in real time, such as the ball, players, sticks, referees, crowd, goalkeeper, and goal. Using a curated dataset consisting of a collection of rink hockey videos containing 2525 annotated frames, we trained and evaluated the algorithm’s performance and compared it to state-of-the-art object detection techniques. Our object detection model, based on YOLOv7, presents a global accuracy of 80% and, according to our results, good performance in terms of accuracy and speed, making it a good choice for rink hockey applications. In our initial tests, the event detection module successfully detected an important event type in rink hockey games, namely, the occurrence of penalties.


Introduction
Rink hockey, also known as quad hockey or roller hockey, is a thrilling, fast-paced sport that captivates players and spectators alike. The game is played on a rink with roller skates, combining elements of ice hockey and roller skating. Historically, this sport has been particularly popular in four countries: Portugal, Spain, Italy, and Argentina. As the sport continues to evolve and gain popularity, deep learning and other artificial intelligence techniques have emerged as valuable tools to enhance various aspects of rink hockey, including game visualization, referee work, and training.
In the context of rink hockey, deep learning techniques can be applied to enhance the visualization of the game, thereby facilitating more accurate and detailed analysis of player movements, strategies, and game dynamics. Computer vision algorithms, when deployed in conjunction with strategically placed cameras, can be used to capture high-resolution footage of the rink. This footage is then processed using deep learning models to track the positions and actions of players, the ball, and even referees in real time.
This enhanced game visualization offers several benefits. Firstly, it provides a valuable tool for coaches, allowing them to analyze and strategize more effectively. Such data can be used to identify patterns in player movements, study tactical decisions, and gain insights into opponent behavior. By analyzing the data collected through deep learning techniques, coaches can make informed decisions and develop customized training programs to enhance their players' performance.
Moreover, the application of deep learning-based game visualization has the potential to transform the role of referees in rink hockey. Referees play a pivotal role in ensuring fair play and enforcing the rules of the game. However, due to the fast-paced nature of rink hockey, it can be challenging for referees to keep up with all the action and make accurate decisions in real time. The incorporation of deep learning techniques enables referees to receive real-time assistance through computer vision systems. Such systems are capable of tracking players, identifying fouls or infractions, and even providing instant feedback to referees. This reduces the potential for human error and improves the overall fairness of the game.
In addition to their application in game visualization and referee work, these automatic techniques can also significantly impact player training in rink hockey. Training in this sport requires a combination of technical abilities, physical conditioning, and tactical comprehension. Deep learning algorithms enable the creation of personalized and optimized training programs, designed to meet the specific needs of individual players. The data collected from game visualization can be used to identify strengths and weaknesses, monitor progress, and provide targeted feedback to players. This information can be used by coaches to design training sessions that address specific areas for improvement and enhance overall performance.
Furthermore, the application of deep learning techniques can facilitate the identification of potential injury risks through the analysis of player movements. By monitoring the performance data of players, coaches and medical staff are able to identify signs of fatigue, monitor load management, and make informed decisions to minimize the risk of injuries.
This paper proposes a real-time pipeline consisting of an object detection model specifically designed for rink hockey games, followed by a knowledge-based event detection module. Given the limited availability of developed content relating to this sport, we decided to create a dataset and test the effectiveness of deep learning and computer vision improvements in the real world. Object detection, specifically employing the YOLOv7 algorithm, can then be used to track the movement of the hockey ball and players on the rink, providing coaches, players, and fans with valuable information. Our paper also examines the advantages of using this approach for object and event detection in rink hockey, as well as its potential impact on the sport itself.
The remainder of the paper is organized as follows. Section 2 presents a discussion of related work. Section 3 presents our approach for the automatic analysis of rink hockey games, while Section 4 presents the evaluation and results. Finally, Section 5 presents the conclusions and outlines future work.

Computer Vision Applied to Hockey Sports
Rink hockey and ice hockey are two team sports that share numerous similarities, yet there are also notable differences between them. One of the most significant distinctions between the two sports is the playing surface and the equipment it requires. Rink hockey is typically played on a dry surface, such as a concrete rink or a sports court, and the playing area is rectangular (Figure 1 right). In contrast, ice hockey is played on an ice rink, which is typically oval in shape (Figure 1 left). The field material is a significant factor influencing the variations in movement patterns and techniques employed by players, as they must adapt to different surfaces and levels of friction and stability. In both sports, the participants use sticks to propel a hard ball or puck into the opposing team's goal; in rink hockey, the game is played on a hard, smooth surface, such as a basketball court or roller rink. The faster pace of play in roller hockey is a consequence of the players' greater mobility on roller skates. Furthermore, the regulations of the two sports diverge in certain respects. For example, in standard conditions on the playing rink, each team in rink hockey is composed of one goalkeeper and four field players. In ice hockey, each team typically comprises six players. Furthermore, the offside rule is frequently employed in ice hockey, in contrast to rink hockey.
The latest innovations in object detection techniques have resulted in considerable increases in both the accuracy and speed of object detection [1]. Convolutional neural networks (CNNs) represent one of the most extensively used techniques for object detection, having obtained state-of-the-art results on diverse object detection benchmarks [2]. One of the primary challenges in rink hockey object detection is the substantial diversity in the appearance of objects due to factors such as lighting conditions, player uniforms, and background noise. Researchers have proposed a number of approaches to address this challenge, including the use of multiple CNNs with distinct architectures [3] and the use of data augmentation techniques to increase the diversity of the training data [4].
In rink hockey, these techniques can be employed for a multitude of purposes, including player tracking, ball tracking, and game analysis. Unfortunately, to date, there has been little work done on object detection in rink hockey, in terms of both the generation of datasets and the implementation of deep learning techniques for the associated predictions. For ice hockey, however, a large body of work is available, including tracking and identification of players, characteristics of the playing field, and match analysis [5][6][7].
In [5], a system for automatically tracking and identifying players in NHL streamed videos was proposed. The system comprised three components: player tracking, team identification, and player identification. The authors tested five state-of-the-art tracking algorithms on the hockey player tracking dataset [8][9][10][11][12]. The most effective tracking performance was achieved by the MOT Neural Solver tracking model [10], which, re-trained on the hockey dataset, reached a Multi-Object Tracking Accuracy (MOTA) score of 94.5%. For the purpose of identifying teams, the away team jerseys were grouped into a single class, whereas the home team jerseys were grouped according to their color. A CNN trained on the team identification dataset achieved an accuracy of 97% on the test set. Furthermore, a novel player identification model was presented that employed a temporal one-dimensional convolutional network to identify players from sequences of player bounding boxes. Utilizing the available National Hockey League (NHL) game roster data, the player identification model achieved an accuracy of 83%.
In [6], the authors developed a cascaded convolutional neural network (CNN) model comprising two phases for the detection of ice hockey players. In the same work, the jersey color of each of the detected players was extracted in order to determine their team affiliations. A filter was incorporated into the proposed model to exclude distracting information, such as the audience and sideline advertising bars, thereby refining the detection of the targeted players. This resulted in accurate detection with a precision of 98.75% and a recall of 94.11% for individual players and an average accuracy of 93.05% for team classification using a dataset of collected images from the 2018 Winter Olympics. The authors argued that the custom-built dataset enabled their player detection model to achieve the best results when compared to some state-of-the-art approaches such as YOLOv3 [13], Faster R-CNN [14], and CornerNet [15].
Another challenge in rink hockey object detection is the need for real-time performance, given the fast-paced nature of the sport and the requirement that object detections be updated regularly. Researchers have proposed a number of solutions to this problem, including the use of lightweight CNN architectures [16] and the implementation of the object detection pipeline on specialized hardware like graphics processing units (GPUs) [17] or field-programmable gate arrays (FPGAs) [18].
In [16], a CNN with three branches and a classification network with four cascades was presented. The cascaded networks were initially trained using labeled image patches and subsequently applied to an entire image by employing a dilation testing strategy. The authors claimed that their approach achieved state-of-the-art accuracy on three types of games (basketball, soccer, and ice hockey) with 1000 fewer parameters than those of CNNs that were adapted from general object detection networks such as Faster R-CNN.
The work presented in [17] proposed an approach to locate the puck, which is the key component of an ice hockey game. One of the principal challenges is the puck's small size. The motion blur caused by the puck's rapid movement, the occlusion between the puck and other objects, and the visual noise (such as the advertisements on the rink) presented additional difficulties. The author proposed a two-stage model for the detection of very small objects. The initial stage involved a two-dimensional CNN that summarized the representation of each frame. In the second stage, the fusion of the per-frame representations was fed into a three-dimensional CNN in order to decode the video's temporal information. The proposed approach achieved a precision of 90.8% and a recall of 86.7%. The F1 score reached 88.7%, which is higher than that of YOLOv3 [13] (F1 score = 68.5%) and Mask R-CNN [19] (F1 score = 74.9%).
In [18], a hardware accelerator was presented that implemented a YOLO CNN for real-time object detection with high throughput and power efficiency. The parameters of the implemented YOLO were retrained and also quantized using binary weights and flexible low-bit activations. The binary weight technique enabled the storage of the entire network model in block RAMs of a field-programmable gate array (FPGA). This approach resulted in a reduction in off-chip accesses, thereby achieving a performance boost. All convolutional layers were pipelined in order to optimize hardware utilization. Additionally, the input images were delivered to the hardware accelerator and processed in a sequential manner. Similarly, the output of each layer was transmitted in a sequential manner to the subsequent layer. The intermediate data were fully reused across layers, thereby eliminating external memory accesses. The reduction in DRAM accesses in turn reduced power consumption. Furthermore, the authors argued that the fully parameterized convolutional layers facilitated straightforward scaling of the network. Finally, each convolution layer was mapped to a dedicated block of hardware, thereby enabling the network to outperform other solution designs in terms of energy efficiency and performance.
In [20], a people tracking algorithm tailored for soccer players was presented. The algorithm was designed to cope with a wide range of scenarios, encompassing varying light conditions, high frame rates, and real-time processing. The algorithm employed background subtraction for object segmentation, with the addition of a novel method utilizing pixel energy evaluation to address the challenges of handling moving objects and lighting fluctuations during background modeling. Subsequently, an unsupervised clustering algorithm was employed to classify the detected objects, addressing challenges associated with blob splitting and merging. A stochastic approach, which employs maximum a posteriori probability (MAP), was proposed for the purpose of human tracking. The algorithm initially assessed geometric information regarding blob overlapping and subsequently employed color feature classification to track players and resolve blob merging scenarios. Experimental validations conducted on extensive soccer image sequences across various weather and lighting conditions demonstrated the effectiveness of the algorithm.
Table 1 provides a summary of the articles discussed in this section, accompanied by a comparison of their respective characteristics. To the best of our knowledge, our approach represents the only solution specifically designed and tested for rink hockey. This sport presents certain distinctive characteristics, and it benefits from having a customized approach. Our approach is suitable for real-time processing, even when utilizing a low-cost processing unit. It is worth noting that, in our tests, we prioritized temporal performance. The YOLOv7 profile was employed as the baseline model for the object detector stage, with a considerable number of parameters (36.9 M parameters, as illustrated in Table 1 of [21]). In our pipeline, we employ object detection as a method to track the semantically meaningful objects in the visual scene. This approach became possible due to the high quality of object detectors that have recently become available.

Automatic Pipeline for Object and Event Detection in Rink Hockey Games
One major contribution of this work is the specification, implementation, and testing of an automatic pipeline capable of detecting and tracking players, the ball, and other in-game objects in real time during fast-paced rink hockey games, as well as some important events, as can be seen in Figure 2. The pipeline was designed to meet the following requirements:
• Object detection capabilities for player tracking, ball tracking, and video analysis.
• Ability to detect and represent major events based on the stream of visual objects.

Dataset Organization
A major contribution of this work is the organization of the dataset. Our rink hockey dataset was annotated with the aid of a web-based computer vision annotation tool that helps to label video and images. The selected solution, Roboflow, is a popular computer vision development platform that facilitates data collection, preprocessing, and model training. This framework allows easy access to public datasets as well as the ability for users to upload their own custom data [23]. Additionally, it supports various annotation formats, including JSON, CSV, and XML.
The dataset was divided into three subsets: training (70%), validation (20%), and testing (10%). The dataset included objects labeled with seven classes: ball, player, stick, referee, crowd, goalkeeper, and goal. The main goal of this annotation process was to produce a dataset that can be used to train object detection models, such as YOLOv7 [24], for transfer learning.
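As an illustration of the 70/20/10 partition, the following is a minimal sketch of a reproducible shuffled split. The function name `split_dataset`, the seed, and the frame file names are hypothetical and are not taken from the project repository; the split reported in this work (1764/501/260 frames) was produced with the annotation platform and therefore differs slightly from the exact proportions computed here.

```python
import random


def split_dataset(items, train=0.7, val=0.2, seed=0):
    """Shuffle items and split them into train/val/test lists.

    The test subset receives whatever remains after train and val.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical file names standing in for the 2525 annotated frames.
frames = [f"frame_{i:04d}.jpg" for i in range(2525)]
train_set, val_set, test_set = split_dataset(frames)
print(len(train_set), len(val_set), len(test_set))
```

Splitting by whole video rather than by frame would reduce the risk of near-duplicate frames leaking between subsets; the sketch above does not address that.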
Of the 2525 annotated frames/images (1764 training, 501 validation, and 260 testing), image augmentation was applied to the training images only: horizontal flipping, cropping with a zoom of up to 20%, and brightness changes between −15% (darker) and +15% (brighter). This resulted in a total of 5292 images for training. Through these and other possible modifications, the size of the dataset can be artificially increased. This technique is used not only to increase the size of the dataset, and hence the robustness and generalization of the model, but also to help avoid overfitting, where a trained model performs well on the training data but poorly on unseen data. In addition to the previously mentioned image augmentation in the training set, we also used the default data augmentation techniques configured in the training process of YOLOv7. Of these techniques, we emphasize mosaic, which combines several images to create a single training image with a mosaic appearance [25]. As Figure 3 shows, the dataset is slightly unbalanced, with the player and stick classes being the most represented. This is not surprising, since these are the most visible objects in all images.
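The effect of two of these augmentations on annotated data can be sketched as follows. This is an illustrative sketch, not the platform's implementation: the helper names are hypothetical, the box layout assumes YOLO-style normalized annotations (class, x-center, y-center, width, height), the brightness change is assumed to be a simple multiplicative scaling, and the factor of three outputs per training image is inferred from 5292 / 1764 = 3.

```python
def hflip_yolo_box(box):
    """Mirror a normalized YOLO-style box (class_id, x_c, y_c, w, h) horizontally.

    Only the x-center changes; size and vertical position are preserved.
    """
    class_id, x_c, y_c, w, h = box
    return (class_id, 1.0 - x_c, y_c, w, h)


def adjust_brightness(pixels, delta):
    """Scale 8-bit intensities by (1 + delta) and clamp to the range 0..255.

    delta is expected in [-0.15, 0.15] (assumed multiplicative model).
    """
    return [min(255, max(0, round(p * (1.0 + delta)))) for p in pixels]


def augmented_total(n_train, outputs_per_image=3):
    """Number of training images after augmentation (factor inferred from the text)."""
    return n_train * outputs_per_image


ball = (0, 0.25, 0.60, 0.02, 0.03)  # hypothetical annotation
print(hflip_yolo_box(ball))
print(adjust_brightness([0, 100, 250], 0.15))
print(augmented_total(1764))
```

Note that a horizontal flip must be applied to the labels as well as to the pixels; otherwise the boxes no longer match the objects.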
The resulting dataset is available online [22]. It can be useful for a wide range of applications, including player tracking and video analysis of rink hockey games. Being publicly available, this dataset can serve as a valuable resource for other researchers and practitioners within the field of computer vision and sports analysis.

YOLOv7
YOLO (You Only Look Once) [24] is a state-of-the-art object detection system that is highly valuable in the domain of computer vision. In contrast to traditional object detection systems that run a classifier on various parts of the image independently, YOLO innovatively applies a single neural network to the entire image at once. The image is divided into regions, and then the system predicts bounding boxes and probabilities for each region. As such, YOLO significantly enhances the efficiency and speed of real-time object detection, making it a pivotal asset in a wide range of applications. Specifically in the context of hockey, YOLO can be effectively utilized to track the movements of players and the puck in the arena. This can furnish extensive insights into the analysis of gameplay, performance review of players, and the development of strategic gameplay. It is noteworthy that YOLO is not a standalone tool, but rather, it refers to a type of neural network architecture and algorithm. It is commonly implemented using comprehensive deep learning libraries, such as TensorFlow or PyTorch.
The authors in [21] present another version of YOLO, YOLOv7. Its base architecture performs significantly better than prior YOLO versions, such as YOLOv4 and YOLOv5, and it has reached state-of-the-art performance on numerous object detection benchmarks, such as the COCO dataset [26]. The use of a "bag-of-freebies" is one key feature of YOLOv7. This refers to a set of training strategies that do not require additional computational resources at inference time and may thus be utilized to increase the performance of the model without significantly increasing its complexity. Data augmentation, network architecture changes, and multiscale training are examples of these strategies. By using these techniques, this version is able to detect objects at many scales, resulting in better results on object detection tasks while still running in real time on standard hardware. Overall, apart from its high-precision object detection abilities, YOLOv7 is also known for its efficiency and speed, delivering outstanding performance without losing speed or simplicity, making it a valuable tool for a wide range of applications. Typically, a visual object detection solution, such as YOLOv7, offers several model options with increasing degrees of complexity, that is, with an increasing number of parameters. In the case of YOLOv7, the models range from YOLOv7-tiny to YOLOv7-E6E. There is a trade-off between speed and accuracy: simpler models are faster, whereas more complex models are more accurate. Figure 4 illustrates a comparison between YOLOv7 and other important visual object detectors, showing its better performance. These characteristics make YOLOv7 an appropriate base model for a real-time application, such as the solution proposed in this article, and justify its use in the tests we carried out.

Object Detection for Rink Hockey
In this section, we present the technical details of the developed (https://github.com/LuisMota1999/ProjetoVC, accessed on 19 January 2023) object detection system for rink hockey utilizing YOLOv7. To facilitate the training of the model, we constructed a dataset composed of various rink hockey scenarios, which were carefully annotated and cover different perspectives, lighting conditions, and object scales, complemented by data augmentation. The YOLOv7 architecture was implemented using the PyTorch framework [27], and the pretrained weights were fine-tuned on our dataset. Of the several YOLOv7 pre-trained weights available (see Figure 4), we selected, in our tests, the YOLOv7 model (AP min-val = 51.20%), a good compromise between complexity and real-time performance (https://github.com/WongKinYiu/yolov7/tree/main, accessed on 19 January 2023). Our experiments demonstrated that the model achieved high accuracy in detecting rink hockey objects such as the crowd, goalkeeper, and players. For real-time object detection, we used multiple online videos from professional rink hockey games [28][29][30][31][32][33][34][35][36][37] to acquire the feed and processed the frames using the trained YOLOv7 model. Our results demonstrate the ability of our model to detect visual objects in the video with low latency, thereby making it suitable for real-time applications. Additionally, we leveraged CUDA and cuDNN [38] for GPU acceleration, which significantly improved the processing speed of the model. The combination of technologies such as CUDA, OpenCV, and PyTorch has enabled the development of a robust and efficient object detection system for rink hockey.

Event Detection Rules Module
Event detection follows object detection in the pipeline shown in Figure 2. The proposed specification of an event detection module, whose input is a timeline stream of detected visual objects (that is, the output of the object detection model), can comprise up to four stages, as follows:
1. Filtering: due to the noisy nature of the object detection model, a filtering operator, such as a windowed spatio-temporal median filter, can be used as a pre-processing stage before the application of a rule-based detector.
2. Model-based Rule Detection: rule-based system that can detect predefined types of events.
3. Representation: the language or taxonomy used to represent the events. Event calculus is a formal method of event representation [39].
4. Revision: can be used to revise or update represented event beliefs due to new information available in the stream of detected visual objects.
This module follows the object detection module and, consequently, depends on it according to the directed acyclic graph (DAG) presented in Figure 2. In the preliminary tests performed, a proof-of-concept approach was adopted for this module, and the detection of a single type of event was considered for this article.
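The filtering stage can be illustrated with a minimal sketch of a windowed median filter applied to the per-frame count of one object class. This is a simplification under stated assumptions: it filters only along the temporal axis (the spatial part of a spatio-temporal filter is omitted), and the function name, window size, and example counts are hypothetical.

```python
from statistics import median_low


def temporal_median(values, window=5):
    """Centered sliding-window median over a sequence of per-frame values.

    Windows are truncated at both ends of the sequence. median_low keeps
    the result an actual observed value (no averaging of two middle items).
    """
    half = window // 2
    filtered = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        filtered.append(median_low(values[lo:hi]))
    return filtered


# Hypothetical referee counts per frame: one missed detection (0)
# and one burst of false positives (3).
referee_counts = [1, 1, 0, 1, 1, 1, 3, 1, 1]
print(temporal_median(referee_counts, window=3))
```

A wider window suppresses longer detection dropouts but delays the response to genuine changes in the scene, which matters when events must be reported in real time.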

Object Detection
In our experiments, we trained a YOLOv7-based model for object detection using footage acquired mainly from professional rink hockey games. We performed both qualitative and quantitative evaluations to assess the performance of this model.
The evaluation process consisted of applying YOLOv7 to detect objects of interest in multiple video frames from rink hockey games [40,41] and comparing the results to manually annotated ground truth labels. We used precision (p), recall (r), and F1 score (F1), the harmonic mean of precision and recall, as the three quantitative metrics to evaluate the performance of the system, according to the following equations, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively:
$$p = \frac{TP}{TP + FP}, \qquad r = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2\,p\,r}{p + r}$$
The model was trained using 100 epochs and a batch size of 32, with the system being implemented on a Windows desktop with a Zotac Gaming GeForce RTX 3090 Trinity OC 24 GB GDDR6X GPU and 64 GB of RAM, using the Darknet framework in Python 3.7.0.
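The three metrics can be computed directly from detection counts, as in the following minimal sketch. It uses the standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN); the function name and the example counts are hypothetical and do not correspond to results reported in this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from detection counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Hypothetical counts for one class on a test subset.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=20)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Because F1 is a harmonic mean, it is dominated by the smaller of the two values, so a detector cannot reach a high F1 by trading away recall for precision or vice versa.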
When testing on frames not seen in the training phase (present in the test subset), the overall performance was quite good, with an accuracy of around 80%.
The results of the evaluation are presented in the following figures. Figure 5 presents a multi-frame inferred plot of the test subset, showing the labels that the trained model detected. Figure 6 represents the trade-off between precision and recall in the classification model. As can be seen from the graph, the precision-recall values are higher for the player, goalkeeper, and goal classes, as we had more annotated images for these classes. However, the precision-recall for the ball is lower, as we have fewer annotated images for this class. It is important to note that, in our evaluation, we only considered detections with a confidence score above 0.3. This threshold was chosen to balance the trade-off between detecting more objects and maintaining a slightly higher precision.
Also, as seen in Figure 6, our trained model obtained a score of 0.764 mAP@0.5. Here, 0.764 is the mAP score, i.e., the mean average precision (the AP is calculated for each class separately and then averaged over all classes), which ranges between 0 and 1. The parameter 0.5 is the selected IoU (intersection over union) threshold; in other words, a detection is counted as correct only if its bounding box overlaps the ground truth box with an IoU of at least 0.5. In the context of this work, a mAP score of 0.764 is a fairly good score for an object detection model.
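To make the IoU criterion concrete, the following minimal sketch computes the IoU of two axis-aligned boxes and averages per-class AP values into a mAP score. The corner-based box format (x1, y1, x2, y2), the function names, the class names, and the AP values are assumptions made for illustration only; they are not the per-class results of this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_average_precision(ap_per_class):
    """mAP: the unweighted mean of the per-class average precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


prediction = (0, 0, 10, 10)
ground_truth = (5, 0, 15, 10)
overlap = iou(prediction, ground_truth)
print(overlap, overlap >= 0.5)  # this pair would not count as a match at IoU 0.5

# Hypothetical per-class AP values, for illustration only.
print(mean_average_precision({"class_a": 0.9, "class_b": 0.6}))
```

Because mAP is an unweighted mean over classes, a poorly detected minority class lowers the score as much as a poorly detected majority class would.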
We can also use the F1 score to evaluate the model performance. The F1 score is calculated as the harmonic mean of precision and recall. This metric also has a range of 0 to 1, with 1 representing perfect precision and recall and 0 representing the worst possible value. Precision reflects the fraction of samples predicted as positive that are actually positive (ground truth), and recall measures the fraction of actual positive samples that were correctly predicted as positive. As shown in Figure 7, the model has a maximum F1 score of 0.79 at a confidence threshold of 0.240. Analyzing this graph reveals that there is a large plateau with F1 > 0.75 approximately parallel to the confidence axis, which can be considered a very good indicator of the model behavior. The metrics F1 score, precision, and recall are all derived from the values of the confusion matrix, which is another useful tool for understanding the performance of a classification model, since it shows where the model is making mistakes and how it performs on different classes. In this case, as can be seen in the normalized confusion matrix in Figure 8, obtained when evaluating the test subset (that is, on the images that were not seen by the model while training), we got an accuracy of 80%. This figure also shows that the ball and crowd classes were the ones with the worst results.
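The F1-versus-confidence curve discussed above can be approximated by sweeping a confidence threshold over scored detections, as in this minimal sketch. It is a simplification: each detection is assumed to have already been matched to the ground truth (so its true/false positive status does not change with the threshold), and the function name, detections, and thresholds are hypothetical.

```python
def best_f1_threshold(detections, n_ground_truth, thresholds):
    """Return (best_f1, best_threshold) over the candidate thresholds.

    detections: list of (confidence, is_true_positive) pairs.
    n_ground_truth: number of annotated objects in the evaluated frames.
    """
    best_f1, best_t = 0.0, None
    for t in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= t]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = n_ground_truth - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # equivalent to 2pr / (p + r)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t


# Hypothetical scored detections for four annotated objects.
detections = [(0.9, True), (0.8, True), (0.6, False),
              (0.4, True), (0.2, False), (0.1, False)]
print(best_f1_threshold(detections, n_ground_truth=4,
                        thresholds=[0.1, 0.3, 0.5, 0.7]))
```

A wide plateau in such a curve means the detector's F1 score is not sensitive to the exact threshold chosen, which is convenient when the operating conditions change between venues.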

Event Detection
For the rule production system used for event detection, one rule, illustrated in Algorithm 1, was tested to detect the occurrence of penalties or direct free hits, according to the existing official rink hockey regulations [42]. This rule considers several predicates. The predicate T(expression, t1, t2) is true if expression is true for the temporal interval between t1 and t2. The remaining predicates are self-explanatory and take, as their argument, the updated list LO of timestamped and automatically detected visual objects. Predicates starting with the prefix "is" are true when at least one visual object of that class is present in the frame. Predicates starting with the prefix "count" return the total number of visual objects of that class in the frame.
According to the current regulations [42], the player taking the penalty or the direct free hit has, after the indication of the main referee, a maximum of five seconds to start the execution with the ball stopped. Thus, in this rule, the temporal window duration, ∆t = (t2 − t1), should be appropriately chosen. In our qualitative analysis, we set ∆t = 4 s. In the quantitative preliminary tests, we annotated a total of 23 penalty images: 11 from the training set, 7 from the validation set, and 5 from the test set. Figure 9 illustrates an example of a penalty detected by our rule, where six distinct visual object instances are considered. The penalty and direct free hit sanction executions are probably the most important ones in rink hockey. Thus, their detection is rather important as an event. Figure 5 (row, col) = (3, 4) illustrates an example of a penalty event during the game that was correctly detected by this rule.
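The penalty rule can be sketched in a few lines, under stated assumptions. The per-frame condition below requires at least one referee, one goal, and one goalkeeper, and exactly one player; the temporal predicate is evaluated over the frames whose timestamps fall within [t1, t2]. The function names, the timeline format, and the `require_all` switch (condition true in at least one frame versus in every frame of the window) are hypothetical and are not taken from Algorithm 1.

```python
from collections import Counter


def frame_satisfies_rule(labels):
    """Per-frame condition: referee, goal, and goalkeeper present, exactly one player."""
    counts = Counter(labels)  # missing classes count as zero
    return (counts["referee"] >= 1
            and counts["goal"] >= 1
            and counts["goalkeeper"] >= 1
            and counts["player"] == 1)


def detect_penalty(timeline, t1, t2, require_all=False):
    """Evaluate the rule over frames with t1 <= timestamp <= t2.

    timeline: list of (timestamp_seconds, [detected class labels]).
    require_all=False: true if any frame in the window satisfies the condition.
    require_all=True: true only if every frame in the window satisfies it.
    """
    window = [frame_satisfies_rule(labels)
              for ts, labels in timeline if t1 <= ts <= t2]
    if not window:
        return False
    return all(window) if require_all else any(window)


# Hypothetical detections: two players in the first frame, one afterwards.
timeline = [
    (0.0, ["player", "player", "referee", "goal", "goalkeeper"]),
    (1.0, ["player", "referee", "goal", "goalkeeper", "ball", "stick"]),
    (2.0, ["player", "referee", "goal", "goalkeeper"]),
]
print(detect_penalty(timeline, t1=0.0, t2=4.0))
print(detect_penalty(timeline, t1=0.0, t2=4.0, require_all=True))
```

Applying the rule to filtered detections rather than raw ones makes the "exactly one player" condition less sensitive to a single spurious or missed detection.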

Conclusions
The purpose of our investigation is to advance the field of research in object and event detection in rink hockey sports. This has the potential to significantly enhance the analysis and comprehension of this sport, which is characterized by fast-paced gameplay.
The use of object detection techniques has been instrumental in the tracking of objects in rink hockey videos. Nevertheless, the diversity in the appearance of objects and the need for real-time performance remain significant challenges in this field.
This research employed a state-of-the-art object detection methodology for rink hockey, utilizing the YOLOv7 algorithm in conjunction with a knowledge-based event detection module. This approach has demonstrated good object detection performance in terms of precision, even in the presence of occlusions and rapid motion.
For future work, it is essential to enlarge the dataset with more annotated data in order to balance the classes, and to keep testing and comparing our approach with other methods and techniques. Finally, it should be noted that this work could be practically applied in collaboration with roller hockey clubs and hockey federations. This would enable the development and testing of this approach, as well as the continuous evolution of this sport.

Figure 2. Pipeline for real-time detection of hockey objects and events.

• Predicate isReferee(LO) is satisfied by the visual object of class referee;
• Predicate isGoal(LO) is satisfied by the visual object of class goal;
• Predicate isGoalkeeper(LO) is satisfied by the visual object of class goalkeeper;
• Predicate countPlayer(LO) == 1 is satisfied by the presence of exactly one instance of class player per frame;
• Predicate T(f, t1, t2) is satisfied by the truthfulness of f, where f = isReferee(LO) ∧ isGoal(LO) ∧ isGoalkeeper(LO) ∧ countPlayer(LO) == 1 in at least one frame within the time interval [t1, t2] and according to the test criteria adopted.

Table 1. Comparative analysis between different computer vision systems.
• Real-time performance for updating object detections regularly during fast-paced games.
• Ability to handle diversity in the appearance of objects due to factors such as lighting conditions, player uniforms, and background noise.
• High accuracy in object detection.
• High speed response in object detection.
• Compatibility with specialized hardware such as GPUs.
• Adaptability to different environments and lighting conditions.