Search Results (19)

Search Parameters:
Keywords = egocentric video

18 pages, 2688 KiB  
Article
Generalized Hierarchical Co-Saliency Learning for Label-Efficient Tracking
by Jie Zhao, Ying Gao, Chunjuan Bo and Dong Wang
Sensors 2025, 25(15), 4691; https://doi.org/10.3390/s25154691 - 29 Jul 2025
Viewed by 135
Abstract
Visual object tracking is one of the core techniques in human-centered artificial intelligence and is highly useful for human–machine interaction. State-of-the-art tracking methods have shown robustness and accuracy on many challenges. However, a large number of videos with precise, dense annotations are required for fully supervised training of their models. Since annotating videos frame by frame is a labor- and time-consuming workload, reducing the reliance on manual annotations during training is an important problem to be resolved. To strike a trade-off between annotation costs and tracking performance, we propose a weakly supervised tracking method based on co-saliency learning, which can be flexibly integrated into various tracking frameworks to reduce annotation costs and further enhance the target representation in current search images. Since our method enables the model to explore valuable visual information from unlabeled frames and calculate co-salient attention maps based on multiple frames, our weakly supervised method obtains competitive performance compared to fully supervised baseline trackers while using only 3.33% of manual annotations. We integrate our method into two CNN-based trackers and a Transformer-based tracker; extensive experiments on four general tracking benchmarks demonstrate the effectiveness of our method. Furthermore, we also demonstrate the advantages of our method on the egocentric tracking task: our weakly supervised method obtains a success score of 0.538 on TREK-150, surpassing the prior state-of-the-art fully supervised tracker by 7.7%. Full article
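The co-salient attention idea described above can be illustrated with a minimal sketch (not the authors' implementation): features of the current search frame are compared against features pooled from several unlabeled reference frames, and locations that are consistently similar across frames are up-weighted. The tensor shapes and the `cosaliency_attention` helper are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cosaliency_attention(search_feat, ref_feats):
    """Re-weight search-frame features by their similarity to reference frames.

    search_feat: (C, H, W) feature map of the current search image.
    ref_feats:   (N, C, h, w) feature maps of N unlabeled reference frames.
    Returns the attended search features and the (H, W) co-salient map.
    """
    C, H, W = search_feat.shape
    # Global descriptors of the reference frames (one C-dim vector per frame).
    ref_desc = F.normalize(ref_feats.mean(dim=(2, 3)), dim=1)        # (N, C)
    search = F.normalize(search_feat.reshape(C, H * W), dim=0)       # (C, H*W)
    # Similarity of every search location to every reference descriptor.
    sim = ref_desc @ search                                          # (N, H*W)
    # Locations that are consistently similar across frames are "co-salient".
    comap = torch.sigmoid(sim.mean(dim=0)).reshape(H, W)             # (H, W)
    attended = search_feat * comap.unsqueeze(0)                      # (C, H, W)
    return attended, comap

# Toy usage with random features standing in for a CNN/Transformer backbone.
attended, comap = cosaliency_attention(torch.randn(256, 16, 16),
                                       torch.randn(4, 256, 16, 16))
```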

35 pages, 2865 KiB  
Article
eyeNotate: Interactive Annotation of Mobile Eye Tracking Data Based on Few-Shot Image Classification
by Michael Barz, Omair Shahzad Bhatti, Hasan Md Tusfiqur Alam, Duy Minh Ho Nguyen, Kristin Altmeyer, Sarah Malone and Daniel Sonntag
J. Eye Mov. Res. 2025, 18(4), 27; https://doi.org/10.3390/jemr18040027 - 7 Jul 2025
Viewed by 494
Abstract
Mobile eye tracking is an important tool in psychology and human-centered interaction design for understanding how people process visual scenes and user interfaces. However, analyzing recordings from head-mounted eye trackers, which typically include an egocentric video of the scene and a gaze signal, is a time-consuming and largely manual process. To address this challenge, we develop eyeNotate, a web-based annotation tool that enables semi-automatic data annotation and learns to improve from corrective user feedback. Users can manually map fixation events to areas of interest (AOIs) in a video-editing-style interface (baseline version). Further, our tool can generate fixation-to-AOI mapping suggestions based on a few-shot image classification model (IML-support version). We conduct an expert study with trained annotators (n = 3) to compare the baseline and IML-support versions. We measure the perceived usability, annotations’ validity and reliability, and efficiency during a data annotation task. We asked our participants to re-annotate data from a single individual using an existing dataset (n = 48). Further, we conducted a semi-structured interview to understand how participants used the provided IML features and assessed our design decisions. In a post hoc experiment, we investigate the performance of three image classification models in annotating data of the remaining 47 individuals. Full article
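The few-shot classification behind the IML-support version can be sketched, under assumptions, as a nearest-prototype classifier over fixation-crop embeddings: each AOI's prototype is the mean embedding of the few user-corrected examples, and new fixations are assigned to the closest prototype. The embedding source, AOI names, and dimensions below are placeholders, not eyeNotate's actual model.

```python
import numpy as np

def build_prototypes(embeddings, labels):
    """Mean embedding per AOI label from a handful of user-corrected examples."""
    return {aoi: embeddings[labels == aoi].mean(axis=0) for aoi in set(labels)}

def classify_fixation(embedding, prototypes):
    """Assign a fixation-crop embedding to the nearest AOI prototype (cosine)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(prototypes, key=lambda aoi: cos(embedding, prototypes[aoi]))

# Toy example: 512-d embeddings of fixation crops (stand-ins for a CNN encoder).
rng = np.random.default_rng(0)
support = rng.normal(size=(6, 512))
support_labels = np.array(["poster", "poster", "model", "model", "table", "table"])
protos = build_prototypes(support, support_labels)
print(classify_fixation(rng.normal(size=512), protos))
```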

26 pages, 2873 KiB  
Article
Interactive Content Retrieval in Egocentric Videos Based on Vague Semantic Queries
by Linda Ablaoui, Wilson Estecio Marcilio-Jr, Lai Xing Ng, Christophe Jouffrais and Christophe Hurter
Multimodal Technol. Interact. 2025, 9(7), 66; https://doi.org/10.3390/mti9070066 - 30 Jun 2025
Viewed by 612
Abstract
Retrieving specific, often instantaneous, content from hours-long egocentric video footage based on hazily remembered details is challenging. Vision–language models (VLMs) have been employed to enable zero-shot, text-based content retrieval from videos. However, they fall short if the textual query contains ambiguous terms or users fail to specify their queries enough, leading to vague semantic queries. Such queries can refer to several different video moments, not all of which are relevant, making it harder to pinpoint content. We investigate the requirements for an egocentric video content retrieval framework that helps users handle vague queries. First, we narrow down vague query formulation factors and limit them to ambiguity and incompleteness. Second, we propose a zero-shot, user-centered video content retrieval framework that leverages a VLM to provide video data and query representations that users can incrementally combine to refine queries. Third, we compare our proposed framework to a baseline video player and analyze user strategies for answering vague video content retrieval scenarios in an experimental study. We report that both frameworks perform similarly, that users favor our proposed framework, and that, in terms of navigation strategies, users value classic interactions when initiating their search and rely on the abstract semantic video representation to refine their resulting moments. Full article
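A minimal sketch of the zero-shot retrieval step assumed above: frames and textual queries are embedded in a shared space by a VLM, a vague query is incrementally combined with its refinements, and frames are ranked by cosine similarity. The embedding dimension and the simple averaging rule are illustrative assumptions, not the paper's framework.

```python
import numpy as np

def rank_frames(frame_embs, query_embs):
    """Rank video frames by similarity to the (possibly refined) query.

    frame_embs: (T, D) frame embeddings from a VLM image encoder.
    query_embs: list of (D,) text embeddings; a vague query plus its refinements.
    """
    q = np.mean(query_embs, axis=0)                    # combine query and refinements
    q = q / np.linalg.norm(q)
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    scores = f @ q                                     # cosine similarity per frame
    return np.argsort(-scores), scores                 # best-matching frames first

# Toy usage with random embeddings standing in for CLIP-style features.
rng = np.random.default_rng(1)
order, scores = rank_frames(rng.normal(size=(500, 512)),
                            [rng.normal(size=512), rng.normal(size=512)])
print(order[:5])
```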

15 pages, 1990 KiB  
Article
Watermark and Trademark Prompts Boost Video Action Recognition in Visual-Language Models
by Longbin Jin, Hyuntaek Jung, Hyo Jin Jon and Eun Yi Kim
Mathematics 2025, 13(9), 1365; https://doi.org/10.3390/math13091365 - 22 Apr 2025
Viewed by 700
Abstract
Large-scale Visual-Language Models have demonstrated powerful adaptability in video recognition tasks. However, existing methods typically rely on fine-tuning or text prompt tuning. In this paper, we propose a visual-only prompting method that employs watermark and trademark prompts to bridge the distribution gap of spatial-temporal video data with Visual-Language Models. Our watermark prompts, designed by a trainable prompt generator, are customized for each video clip. Unlike conventional visual prompts that often exhibit noise signals, watermark prompts are intentionally designed to be imperceptible, ensuring they are not misinterpreted as an adversarial attack. The trademark prompts, bespoke for each video domain, establish the identity of specific video types. Integrating watermark prompts into video frames and prepending trademark prompts to per-frame embeddings significantly boosts the capability of the Visual-Language Model to understand video. Notably, our approach improves the adaptability of the CLIP model to various video action recognition datasets, achieving performance gains of 16.8%, 18.4%, and 13.8% on HMDB-51, UCF-101, and the egocentric dataset EPIC-Kitchen-100, respectively. Additionally, our visual-only prompting method demonstrates competitive performance compared with existing fine-tuning and adaptation methods while requiring fewer learnable parameters. Moreover, through extensive ablation studies, we find the optimal balance between imperceptibility and adaptability. Code will be made available. Full article
(This article belongs to the Special Issue Artificial Intelligence: Deep Learning and Computer Vision)
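A minimal PyTorch sketch of the two prompt types described above, under stated assumptions: a generator produces a small additive watermark that is kept imperceptible via a tanh and a fixed amplitude, and learnable trademark tokens are prepended to per-frame embeddings from a frozen CLIP-like encoder. The layer sizes, the 0.01 amplitude, and the stand-in encoder are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class PromptedVideoEncoder(nn.Module):
    """Visual-only prompting: per-clip watermark + per-domain trademark tokens."""

    def __init__(self, frame_encoder, embed_dim=512, n_trademark=4, amplitude=0.01):
        super().__init__()
        self.frame_encoder = frame_encoder              # stand-in for a CLIP image encoder
        for p in frame_encoder.parameters():            # keep the backbone frozen
            p.requires_grad_(False)
        # Generator that maps a clip to a small additive perturbation.
        self.watermark_gen = nn.Sequential(
            nn.AdaptiveAvgPool3d((1, 8, 8)), nn.Flatten(),
            nn.Linear(3 * 8 * 8, 3 * 224 * 224))
        self.amplitude = amplitude
        # Domain-specific learnable tokens prepended to per-frame embeddings.
        self.trademark = nn.Parameter(torch.randn(n_trademark, embed_dim) * 0.02)

    def forward(self, clip):                            # clip: (T, 3, 224, 224)
        wm = torch.tanh(self.watermark_gen(clip.permute(1, 0, 2, 3).unsqueeze(0)))
        wm = self.amplitude * wm.reshape(1, 3, 224, 224)   # imperceptible watermark
        frames = clip + wm                                  # shared across the clip here
        emb = self.frame_encoder(frames)                    # (T, embed_dim)
        return torch.cat([self.trademark, emb], dim=0)      # prepend trademark tokens

# Toy usage with a random projection standing in for the frozen encoder.
enc = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
model = PromptedVideoEncoder(enc)
tokens = model(torch.randn(8, 3, 224, 224))                 # (4 + 8, 512)
```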

39 pages, 1298 KiB  
Systematic Review
Vision-Based Collision Warning Systems with Deep Learning: A Systematic Review
by Charith Chitraranjan, Vipooshan Vipulananthan and Thuvarakan Sritharan
J. Imaging 2025, 11(2), 64; https://doi.org/10.3390/jimaging11020064 - 17 Feb 2025
Cited by 1 | Viewed by 1412
Abstract
Timely prediction of collisions enables advanced driver assistance systems to issue warnings and initiate emergency maneuvers as needed to avoid collisions. With recent developments in computer vision and deep learning, collision warning systems that use vision as the only sensory input have emerged. They are less expensive than those that use multiple sensors, but their effectiveness must be thoroughly assessed. We systematically searched the academic literature for studies proposing ego-centric, vision-based collision warning systems that use deep learning techniques. Thirty-one studies among the search results satisfied our inclusion criteria. Risk of bias was assessed with PROBAST. We reviewed the selected studies and answered three primary questions: (1) Which deep learning techniques are used, and how are they used? (2) Which datasets and experiments are used for evaluation? (3) What results are achieved? We identified two main categories of methods: those that use deep learning models to directly predict the probability of a future collision from input video, and those that use deep learning models at one or more stages of a pipeline to compute a threat metric before predicting collisions. More importantly, we show that the experimental evaluation of most systems is inadequate, due either to not performing quantitative experiments or to various biases present in the datasets used. Lack of suitable datasets is a major challenge to the evaluation of these systems, and we suggest future work to address this issue. Full article
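The second category identified above (pipelines that compute a threat metric before deciding to warn) can be illustrated with a simple time-to-contact estimate from the expansion rate of a detected object's bounding box in the egocentric view; the scale-based formula and the 2 s warning threshold are generic textbook choices, not taken from any reviewed system.

```python
def time_to_contact(prev_width, curr_width, dt):
    """Estimate time-to-contact (seconds) from bounding-box expansion.

    For an object approaching at roughly constant speed, the image width w grows
    such that TTC ~= w / (dw/dt). Returns None if the box is not expanding.
    """
    expansion_rate = (curr_width - prev_width) / dt
    if expansion_rate <= 0:
        return None
    return curr_width / expansion_rate

def should_warn(prev_width, curr_width, dt, threshold_s=2.0):
    """Issue a collision warning when the estimated TTC drops below a threshold."""
    ttc = time_to_contact(prev_width, curr_width, dt)
    return ttc is not None and ttc < threshold_s

# Example: a lead vehicle's box grows from 80 px to 90 px between two frames (30 fps).
print(time_to_contact(80.0, 90.0, 1 / 30))   # ~0.3 s
print(should_warn(80.0, 90.0, 1 / 30))       # True
```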

16 pages, 3440 KiB  
Article
Towards Automatic Object Detection and Activity Recognition in Indoor Climbing
by Hana Vrzáková, Jani Koskinen, Sami Andberg, Ahreum Lee and Mary Jean Amon
Sensors 2024, 24(19), 6479; https://doi.org/10.3390/s24196479 - 8 Oct 2024
Cited by 2 | Viewed by 2082
Abstract
Rock climbing has evolved from a niche sport into a mainstream free-time activity and an Olympic sport. Moreover, climbing can be studied as an example of a high-stakes perception-action task. However, understanding what constitutes an expert climber is not simple or straightforward. As a dynamic and high-risk activity, climbing requires a precise interplay between cognition, perception, and action execution. While prior research has predominantly focused on the movement aspect of climbing (i.e., skeletal posture and individual limb movements), recent studies have also examined the climber’s visual attention and its links to their performance. Associating the climber’s attention with their actions, however, has traditionally required frame-by-frame manual coding of the recorded eye-tracking videos. To overcome this challenge and automatically contextualize the analysis of eye movements in indoor climbing, we present deep learning-driven (YOLOv5) hold detection that facilitates automatic grasp recognition. To demonstrate the framework, we examined the expert climber’s eye movements and egocentric perspective acquired from eye-tracking glasses (SMI and Tobii Glasses 2). Using the framework, we observed that the expert climber’s grasping duration was positively correlated with total fixation duration (r = 0.807) and fixation count (r = 0.864); however, it was negatively correlated with the fixation rate (r = −0.402) and saccade rate (r = −0.344). The findings indicate the moments of cognitive processing and visual search that occurred during decision making and route prospecting. Our work contributes to research on eye–body performance and coordination in high-stakes contexts, informs sport science, and expands its applications, e.g., in training optimization, injury prevention, and coaching. Full article
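A minimal sketch of how hold detection can contextualize gaze, assuming a YOLOv5 model fine-tuned on climbing holds (the 'holds.pt' checkpoint below is hypothetical): detected hold boxes are hit-tested against the gaze point from the eye-tracking glasses, so fixations can later be aggregated per hold and correlated with grasp durations.

```python
import torch

# Load a YOLOv5 detector via torch.hub; 'holds.pt' is a hypothetical checkpoint
# fine-tuned on climbing-hold images (the hub call downloads the YOLOv5 code).
model = torch.hub.load("ultralytics/yolov5", "custom", path="holds.pt")

def holds_under_gaze(frame, gaze_xy):
    """Return detected hold boxes that contain the current gaze point.

    frame:   HxWx3 RGB image from the eye tracker's scene camera.
    gaze_xy: (x, y) gaze position in the same pixel coordinates.
    """
    det = model(frame).xyxy[0]          # (N, 6): x1, y1, x2, y2, confidence, class
    gx, gy = gaze_xy
    hits = []
    for x1, y1, x2, y2, conf, cls in det.tolist():
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            hits.append((x1, y1, x2, y2, conf))
    return hits
```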

48 pages, 5649 KiB  
Article
Multimodal Dictionaries for Traditional Craft Education
by Xenophon Zabulis, Nikolaos Partarakis, Valentina Bartalesi, Nicolo Pratelli, Carlo Meghini, Arnaud Dubois, Ines Moreno and Sotiris Manitsaris
Multimodal Technol. Interact. 2024, 8(7), 63; https://doi.org/10.3390/mti8070063 - 18 Jul 2024
Cited by 1 | Viewed by 2095
Abstract
We address the problem of systematizing the authoring of digital dictionaries for craft education from ethnographic studies and recordings. First, we present guidelines for the collection of ethnographic data using digital audio and video and identify terms that are central in the description of crafting actions, products, tools, and materials. Second, we present a classification scheme for craft terms and a way to semantically annotate them, using a multilingual and hierarchical thesaurus, which provides term definitions and a semantic hierarchy of these terms. Third, we link ethnographic resources and open-access data to the identified terms using an online platform for the representation of traditional crafts, associating their definition with illustrations, examples of use, and 3D models. We validate the efficacy of the approach by creating multimedia vocabularies for an online eLearning platform for introductory courses to nine traditional crafts. Full article
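One way to picture the resulting dictionary entries is as small records linking a craft term to its classification, thesaurus concept, definition, and media resources. The field names and values below are an illustrative assumption, not the platform's actual schema.

```python
# Hypothetical structure of one multimodal dictionary entry for a craft term.
entry = {
    "term": "warp",
    "language": "en",
    "category": "material/structure",            # class from the classification scheme
    "thesaurus_concept": "weaving:warp",         # hypothetical hierarchical-thesaurus ID
    "definition": "The set of lengthwise yarns held in tension on a loom.",
    "illustrations": ["img/warp_setup.jpg"],     # placeholder media paths
    "usage_examples": ["The weaver dresses the loom by winding the warp."],
    "model_3d": "models/loom_warp.glb",
    "related_terms": ["weft", "loom", "heddle"],
}
```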

16 pages, 20538 KiB  
Article
Exploring the Role of Video Playback Visual Cues in Object Retrieval Tasks
by Yechang Qin, Jianchun Su, Haozhao Qin and Yang Tian
Sensors 2024, 24(10), 3147; https://doi.org/10.3390/s24103147 - 15 May 2024
Viewed by 1494
Abstract
Searching for objects is a common task in daily life and work. For augmented reality (AR) devices without spatial perception systems, the image of the object’s last appearance serves as a common search aid. Compared to using only images as visual cues, videos capturing the process of object placement can provide procedural guidance, potentially enhancing users’ search efficiency. However, complete video playback capturing the entire object placement process as a visual cue can be excessively lengthy, requiring users to invest significant viewing time. To explore whether segmented or accelerated video playback can still assist users in object retrieval tasks effectively, we conducted a user study. The results indicated that when video playback covered the span from the first appearance of the object’s destination to the object’s final appearance (referred to as the destination appearance, DA) and played at normal speed, search time and cognitive load were significantly reduced. Subsequently, we designed a second user study to evaluate the performance of video playback compared to image cues in object retrieval tasks. The results showed that combining the DA playback starting point with images of the object’s last appearance further reduced search time and cognitive load. Full article
(This article belongs to the Topic Theories and Applications of Human-Computer Interaction)

15 pages, 641 KiB  
Article
A Multi-Modal Egocentric Activity Recognition Approach towards Video Domain Generalization
by Antonios Papadakis and Evaggelos Spyrou
Sensors 2024, 24(8), 2491; https://doi.org/10.3390/s24082491 - 12 Apr 2024
Cited by 4 | Viewed by 2438
Abstract
Egocentric activity recognition is a prominent computer vision task that is based on the use of wearable cameras. Since egocentric videos are captured through the perspective of the person wearing the camera, her/his body motions severely complicate the video content, imposing several challenges. In this work, we propose a novel approach for domain-generalized egocentric human activity recognition. Typical approaches use a large amount of training data, aiming to cover all possible variants of each action. Moreover, several recent approaches have attempted to handle discrepancies between domains with a variety of costly and mostly unsupervised domain adaptation methods. In our approach, we show that, through simple manipulation of available source-domain data and with minor involvement from the target domain, we are able to produce robust models that adequately predict human activity in egocentric video sequences. To this end, we introduce a novel three-stream deep neural network architecture combining elements of vision transformers and residual neural networks, trained using multi-modal data. We evaluate the proposed approach using a challenging egocentric video dataset and demonstrate its superiority over recent state-of-the-art works. Full article
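A minimal PyTorch sketch of a three-stream design in the spirit described above: a vision-transformer stream for RGB, a residual-network stream for optical flow, and a small MLP for a third modality (audio features are assumed here), fused by concatenation before a classifier. The backbones, modality choices, and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, resnet18

class ThreeStreamNet(nn.Module):
    """RGB (ViT) + optical-flow (ResNet) + audio-feature streams, fused late."""

    def __init__(self, n_classes=8, audio_dim=128):
        super().__init__()
        self.rgb = vit_b_16(weights=None)
        self.rgb.heads = nn.Identity()                       # 768-d RGB features
        self.flow = resnet18(weights=None)
        self.flow.fc = nn.Identity()                         # 512-d flow features
        self.audio = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(),
                                   nn.Linear(256, 256))
        self.classifier = nn.Linear(768 + 512 + 256, n_classes)

    def forward(self, rgb, flow, audio):
        feats = torch.cat([self.rgb(rgb), self.flow(flow), self.audio(audio)], dim=1)
        return self.classifier(feats)

# Toy usage: a batch of 2 clips reduced to single representative frames/flow images.
net = ThreeStreamNet()
logits = net(torch.randn(2, 3, 224, 224),    # RGB frame
             torch.randn(2, 3, 224, 224),    # optical flow rendered as 3 channels
             torch.randn(2, 128))            # audio feature vector
```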

20 pages, 2103 KiB  
Article
Fusion of Appearance and Motion Features for Daily Activity Recognition from Egocentric Perspective
by Mohd Haris Lye, Nouar AlDahoul and Hezerul Abdul Karim
Sensors 2023, 23(15), 6804; https://doi.org/10.3390/s23156804 - 30 Jul 2023
Cited by 2 | Viewed by 1383
Abstract
Videos from a first-person or egocentric perspective offer a promising tool for recognizing various activities related to daily living. In the egocentric perspective, the video is obtained from a wearable camera, which enables the person’s activities to be captured from a consistent viewpoint. Recognition of activity using a wearable sensor is challenging for various reasons, such as motion blur and large variations. Existing methods are based on extracting handcrafted features from video frames to represent the content. These features are domain-dependent: features that suit a specific dataset may not suit others. In this paper, we propose a novel solution to recognize daily living activities from a pre-segmented video clip. The pre-trained convolutional neural network (CNN) model VGG16 is used to extract visual features from sampled video frames, which are then aggregated by the proposed pooling scheme. The proposed solution combines appearance and motion features extracted from video frames and optical flow images, respectively. The methods of mean and max spatial pooling (MMSP) and max mean temporal pyramid (TPMM) pooling are proposed to compose the final video descriptor. The resulting descriptor is fed to a linear support vector machine (SVM) to recognize the type of activity observed in the video clip. The evaluation of the proposed solution was performed on three public benchmark datasets. We performed studies to show the advantage of aggregating appearance and motion features for daily activity recognition. The results show that the proposed solution is promising for recognizing activities of daily living. Compared to several methods on three public datasets, the proposed MMSP–TPMM method produces higher classification performance in terms of accuracy (90.38% on the LENA dataset, 75.37% on the ADL dataset, 96.08% on the FPPA dataset) and average per-class precision (AP) (58.42% on the ADL dataset and 96.11% on the FPPA dataset). Full article
(This article belongs to the Special Issue Applications of Body Worn Sensors and Wearables)
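A minimal sketch of the appearance branch described above: VGG16 features are extracted from sampled frames, aggregated over the clip with mean and max pooling (a simplified stand-in for the proposed MMSP/TPMM schemes), and classified with a linear SVM. The frame counts and toy data are illustrative only.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import vgg16
from sklearn.svm import LinearSVC

# VGG16 backbone as a frame-level feature extractor (4096-d fc7 features).
backbone = vgg16(weights=None)       # use pretrained weights in practice
backbone.classifier = nn.Sequential(*list(backbone.classifier.children())[:-1])
backbone.eval()

def clip_descriptor(frames):
    """Aggregate per-frame VGG16 features of one clip by mean and max pooling."""
    with torch.no_grad():
        feats = backbone(frames)                 # (T, 4096)
    return torch.cat([feats.mean(dim=0), feats.max(dim=0).values]).numpy()

# Toy training data: 6 clips of 4 frames each, two activity classes.
X = np.stack([clip_descriptor(torch.randn(4, 3, 224, 224)) for _ in range(6)])
y = np.array([0, 0, 0, 1, 1, 1])
clf = LinearSVC().fit(X, y)
print(clf.predict(X[:2]))
```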

24 pages, 14566 KiB  
Article
YOLO Series for Human Hand Action Detection and Classification from Egocentric Videos
by Hung-Cuong Nguyen, Thi-Hao Nguyen, Rafał Scherer and Van-Hung Le
Sensors 2023, 23(6), 3255; https://doi.org/10.3390/s23063255 - 20 Mar 2023
Cited by 21 | Viewed by 8110
Abstract
Hand detection and classification is a very important pre-processing step in building applications based on three-dimensional (3D) hand pose estimation and hand activity recognition. To automatically limit the hand data area on egocentric vision (EV) datasets, and especially to trace the development and performance of the “You Only Look Once” (YOLO) network over the past seven years, we propose a study comparing the efficiency of hand detection and classification based on the YOLO-family networks. This study addresses the following problems: (1) systematizing all architectures, advantages, and disadvantages of YOLO-family networks from version (v)1 to v7; (2) preparing ground-truth data for pre-trained models and evaluation models of hand detection and classification on EV datasets (FPHAB, HOI4D, RehabHand); (3) fine-tuning hand detection and classification models based on the YOLO-family networks and evaluating hand detection and classification on the EV datasets. Hand detection and classification results on the YOLOv7 network and its variants were the best across all three datasets. The results of the YOLOv7-w6 network are as follows: FPHAB: P = 97% at an IoU threshold of 0.5; HOI4D: P = 95% at an IoU threshold of 0.5; RehabHand: P > 95% at an IoU threshold of 0.5. The processing speed of YOLOv7-w6 is 60 fps at a resolution of 1280 × 1280 pixels, and that of YOLOv7 is 133 fps at a resolution of 640 × 640 pixels. Full article
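The precision figures at an IoU threshold of 0.5 quoted above amount to counting a detected hand box as correct when it overlaps an unmatched ground-truth box with IoU ≥ 0.5; a minimal sketch of that computation (using a simple greedy matching, a common simplification) is below.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def precision_at_iou(detections, ground_truth, thresh=0.5):
    """Fraction of detected boxes matching an unused ground-truth box at IoU >= thresh."""
    unused = list(ground_truth)
    tp = 0
    for det in detections:
        best = max(unused, key=lambda gt: iou(det, gt), default=None)
        if best is not None and iou(det, best) >= thresh:
            tp += 1
            unused.remove(best)
    return tp / max(len(detections), 1)

print(precision_at_iou([(10, 10, 50, 50), (200, 200, 240, 240)],
                       [(12, 8, 52, 48)]))      # 0.5: one true positive, one false positive
```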

17 pages, 6288 KiB  
Article
Surgical Tool Detection in Open Surgery Videos
by Ryo Fujii, Ryo Hachiuma, Hiroki Kajita and Hideo Saito
Appl. Sci. 2022, 12(20), 10473; https://doi.org/10.3390/app122010473 - 17 Oct 2022
Cited by 9 | Viewed by 5795
Abstract
Detecting surgical tools is an essential task for analyzing and evaluating surgical videos. However, most studies focus on minimally invasive surgery (MIS) and cataract surgery. Mainly because of the lack of a large, diverse, and well-annotated dataset, research in the area of open surgery has been limited so far. Open surgery video analysis is challenging because of its properties: a varied number and varied roles of people (e.g., main surgeon, assistant surgeons, and nurses), complex interactions of tools and hands, and various operative environments and lighting conditions. In this paper, to handle these limitations and difficulties, we introduce an egocentric open surgery dataset that includes 15 open surgeries recorded with a head-mounted camera. More than 67k bounding boxes covering 31 surgical tool categories are labeled in 19k images. Finally, we present a surgical tool detection baseline model based on recent advances in object detection. The results on our new dataset show that it provides interesting challenges for future methods and can serve as a strong benchmark for the study of tool detection in open surgery. Full article
(This article belongs to the Special Issue Novel Advances in Computer-Assisted Surgery)
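The paper only states that the baseline builds on recent advances in object detection, so as one plausible concrete instance, here is a sketch of a torchvision Faster R-CNN whose box head is resized to the 31 surgical-tool categories plus background; the choice of detector is an assumption.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def build_tool_detector(num_tool_classes=31):
    """Faster R-CNN baseline with a box head resized for surgical-tool categories."""
    # weights=None keeps the sketch offline; use pretrained weights in practice.
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    # +1 for the background class expected by torchvision detection heads.
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_tool_classes + 1)
    return model

model = build_tool_detector()
```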

21 pages, 1481 KiB  
Article
Automatic Visual Attention Detection for Mobile Eye Tracking Using Pre-Trained Computer Vision Models and Human Gaze
by Michael Barz and Daniel Sonntag
Sensors 2021, 21(12), 4143; https://doi.org/10.3390/s21124143 - 16 Jun 2021
Cited by 26 | Viewed by 9313
Abstract
Processing visual stimuli in a scene is essential for the human brain to make situation-aware decisions. These stimuli, which are prevalent subjects of diagnostic eye tracking studies, are commonly encoded as rectangular areas of interest (AOIs) per frame. Because it is a tedious manual annotation task, the automatic detection and annotation of visual attention to AOIs can accelerate and objectify eye tracking research, in particular for mobile eye tracking with egocentric video feeds. In this work, we implement two methods to automatically detect visual attention to AOIs using pre-trained deep learning models for image classification and object detection. Furthermore, we develop an evaluation framework based on the VISUS dataset and well-known performance metrics from the field of activity recognition. We systematically evaluate our methods within this framework, discuss potentials and limitations, and propose ways to improve the performance of future automatic visual attention detection methods. Full article
(This article belongs to the Special Issue Wearable Technologies and Applications for Eye Tracking)
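The object-detection variant described above can be sketched as hit-testing each fixation's gaze point against the detected objects' bounding boxes, yielding per-AOI attention events. The generic (box, label, score) detection format and the 0.5 score threshold are assumptions for illustration.

```python
def attention_events(fixations, detections_per_frame, min_score=0.5):
    """Map fixations to AOIs via detected object boxes containing the gaze point.

    fixations: list of (frame_index, gaze_x, gaze_y, duration_ms).
    detections_per_frame: dict frame_index -> list of ((x1, y1, x2, y2), label, score).
    Returns (label, duration_ms) pairs; fixations on no object map to 'background'.
    """
    events = []
    for frame, gx, gy, duration in fixations:
        label = "background"
        for (x1, y1, x2, y2), name, score in detections_per_frame.get(frame, []):
            if score >= min_score and x1 <= gx <= x2 and y1 <= gy <= y2:
                label = name
                break
        events.append((label, duration))
    return events

# Toy example: one fixation landing on a detected 'screen' AOI.
dets = {0: [((100, 50, 400, 300), "screen", 0.9)]}
print(attention_events([(0, 220, 140, 310)], dets))   # [('screen', 310)]
```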

19 pages, 4192 KiB  
Article
STAC: Spatial-Temporal Attention on Compensation Information for Activity Recognition in FPV
by Yue Zhang, Shengli Sun, Linjian Lei, Huikai Liu and Hui Xie
Sensors 2021, 21(4), 1106; https://doi.org/10.3390/s21041106 - 5 Feb 2021
Cited by 7 | Viewed by 2308
Abstract
Egocentric activity recognition in first-person video (FPV) requires fine-grained matching of the camera wearer’s actions and the objects being operated. The traditional methods used for third-person action recognition do not suffice because of (1) the background ego-noise introduced by the unstructured movement of the wearable device caused by body motion and (2) the small, fine-grained objects that appear at a single scale in FPV. Size compensation is performed to augment the data: it generates a multi-scale set of regions, including multi-size objects, leading to superior performance. We compensate the optical flow to eliminate the camera noise in motion. We developed a novel two-stream convolutional neural network–recurrent attention neural network (CNN-RAN) architecture, spatial-temporal attention on compensation information (STAC), which is able to generate generic descriptors under weak supervision, focus on the locations of activated objects, and capture effective motion. We encode the RGB features using a spatial location-aware attention mechanism to guide the representation of visual features. A similar location-aware channel attention is applied to the temporal stream in the form of stacked optical flow to implicitly select the relevant frames and pay attention to where the action occurs. The two streams are complementary, since one is object-centric and the other focuses on the motion. We conducted extensive ablation analysis to validate the complementarity and effectiveness of our STAC model qualitatively and quantitatively. It achieved state-of-the-art performance on two egocentric datasets. Full article
(This article belongs to the Section Sensing and Imaging)
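The flow-compensation step mentioned above can be illustrated by removing the dominant ego-motion component from a dense optical-flow field so that the residual flow highlights the manipulated object rather than camera motion; using the median flow as the camera-motion estimate is a simplification, not the paper's exact procedure.

```python
import numpy as np

def compensate_flow(flow):
    """Subtract the dominant (camera) motion from a dense optical-flow field.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns the residual flow, where ego-motion is approximated by the median
    displacement over the whole frame.
    """
    ego_motion = np.median(flow.reshape(-1, 2), axis=0)      # robust global estimate
    return flow - ego_motion                                  # residual object motion

# Toy example: uniform camera pan of (5, 0) plus a small moving region.
flow = np.tile(np.array([5.0, 0.0]), (120, 160, 1))
flow[40:60, 70:90] += np.array([0.0, 3.0])                    # hand/object motion
residual = compensate_flow(flow)
print(np.abs(residual).max(axis=(0, 1)))                      # ~[0, 3]
```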

26 pages, 4408 KiB  
Article
Multi-View Hand-Hygiene Recognition for Food Safety
by Chengzhang Zhong, Amy R. Reibman, Hansel A. Mina and Amanda J. Deering
J. Imaging 2020, 6(11), 120; https://doi.org/10.3390/jimaging6110120 - 7 Nov 2020
Cited by 11 | Viewed by 8719
Abstract
A majority of foodborne illnesses result from inappropriate food handling practices. One proven practice to reduce pathogens is to perform effective hand-hygiene before all stages of food handling. In this paper, we design a multi-camera system that uses video analytics to recognize hand-hygiene actions, with the goal of improving hand-hygiene effectiveness. Our proposed two-stage system processes untrimmed video from both egocentric and third-person cameras. In the first stage, a low-cost coarse classifier efficiently localizes the hand-hygiene period; in the second stage, more complex refinement classifiers recognize seven specific actions within the hand-hygiene period. We demonstrate that our two-stage system has significantly lower computational requirements without a loss of recognition accuracy. Specifically, the computationally complex refinement classifiers process less than 68% of the untrimmed videos, and we anticipate further computational gains in videos that contain a larger fraction of non-hygiene actions. Our results demonstrate that a carefully designed video action recognition system can play an important role in improving hand hygiene for food safety. Full article
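A minimal sketch of the two-stage idea described above: a cheap coarse classifier scans every window of the untrimmed video to localize the hand-hygiene period, and the more expensive refinement classifier runs only inside that period. The classifiers are passed in as callables; their implementations, the window size, and the seven action labels are outside this sketch.

```python
def two_stage_recognition(windows, coarse_clf, refine_clf):
    """Run a cheap coarse classifier everywhere, a costly refinement only where needed.

    windows:    list of short video segments from the untrimmed recording.
    coarse_clf: callable(segment) -> True if the segment looks like hand-hygiene.
    refine_clf: callable(segment) -> one of the specific hand-hygiene action labels.
    Returns (window_index, action_label) pairs for the localized hygiene period.
    """
    hygiene_idx = [i for i, seg in enumerate(windows) if coarse_clf(seg)]
    # Only the localized fraction of the video reaches the expensive classifier.
    return [(i, refine_clf(windows[i])) for i in hygiene_idx]

# Toy usage with stand-in classifiers over dummy segments.
windows = ["walk", "rub_palms", "rinse", "dry_hands", "walk"]
actions = two_stage_recognition(
    windows,
    coarse_clf=lambda seg: seg != "walk",
    refine_clf=lambda seg: seg)
print(actions)   # [(1, 'rub_palms'), (2, 'rinse'), (3, 'dry_hands')]
```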