Search Results (24)

Search Parameters:
Keywords = egocentric vision

22 pages, 5377 KB  
Article
Hand–Object Pose Estimation Based on Anchor Regression from a Single Egocentric Depth Image
by Jingang Lin, Dongnian Li, Chengjun Chen and Zhengxu Zhao
Sensors 2025, 25(22), 6881; https://doi.org/10.3390/s25226881 - 11 Nov 2025
Viewed by 794
Abstract
To precisely understand the interaction behaviors of humans, a computer vision system needs to accurately acquire the poses of the hand and its manipulated object. Vision-based hand–object pose estimation has become an important research topic. However, it is still a challenging task due to severe occlusion. In this study, a hand–object pose estimation method based on anchor regression is proposed to address this problem. First, a hand–object 3D center detection method was established to extract hand–object foreground images from the original depth images. Second, a method based on anchor regression is proposed to simultaneously estimate the poses of the hand and object in a single framework. A convolutional neural network with ResNet-50 as the backbone was built to predict the position deviations and weights of the uniformly distributed anchor points in the image to the keypoints of the hand and the manipulated object. According to the experimental results on the FPHA-HO dataset, the mean keypoint errors of the hand and object of the proposed method were 11.85 mm and 18.97 mm, respectively. The proposed hand–object pose estimation method can accurately estimate the poses of the hand and the manipulated object based on a single egocentric depth image. Full article
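
As a rough illustration of the anchor-based aggregation described above (not the authors' implementation), the sketch below combines per-anchor offset and weight predictions into keypoint estimates; the anchor grid, tensor shapes, and softmax weighting are assumptions.

```python
# Minimal sketch of anchor-based keypoint regression: every anchor votes for
# every keypoint via a predicted offset, and votes are combined with predicted
# weights. ResNet-50 head details and grid stride are not reproduced here.
import torch

def keypoints_from_anchors(anchors, offsets, weights):
    """
    anchors: (A, 2) anchor positions in image coordinates
    offsets: (A, K, 2) predicted deviation from each anchor to each keypoint
    weights: (A, K) unnormalized confidence of each anchor per keypoint
    returns: (K, 2) keypoint estimates as weighted averages of anchor votes
    """
    w = torch.softmax(weights, dim=0)          # normalize over anchors
    votes = anchors[:, None, :] + offsets      # (A, K, 2) per-anchor votes
    return (w[..., None] * votes).sum(dim=0)   # weighted average per keypoint

# Toy usage: 4 anchors voting for 21 hand keypoints + 8 object corners
A, K = 4, 29
kps = keypoints_from_anchors(torch.rand(A, 2) * 224,
                             torch.randn(A, K, 2),
                             torch.randn(A, K))
print(kps.shape)  # torch.Size([29, 2])
```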

24 pages, 18326 KB  
Article
A Human Intention and Motion Prediction Framework for Applications in Human-Centric Digital Twins
by Usman Asad, Azfar Khalid, Waqas Akbar Lughmani, Shummaila Rasheed and Muhammad Mahabat Khan
Biomimetics 2025, 10(10), 656; https://doi.org/10.3390/biomimetics10100656 - 1 Oct 2025
Viewed by 1713
Abstract
In manufacturing settings where humans and machines collaborate, understanding and predicting human intention is crucial for enabling the seamless execution of tasks. This knowledge is the basis for creating an intelligent, symbiotic, and collaborative environment. However, current foundation models often fall short in directly anticipating complex tasks and producing contextually appropriate motion. This paper proposes a modular framework that investigates strategies for structuring task knowledge and engineering context-rich prompts to guide Vision–Language Models in understanding and predicting human intention in semi-structured environments. Our evaluation, conducted across three use cases of varying complexity, reveals a critical tradeoff between prediction accuracy and latency. We demonstrate that a Rolling Context Window strategy, which uses a history of frames and the previously predicted state, achieves a strong balance of performance and efficiency. This approach significantly outperforms single-image inputs and computationally expensive in-context learning methods. Furthermore, incorporating egocentric video views yields a substantial 10.7% performance increase in complex tasks. For short-term motion forecasting, we show that the accuracy of joint position estimates is enhanced by using historical pose, gaze data, and in-context examples. Full article
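
A minimal sketch of the Rolling Context Window idea, assuming a generic multimodal prompt payload; the field names, window size, and message layout are placeholders, not the paper's prompt engineering.

```python
# Sketch of a Rolling Context Window prompt builder for a vision-language model:
# keep the last few egocentric frames plus the previously predicted state and
# reassemble the prompt at every step. All structure here is an assumption.
from collections import deque

WINDOW = 4  # number of recent frames kept in context (illustrative)

class RollingContext:
    def __init__(self, task_description: str):
        self.frames = deque(maxlen=WINDOW)    # most recent egocentric frames
        self.last_state = "task not started"  # previously predicted state
        self.task_description = task_description

    def update(self, frame_b64: str, predicted_state: str | None = None):
        self.frames.append(frame_b64)
        if predicted_state is not None:
            self.last_state = predicted_state

    def build_prompt(self) -> dict:
        """Assemble a generic multimodal request: frames + prior state + task context."""
        return {
            "system": "You predict the operator's next intention in an assembly task.",
            "context": {
                "task": self.task_description,
                "previous_state": self.last_state,
            },
            "images": list(self.frames),  # oldest to newest
            "question": "What is the operator's current step and most likely next action?",
        }

ctx = RollingContext("Insert bearing, fasten four bolts, verify torque.")
ctx.update("<frame_0_base64>")
ctx.update("<frame_1_base64>", predicted_state="picking up bearing")
print(ctx.build_prompt()["context"])
```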

18 pages, 4356 KB  
Article
Tacit Sustainability in the Countryside: Cultural and Ecological Layers of Lithuanian Heritage Homestead
by Indraja Raudonikyte and Indre Grazuleviciute-Vileniske
Land 2025, 14(9), 1910; https://doi.org/10.3390/land14091910 - 19 Sep 2025
Viewed by 746
Abstract
This research is an in-depth qualitative case study of a historic homestead in the town of Čekiškė, located in Lithuania, through the lens of sustainability aesthetics and cultural ecology. The research addresses a gap in the literature where aesthetic expressions of sustainability are predominantly examined in urban settings, while rural hybrid environments, intertwining urban and traditional features, remain underexplored. The homestead, with architectural and landscape features dating back to the early 20th century, was studied across four temporal stages: the interwar period (1922–1946), the early Soviet period (1946–1976), late Soviet to post-independence (1976–2021), and the period of a proposed vision for its sustainable development (2025 and beyond). The theoretical framework developed and applied in this research combines four complementary approaches: (1) the cultural ecology model by J. Steward, (2) environmental ethics trends (egocentrism, homocentrism, biocentrism, ecocentrism), (3) the principles of biophilic design, and (4) the ecological aesthetics framework by M. DeKay. Data collection included continuous qualitative in-depth on-site observations, analysis of the relevant literature sources, archival documents and photographs, and the recording of information in photographs and drawings. The findings reveal nuanced and evolving aesthetic expressions of sustainability tied to cultural practices, land use, ownership attitudes, and environmental perception. While earlier periods of development of the analyzed homestead reflected utilitarian and homocentric relations with the environment, later stages showed increased detachment from ecological sensitivity, resulting in the degradation of both material and intangible heritage; future perspectives of the homestead being transformed into a private museum, actualizing heritage through sustainability aesthetics, were also presented. The study highlights the role of tacit knowledge and lived experience in shaping hybrid sustainable aesthetics and provides insights for design and landscape planning in rural and small town heritage contexts. The research reveals that sustainability aesthetics in rural hybrid spaces is shaped by a confluence of environmental adaptation, socio-cultural transitions, and embedded values. It argues for a more context-sensitive and historically aware approach to sustainability discourse, particularly in heritage conservation and rural development. Full article
(This article belongs to the Section Land Planning and Landscape Architecture)

18 pages, 2231 KB  
Article
VFGF: Virtual Frame-Augmented Guided Prediction Framework for Long-Term Egocentric Activity Forecasting
by Xiangdong Long, Shuqing Wang and Yong Chen
Sensors 2025, 25(18), 5644; https://doi.org/10.3390/s25185644 - 10 Sep 2025
Viewed by 1558
Abstract
Accurately predicting future activities in egocentric (first-person) videos is a challenging yet essential task, requiring robust object recognition and reliable forecasting of action patterns. However, the limited number of observable frames in such videos often lacks critical semantic context, making long-term predictions particularly difficult. Traditional approaches, especially those based on recurrent neural networks, tend to suffer from cumulative error propagation over extended time steps, leading to degraded performance. To address these challenges, this paper introduces a novel framework, Virtual Frame-Augmented Guided Forecasting (VFGF), designed specifically for long-term egocentric activity prediction. The VFGF framework enhances semantic continuity by generating and incorporating virtual frames into the observable sequence. These synthetic frames fill the temporal and contextual gaps caused by rapid changes in activity or environmental conditions. In addition, we propose a Feature Guidance Module that integrates anticipated activity-relevant features into the recursive prediction process, guiding the model toward more accurate and contextually coherent inferences. Extensive experiments on the EPIC-Kitchens dataset demonstrate that VFGF, with its interpolation-based temporal smoothing and feature-guided strategies, significantly improves long-term activity prediction accuracy. Specifically, VFGF achieves a state-of-the-art Top-5 accuracy of 44.11% at a 0.25 s prediction horizon. Moreover, it maintains competitive performance across a range of long-term forecasting intervals, highlighting its robustness and establishing a strong foundation for future research in egocentric activity prediction. Full article
(This article belongs to the Special Issue Computer Vision-Based Human Activity Recognition)
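
The core densification idea, sketched below with linear interpolation of frame features before a recurrent forecaster; the actual VFGF virtual-frame generator and Feature Guidance Module are richer, so treat this as an assumption-laden toy.

```python
# Minimal sketch of inserting "virtual frames" between observed frame features
# by linear interpolation, then encoding the densified sequence recursively.
import torch
import torch.nn as nn

def densify_with_virtual_frames(feats: torch.Tensor, n_virtual: int = 1) -> torch.Tensor:
    """feats: (T, D) observed frame features -> (T + (T-1)*n_virtual, D)."""
    out = [feats[0]]
    for t in range(1, feats.shape[0]):
        for i in range(1, n_virtual + 1):
            alpha = i / (n_virtual + 1)
            out.append((1 - alpha) * feats[t - 1] + alpha * feats[t])  # virtual frame
        out.append(feats[t])
    return torch.stack(out)

T, D = 8, 256
observed = torch.randn(T, D)
dense = densify_with_virtual_frames(observed, n_virtual=2)
gru = nn.GRU(D, 512, batch_first=True)
_, h = gru(dense.unsqueeze(0))   # recursive encoder over the densified sequence
print(dense.shape, h.shape)      # torch.Size([22, 256]) torch.Size([1, 1, 512])
```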

14 pages, 997 KB  
Article
Differential Performance of Children and Adults in a Vision-Deprived Maze Spatial Navigation Task and Exploration of the Impact of tDCS over the Right Posterior Parietal Cortex on Performance in Adults
by G. Nathzidy Rivera-Urbina, Noah M. Kemp, Michael A. Nitsche and Andrés Molero-Chamizo
Life 2025, 15(8), 1323; https://doi.org/10.3390/life15081323 - 20 Aug 2025
Viewed by 1464
Abstract
Spatial navigation involves the use of external (allocentric) and internal (egocentric) processing. These processes interact differentially depending on age. In order to explore the effectiveness of these interactions in different age groups (study 1), we compared the performance of children and adults in a two-session spatial maze task. This task was performed under deprived vision, thus preventing visual cues critical for allocentric processing. Number of correct performances and performance time were recorded as outcome measures. We recruited thirty healthy participants for the children (mean age 10.97 ± 0.55) and the adult (mean age 21.16 ± 1.76) groups, respectively. The results revealed a significantly higher number of correct actions and shorter performance times during maze solving in children compared to adults. These differences between children and adults might be due to developmental and cortical reorganization factors influencing egocentric processing. Assuming that activation of the posterior parietal cortex (PPC) facilitates egocentric spatial processing, we applied excitatory anodal tDCS over the right PPC in a second study with a different healthy adult group (N = 30, mean age 21.23 ± 2.01). Using the same spatial navigation task as in study 1, we evaluated possible performance improvements in adults associated with this neuromodulation method. Compared to a sham stimulation group, anodal tDCS over the right PPC did not significantly improve spatial task performance. Full article
(This article belongs to the Section Physiology and Pathology)

26 pages, 2873 KB  
Article
Interactive Content Retrieval in Egocentric Videos Based on Vague Semantic Queries
by Linda Ablaoui, Wilson Estecio Marcilio-Jr, Lai Xing Ng, Christophe Jouffrais and Christophe Hurter
Multimodal Technol. Interact. 2025, 9(7), 66; https://doi.org/10.3390/mti9070066 - 30 Jun 2025
Viewed by 2222
Abstract
Retrieving specific, often instantaneous, content from hours-long egocentric video footage based on hazily remembered details is challenging. Vision–language models (VLMs) have been employed to enable zero-shot text-based content retrieval from videos. However, they fall short when the textual query contains ambiguous terms or users fail to specify their queries enough, leading to vague semantic queries. Such queries can refer to several different video moments, not all of which can be relevant, making pinpointing content harder. We investigate the requirements for an egocentric video content retrieval framework that helps users handle vague queries. First, we narrow down vague query formulation factors and limit them to ambiguity and incompleteness. Second, we propose a zero-shot, user-centered video content retrieval framework that leverages a VLM to provide video data and query representations that users can incrementally combine to refine queries. Third, we compare our proposed framework to a baseline video player and analyze user strategies for answering vague video content retrieval scenarios in an experimental study. We report that both frameworks perform similarly, users favor our proposed framework, and, as far as navigation strategies go, users value classic interactions when initiating their search and rely on the abstract semantic video representation to refine their resulting moments. Full article
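
The kind of zero-shot building block such a framework rests on is sketched below: scoring sampled frames against a free-text query with a CLIP-style model. The model choice and frame sampling are assumptions, and the interactive query-refinement loop is not shown.

```python
# Sketch of zero-shot text-to-frame retrieval with a CLIP-style model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_frames(frame_paths, query, top_k=5):
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image: (num_frames, 1) similarity of each frame to the query
    scores = out.logits_per_image.squeeze(1)
    best = scores.topk(min(top_k, len(frame_paths)))
    return [(frame_paths[i], float(s)) for s, i in zip(best.values, best.indices)]

# e.g. rank_frames(sampled_frames, "the moment I put my keys on a wooden table")
```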

39 pages, 1298 KB  
Systematic Review
Vision-Based Collision Warning Systems with Deep Learning: A Systematic Review
by Charith Chitraranjan, Vipooshan Vipulananthan and Thuvarakan Sritharan
J. Imaging 2025, 11(2), 64; https://doi.org/10.3390/jimaging11020064 - 17 Feb 2025
Cited by 5 | Viewed by 3565
Abstract
Timely prediction of collisions enables advanced driver assistance systems to issue warnings and initiate emergency maneuvers as needed to avoid collisions. With recent developments in computer vision and deep learning, collision warning systems that use vision as the only sensory input have emerged. They are less expensive than those that use multiple sensors, but their effectiveness must be thoroughly assessed. We systematically searched academic literature for studies proposing ego-centric, vision-based collision warning systems that use deep learning techniques. Thirty-one studies among the search results satisfied our inclusion criteria. Risk of bias was assessed with PROBAST. We reviewed the selected studies and answer three primary questions: What are the (1) deep learning techniques used and how are they used? (2) datasets and experiments used to evaluate? (3) results achieved? We identified two main categories of methods: Those that use deep learning models to directly predict the probability of a future collision from input video, and those that use deep learning models at one or more stages of a pipeline to compute a threat metric before predicting collisions. More importantly, we show that the experimental evaluation of most systems is inadequate due to either not performing quantitative experiments or various biases present in the datasets used. Lack of suitable datasets is a major challenge to the evaluation of these systems and we suggest future work to address this issue. Full article
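
As a purely illustrative example of the threat metrics the second category computes (not any specific reviewed system), the sketch below estimates time-to-contact from the expansion rate of a tracked bounding box and thresholds it for a warning.

```python
# Illustrative threat metric: time-to-contact (TTC) from the expansion rate of a
# tracked bounding box. Detector/tracker outputs and thresholds are assumptions.
def time_to_contact(prev_height_px: float, curr_height_px: float, dt: float) -> float:
    """TTC ~ h / (dh/dt) for an object approaching on a near-constant bearing."""
    expansion_rate = (curr_height_px - prev_height_px) / dt
    if expansion_rate <= 0:          # not expanding -> not closing in
        return float("inf")
    return curr_height_px / expansion_rate

def warn(prev_h, curr_h, dt=1 / 30, ttc_threshold_s=2.0) -> bool:
    return time_to_contact(prev_h, curr_h, dt) < ttc_threshold_s

print(warn(121, 122))   # slow box growth -> TTC ~4 s -> False
print(warn(100, 140))   # rapid expansion between frames -> TTC ~0.1 s -> True
```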

22 pages, 44857 KB  
Article
Quantifying Dwell Time With Location-based Augmented Reality: Dynamic AOI Analysis on Mobile Eye Tracking Data With Vision Transformer
by Julien Mercier, Olivier Ertz and Erwan Bocher
J. Eye Mov. Res. 2024, 17(3), 1-22; https://doi.org/10.16910/jemr.17.3.3 - 29 Apr 2024
Cited by 9 | Viewed by 2565
Abstract
Mobile eye tracking captures egocentric vision and is well-suited for naturalistic studies. However, its data is noisy, especially when acquired outdoors with multiple participants over several sessions. Area of interest analysis on moving targets is difficult because (A) camera and objects move nonlinearly and may disappear/reappear from the scene; and (B) off-the-shelf analysis tools are limited to linearly moving objects. As a result, researchers resort to time-consuming manual annotation, which limits the use of mobile eye tracking in naturalistic studies. We introduce a method based on a fine-tuned Vision Transformer (ViT) model for classifying frames with overlaid gaze markers. After fine-tuning a model on a manually labelled training set made of 1.98% (=7845 frames) of our entire data for three epochs, our model reached 99.34% accuracy as evaluated on hold-out data. We used the method to quantify participants’ dwell time on a tablet during the outdoor user test of a mobile augmented reality application for biodiversity education. We discuss the benefits and limitations of our approach and its potential to be applied to other contexts. Full article
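
A minimal sketch of fine-tuning a pretrained ViT as a binary frame classifier (gaze marker on the area of interest vs. not); the torchvision backbone, optimizer, and training loop are assumptions rather than the authors' exact setup.

```python
# Sketch of fine-tuning a pretrained ViT for on-AOI / off-AOI frame classification.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

model = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)  # on-AOI / off-AOI

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()

def train(model, loader, epochs=3, device="cuda"):
    model.to(device).train()
    for _ in range(epochs):
        for frames, labels in loader:   # frames: (B, 3, 224, 224) with gaze marker drawn in
            optimizer.zero_grad()
            loss = criterion(model(frames.to(device)), labels.to(device))
            loss.backward()
            optimizer.step()

# Dwell time on the tablet ~= (# frames classified "on-AOI") / video frame rate.
```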

15 pages, 641 KB  
Article
A Multi-Modal Egocentric Activity Recognition Approach towards Video Domain Generalization
by Antonios Papadakis and Evaggelos Spyrou
Sensors 2024, 24(8), 2491; https://doi.org/10.3390/s24082491 - 12 Apr 2024
Cited by 5 | Viewed by 3548
Abstract
Egocentric activity recognition is a prominent computer vision task that is based on the use of wearable cameras. Since egocentric videos are captured through the perspective of the person wearing the camera, her/his body motions severely complicate the video content, imposing several challenges. In this work we propose a novel approach for domain-generalized egocentric human activity recognition. Typical approaches use a large amount of training data, aiming to cover all possible variants of each action. Moreover, several recent approaches have attempted to handle discrepancies between domains with a variety of costly and mostly unsupervised domain adaptation methods. In our approach we show that through simple manipulation of available source domain data and with minor involvement from the target domain, we are able to produce robust models, able to adequately predict human activity in egocentric video sequences. To this end, we introduce a novel three-stream deep neural network architecture combining elements of vision transformers and residual neural networks which are trained using multi-modal data. We evaluate the proposed approach using a challenging, egocentric video dataset and demonstrate its superiority over recent, state-of-the-art research works. Full article
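
A schematic of the multi-stream idea only: one head per modality with simple score-level fusion. The stream choices, feature dimensions, and fusion rule are illustrative assumptions, not the paper's three-stream architecture.

```python
# Schematic three-stream late fusion over per-modality features
# (e.g. RGB, optical flow, audio embeddings).
import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    def __init__(self, dims=(512, 512, 128), num_classes=8):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, num_classes))
            for d in dims
        )

    def forward(self, rgb_feat, flow_feat, audio_feat):
        logits = [h(x) for h, x in zip(self.heads, (rgb_feat, flow_feat, audio_feat))]
        return torch.stack(logits).mean(dim=0)   # simple late (score) fusion

model = ThreeStreamFusion()
out = model(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 8])
```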

21 pages, 179914 KB  
Article
Integrating Egocentric and Robotic Vision for Object Identification Using Siamese Networks and Superquadric Estimations in Partial Occlusion Scenarios
by Elisabeth Menendez, Santiago Martínez, Fernando Díaz-de-María and Carlos Balaguer
Biomimetics 2024, 9(2), 100; https://doi.org/10.3390/biomimetics9020100 - 8 Feb 2024
Cited by 5 | Viewed by 2740
Abstract
This paper introduces a novel method that enables robots to identify objects based on user gaze, tracked via eye-tracking glasses. This is achieved without prior knowledge of the objects’ categories or their locations and without external markers. The method integrates a two-part system: a category-agnostic object shape and pose estimator using superquadrics and Siamese networks. The superquadrics-based component estimates the shapes and poses of all objects, while the Siamese network matches the object targeted by the user’s gaze with the robot’s viewpoint. Both components are effectively designed to function in scenarios with partial occlusions. A key feature of the system is the user’s ability to move freely around the scenario, allowing dynamic object selection via gaze from any position. The system is capable of handling significant viewpoint differences between the user and the robot and adapts easily to new objects. In tests under partial occlusion conditions, the Siamese networks demonstrated an 85.2% accuracy in aligning the user-selected object with the robot’s viewpoint. This gaze-based Human–Robot Interaction approach demonstrates its practicality and adaptability in real-world scenarios. Full article
(This article belongs to the Special Issue Intelligent Human-Robot Interaction: 2nd Edition)
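
The Siamese matching step can be pictured as below: embed the gaze-selected crop from the user's view and candidate crops from the robot's view with a shared encoder, then pick the nearest in embedding space. The ResNet-18 encoder and cosine similarity are assumptions, not the trained network from the paper.

```python
# Sketch of Siamese matching between a gaze-selected crop and robot-view candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
backbone.fc = nn.Identity()          # shared embedding branch, 512-d output
backbone.eval()

@torch.no_grad()
def match(gaze_crop, robot_crops):
    """gaze_crop: (3,224,224); robot_crops: (N,3,224,224) -> index of best match."""
    q = F.normalize(backbone(gaze_crop.unsqueeze(0)), dim=1)   # (1, 512)
    c = F.normalize(backbone(robot_crops), dim=1)              # (N, 512)
    sims = (q @ c.T).squeeze(0)                                # cosine similarities
    return int(sims.argmax()), sims

idx, sims = match(torch.rand(3, 224, 224), torch.rand(5, 3, 224, 224))
print(idx, sims.shape)
```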

30 pages, 3417 KB  
Review
Modeling the Visual Landscape: A Review on Approaches, Methods and Techniques
by Loukas-Moysis Misthos, Vassilios Krassanakis, Nikolaos Merlemis and Anastasios L. Kesidis
Sensors 2023, 23(19), 8135; https://doi.org/10.3390/s23198135 - 28 Sep 2023
Cited by 24 | Viewed by 5956
Abstract
Modeling the perception and evaluation of landscapes from the human perspective is a desirable goal for several scientific domains and applications. Human vision is the dominant sense, and human eyes are the sensors for apperceiving the environmental stimuli of our surroundings. Therefore, exploring the experimental recording and measurement of the visual landscape can reveal crucial aspects about human visual perception responses while viewing the natural or man-made landscapes. Landscape evaluation (or assessment) is another dimension that refers mainly to preferences of the visual landscape, involving human cognition as well, in ways that are often unpredictable. Yet, landscape can be approached by both egocentric (i.e., human view) and exocentric (i.e., bird’s eye view) perspectives. The overarching approach of this review article lies in systematically presenting the different ways for modeling and quantifying the two ‘modalities’ of human perception and evaluation, under the two geometric perspectives, suggesting integrative approaches on these two ‘diverging’ dualities. To this end, several pertinent traditions/approaches, sensor-based experimental methods and techniques (e.g., eye tracking, fMRI, and EEG), and metrics are adduced and described. Essentially, this review article acts as a ‘guide-map’ for the delineation of the different activities related to landscape experience and/or management and to the valid or potentially suitable types of stimuli, sensors techniques, and metrics for each activity. Throughout our work, two main research directions are identified: (1) one that attempts to transfer the visual landscape experience/management from the one perspective to the other (and vice versa); (2) another one that aims to anticipate the visual perception of different landscapes and establish connections between perceptual processes and landscape preferences. As it appears, the research in the field is rapidly growing. In our opinion, it can be greatly advanced and enriched using integrative, interdisciplinary approaches in order to better understand the concepts and the mechanisms by which the visual landscape, as a complex set of stimuli, influences visual perception, potentially leading to more elaborate outcomes such as the anticipation of landscape preferences. As an effect, such approaches can support a rigorous, evidence-based, and socially just framework towards landscape management, protection, and decision making, based on a wide spectrum of well-suited and advanced sensor-based technologies. Full article
(This article belongs to the Section Sensing and Imaging)

20 pages, 2103 KB  
Article
Fusion of Appearance and Motion Features for Daily Activity Recognition from Egocentric Perspective
by Mohd Haris Lye, Nouar AlDahoul and Hezerul Abdul Karim
Sensors 2023, 23(15), 6804; https://doi.org/10.3390/s23156804 - 30 Jul 2023
Cited by 2 | Viewed by 1690
Abstract
Videos from a first-person or egocentric perspective offer a promising tool for recognizing various activities related to daily living. In the egocentric perspective, the video is obtained from a wearable camera, and this enables the capture of the person’s activities in a consistent viewpoint. Recognition of activity using a wearable sensor is challenging due to various reasons, such as motion blur and large variations. The existing methods are based on extracting handcrafted features from video frames to represent the contents. These features are domain-dependent, where features that are suitable for a specific dataset may not be suitable for others. In this paper, we propose a novel solution to recognize daily living activities from a pre-segmented video clip. The pre-trained convolutional neural network (CNN) model VGG16 is used to extract visual features from sampled video frames and then aggregated by the proposed pooling scheme. The proposed solution combines appearance and motion features extracted from video frames and optical flow images, respectively. The methods of mean and max spatial pooling (MMSP) and max mean temporal pyramid (TPMM) pooling are proposed to compose the final video descriptor. The feature is applied to a linear support vector machine (SVM) to recognize the type of activities observed in the video clip. The evaluation of the proposed solution was performed on three public benchmark datasets. We performed studies to show the advantage of aggregating appearance and motion features for daily activity recognition. The results show that the proposed solution is promising for recognizing activities of daily living. Compared to several methods on three public datasets, the proposed MMSP–TPMM method produces higher classification performance in terms of accuracy (90.38% with LENA dataset, 75.37% with ADL dataset, 96.08% with FPPA dataset) and average per-class precision (AP) (58.42% with ADL dataset and 96.11% with FPPA dataset). Full article
(This article belongs to the Special Issue Applications of Body Worn Sensors and Wearables)
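
The general recipe, sketched with simplifications: frozen VGG16 features from sampled frames, plain mean+max temporal pooling, and a linear SVM. The paper's MMSP and TPMM pooling schemes are more elaborate, so this only illustrates the shape of the pipeline.

```python
# Sketch: frozen VGG16 frame features -> mean+max temporal pooling -> linear SVM.
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights
from sklearn.svm import LinearSVC

vgg = vgg16(weights=VGG16_Weights.DEFAULT)
vgg.classifier = nn.Sequential(*list(vgg.classifier.children())[:-1])  # 4096-d fc7 features
vgg.eval()

@torch.no_grad()
def video_descriptor(frames: torch.Tensor) -> np.ndarray:
    """frames: (T, 3, 224, 224) sampled RGB frames -> 8192-d clip descriptor."""
    feats = vgg(frames)                                       # (T, 4096) per-frame features
    pooled = torch.cat([feats.mean(0), feats.max(0).values])  # mean + max temporal pooling
    return pooled.numpy()

clf = LinearSVC(C=1.0)  # linear SVM over the pooled clip descriptors
# X = np.stack([video_descriptor(clip) for clip in training_clips]); y = labels
# clf.fit(X, y); clf.predict(video_descriptor(test_clip).reshape(1, -1))
```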

16 pages, 3400 KB  
Article
Vision-Based Recognition of Human Motion Intent during Staircase Approaching
by Md Rafi Islam, Md Rejwanul Haque, Masudul H. Imtiaz, Xiangrong Shen and Edward Sazonov
Sensors 2023, 23(11), 5355; https://doi.org/10.3390/s23115355 - 5 Jun 2023
Cited by 8 | Viewed by 3474
Abstract
Walking in real-world environments involves constant decision-making, e.g., when approaching a staircase, an individual decides whether to engage (climbing the stairs) or avoid. For the control of assistive robots (e.g., robotic lower-limb prostheses), recognizing such motion intent is an important but challenging task, primarily due to the lack of available information. This paper presents a novel vision-based method to recognize an individual’s motion intent when approaching a staircase before the potential transition of motion mode (walking to stair climbing) occurs. Leveraging the egocentric images from a head-mounted camera, the authors trained a YOLOv5 object detection model to detect staircases. Subsequently, an AdaBoost and gradient boost (GB) classifier was developed to recognize the individual’s intention of engaging or avoiding the upcoming stairway. This novel method has been demonstrated to provide reliable (97.69%) recognition at least 2 steps before the potential mode transition, which is expected to provide ample time for the controller mode transition in an assistive robot in real-world use. Full article
(This article belongs to the Special Issue Human Activity Recognition in Smart Sensing Environment)
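
The two-stage idea can be sketched as below: a YOLOv5 detector provides staircase bounding boxes whose geometry feeds a gradient-boosting classifier for the engage/avoid decision. The torch.hub model, the feature vector, and the hyperparameters are assumptions; the authors' detector would be trained on staircase images.

```python
# Sketch: YOLOv5 staircase detection -> bounding-box features -> engage/avoid classifier.
import torch
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Generic YOLOv5 weights shown for illustration; a model fine-tuned on staircase
# images would be used in practice.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")

def stair_features(frame_bgr):
    """Return [cx, cy, w, h, conf] of the most confident detection, or None."""
    det = detector(frame_bgr).xywh[0]          # (N, 6): cx, cy, w, h, conf, cls
    if len(det) == 0:
        return None
    cx, cy, w, h, conf, _ = det[det[:, 4].argmax()].tolist()
    return np.array([cx, cy, w, h, conf])

# Intent classifier trained offline on labelled approach sequences (engage=1, avoid=0)
intent_clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
# intent_clf.fit(np.stack(feature_rows), labels)
# intent_clf.predict_proba(stair_features(frame).reshape(1, -1))
```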

24 pages, 14566 KB  
Article
YOLO Series for Human Hand Action Detection and Classification from Egocentric Videos
by Hung-Cuong Nguyen, Thi-Hao Nguyen, Rafał Scherer and Van-Hung Le
Sensors 2023, 23(6), 3255; https://doi.org/10.3390/s23063255 - 20 Mar 2023
Cited by 25 | Viewed by 9105
Abstract
Hand detection and classification is a very important pre-processing step in building applications based on three-dimensional (3D) hand pose estimation and hand activity recognition. To automatically limit the hand data area on egocentric vision (EV) datasets, especially to see the development and performance of the “You Only Look Once” (YOLO) network over the past seven years, we propose a study comparing the efficiency of hand detection and classification based on the YOLO-family networks. This study is based on the following problems: (1) systematizing all architectures, advantages, and disadvantages of YOLO-family networks from version (v)1 to v7; (2) preparing ground-truth data for pre-trained models and evaluation models of hand detection and classification on EV datasets (FPHAB, HOI4D, RehabHand); (3) fine-tuning the hand detection and classification model based on the YOLO-family networks, hand detection, and classification evaluation on the EV datasets. Hand detection and classification results on the YOLOv7 network and its variations were the best across all three datasets. The results of the YOLOv7-w6 network are as follows: FPHAB is P = 97% with ThreshIOU = 0.5; HOI4D is P = 95% with ThreshIOU = 0.5; RehabHand is larger than 95% with ThreshIOU = 0.5; the processing speed of YOLOv7-w6 is 60 fps with a resolution of 1280 × 1280 pixels and that of YOLOv7 is 133 fps with a resolution of 640 × 640 pixels. Full article
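
The precision-at-IoU-threshold figures reported above can be computed along the lines of the sketch below; the greedy one-to-one matching is simplified compared with standard COCO/VOC evaluation protocols.

```python
# Sketch of precision at an IoU threshold (0.5 here) for detections vs. ground truth.
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(detections, ground_truths, thresh=0.5):
    matched, tp = set(), 0
    for det in detections:                    # assume detections sorted by confidence
        best = max(range(len(ground_truths)),
                   key=lambda i: iou(det, ground_truths[i]), default=None)
        if best is not None and best not in matched and iou(det, ground_truths[best]) >= thresh:
            matched.add(best)
            tp += 1
    return tp / len(detections) if detections else 0.0

gts = [(10, 10, 60, 60)]
dets = [(12, 8, 58, 62), (100, 100, 140, 140)]
print(precision_at_iou(dets, gts))  # 0.5: one true positive out of two detections
```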

18 pages, 4981 KB  
Article
A Systematic Approach for Developing a Robust Artwork Recognition Framework Using Smartphone Cameras
by Zenonas Theodosiou, Marios Thoma, Harris Partaourides and Andreas Lanitis
Algorithms 2022, 15(9), 305; https://doi.org/10.3390/a15090305 - 27 Aug 2022
Cited by 3 | Viewed by 2884
Abstract
The provision of information encourages people to visit cultural sites more often. Exploiting the great potential of using smartphone cameras and egocentric vision, we describe the development of a robust artwork recognition algorithm to assist users when visiting an art space. The algorithm recognizes artworks under any physical museum conditions, as well as camera point of views, making it suitable for different use scenarios towards an enhanced visiting experience. The algorithm was developed following a multiphase approach, including requirements gathering, experimentation in a virtual environment, development of the algorithm in real environment conditions, implementation of a demonstration smartphone app for artwork recognition and provision of assistive information, and its evaluation. During the algorithm development process, a convolutional neural network (CNN) model was trained for automatic artwork recognition using data collected in an art gallery, followed by extensive evaluations related to the parameters that may affect recognition accuracy, while the optimized algorithm was also evaluated through a dedicated app by a group of volunteers with promising results. The overall algorithm design and evaluation adopted for this work can also be applied in numerous applications, especially in cases where the algorithm performance under varying conditions and end-user satisfaction are critical factors. Full article