Search Results (19)

Search Parameters:
Keywords = egocentric vision

26 pages, 2873 KiB  
Article
Interactive Content Retrieval in Egocentric Videos Based on Vague Semantic Queries
by Linda Ablaoui, Wilson Estecio Marcilio-Jr, Lai Xing Ng, Christophe Jouffrais and Christophe Hurter
Multimodal Technol. Interact. 2025, 9(7), 66; https://doi.org/10.3390/mti9070066 - 30 Jun 2025
Viewed by 602
Abstract
Retrieving specific, often instantaneous, content from hours-long egocentric video footage based on hazily remembered details is challenging. Vision–language models (VLMs) have been employed to enable zero-shot text-based content retrieval from videos. However, they fall short when the textual query contains ambiguous terms or users fail to specify their queries enough, leading to vague semantic queries. Such queries can refer to several different video moments, not all of which are relevant, making it harder to pinpoint content. We investigate the requirements for an egocentric video content retrieval framework that helps users handle vague queries. First, we narrow down vague query formulation factors and limit them to ambiguity and incompleteness. Second, we propose a zero-shot, user-centered video content retrieval framework that leverages a VLM to provide video data and query representations that users can incrementally combine to refine queries. Third, we compare our proposed framework to a baseline video player and analyze user strategies for answering vague video content retrieval scenarios in an experimental study. We report that both frameworks perform similarly, that users favor our proposed framework, and that, regarding navigation strategies, users value classic interactions when initiating their search and rely on the abstract semantic video representation to refine their resulting moments. Full article
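The framework itself is not reproduced here, but the zero-shot text-to-frame retrieval step it builds on can be illustrated with an off-the-shelf VLM. Below is a minimal sketch using CLIP via Hugging Face Transformers, assuming `frames` is a list of PIL images sampled from the egocentric video and `query` is the (possibly vague) text query; the model checkpoint and top-k selection are illustrative choices, not the paper's exact pipeline.

```python
# Minimal sketch of zero-shot text-to-frame retrieval with an off-the-shelf VLM (CLIP).
# `frames` is assumed to be a list of PIL images sampled from the egocentric video.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_frames(frames, query, top_k=10):
    """Return indices of the frames most similar to the text query."""
    with torch.no_grad():
        image_inputs = processor(images=frames, return_tensors="pt")
        text_inputs = processor(text=[query], return_tensors="pt", padding=True)
        img_emb = model.get_image_features(**image_inputs)
        txt_emb = model.get_text_features(**text_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (img_emb @ txt_emb.T).squeeze(-1)          # cosine similarity per frame
    return scores.topk(min(top_k, len(frames))).indices.tolist()
```

A vague query ("the moment I put something on the kitchen counter") would simply return several high-scoring but possibly irrelevant frames, which is the failure mode the proposed interactive refinement is meant to address.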

39 pages, 1298 KiB  
Systematic Review
Vision-Based Collision Warning Systems with Deep Learning: A Systematic Review
by Charith Chitraranjan, Vipooshan Vipulananthan and Thuvarakan Sritharan
J. Imaging 2025, 11(2), 64; https://doi.org/10.3390/jimaging11020064 - 17 Feb 2025
Cited by 1 | Viewed by 1406
Abstract
Timely prediction of collisions enables advanced driver assistance systems to issue warnings and initiate emergency maneuvers as needed to avoid collisions. With recent developments in computer vision and deep learning, collision warning systems that use vision as the only sensory input have emerged. They are less expensive than those that use multiple sensors, but their effectiveness must be thoroughly assessed. We systematically searched the academic literature for studies proposing egocentric, vision-based collision warning systems that use deep learning techniques. Thirty-one studies among the search results satisfied our inclusion criteria. Risk of bias was assessed with PROBAST. We reviewed the selected studies to answer three primary questions: (1) What deep learning techniques are used, and how are they used? (2) What datasets and experiments are used for evaluation? (3) What results are achieved? We identified two main categories of methods: those that use deep learning models to directly predict the probability of a future collision from input video, and those that use deep learning models at one or more stages of a pipeline to compute a threat metric before predicting collisions. More importantly, we show that the experimental evaluation of most systems is inadequate, either because no quantitative experiments were performed or because of various biases present in the datasets used. The lack of suitable datasets is a major challenge to the evaluation of these systems, and we suggest future work to address this issue. Full article
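As a concrete example of the review's second category (a pipeline that computes a threat metric before predicting collisions), here is a minimal sketch of a scale-expansion time-to-collision estimate from tracked bounding boxes; the box format, the 30 fps frame rate, and the 2 s warning threshold are illustrative assumptions, not taken from any reviewed study.

```python
# Sketch of a "pipeline" threat metric: estimate time-to-collision (TTC) from the scale
# expansion of a tracked object's bounding box between consecutive frames, then warn
# below a threshold. Box format and the 2.0 s threshold are illustrative assumptions.

def box_height(box):
    """box = (x1, y1, x2, y2) in pixels."""
    return box[3] - box[1]

def ttc_from_scale_change(prev_box, curr_box, dt):
    """TTC ~ dt / (s_t / s_{t-1} - 1); valid while the object is approaching."""
    ratio = box_height(curr_box) / box_height(prev_box)
    if ratio <= 1.0:          # object not growing in the image -> not approaching
        return float("inf")
    return dt / (ratio - 1.0)

def collision_warning(prev_box, curr_box, dt=1 / 30, ttc_threshold=2.0):
    return ttc_from_scale_change(prev_box, curr_box, dt) < ttc_threshold

# Example: a lead object's box grows from 80 px to 90 px over one 30 fps frame.
print(ttc_from_scale_change((0, 0, 50, 80), (0, 0, 56, 90), dt=1 / 30))  # ~0.27 s
```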

22 pages, 44857 KiB  
Article
Quantifying Dwell Time With Location-based Augmented Reality: Dynamic AOI Analysis on Mobile Eye Tracking Data With Vision Transformer
by Julien Mercier, Olivier Ertz and Erwan Bocher
J. Eye Mov. Res. 2024, 17(3), 1-22; https://doi.org/10.16910/jemr.17.3.3 - 29 Apr 2024
Cited by 5 | Viewed by 504
Abstract
Mobile eye tracking captures egocentric vision and is well suited for naturalistic studies. However, its data are noisy, especially when acquired outdoors with multiple participants over several sessions. Area-of-interest analysis on moving targets is difficult because (A) the camera and objects move nonlinearly and may disappear/reappear from the scene; and (B) off-the-shelf analysis tools are limited to linearly moving objects. As a result, researchers resort to time-consuming manual annotation, which limits the use of mobile eye tracking in naturalistic studies. We introduce a method based on a fine-tuned Vision Transformer (ViT) model for classifying frames with overlaid gaze markers. After fine-tuning a model for three epochs on a manually labelled training set made of 1.98% (7845 frames) of our entire data, our model reached 99.34% accuracy as evaluated on hold-out data. We used the method to quantify participants’ dwell time on a tablet during the outdoor user test of a mobile augmented reality application for biodiversity education. We discuss the benefits and limitations of our approach and its potential to be applied to other contexts. Full article
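The core of the method, classifying each scene-camera frame and summing the on-AOI runs into dwell times, can be sketched as follows; the backbone checkpoint, the two-class head, the label convention, and the 25 fps rate are assumptions, and in the paper the ViT is fine-tuned on manually labelled frames rather than used with an untrained head as here.

```python
# Sketch of the dynamic-AOI idea: a (fine-tuned) ViT labels each scene-camera frame as
# gaze-on-tablet or not, and dwell time is the duration of each run of positive frames.
import torch
from transformers import ViTForImageClassification, ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=2, ignore_mismatched_sizes=True
)  # stands in for the fine-tuned checkpoint; the real head is trained on labelled frames

def classify_frames(frames, batch_size=16):
    """frames: list of PIL scene-camera images with the gaze marker already overlaid."""
    labels = []
    for i in range(0, len(frames), batch_size):
        inputs = processor(images=frames[i:i + batch_size], return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        labels.extend(logits.argmax(dim=-1).tolist())
    return labels  # assumed convention: 1 = gaze on the tablet AOI, 0 = elsewhere

def dwell_times(per_frame_labels, fps=25.0):
    """Duration (s) of each run of consecutive on-AOI frames."""
    runs, count = [], 0
    for label in per_frame_labels + [0]:   # sentinel closes a trailing run
        if label == 1:
            count += 1
        elif count:
            runs.append(count / fps)
            count = 0
    return runs

print(dwell_times([0, 1, 1, 1, 0, 0, 1, 1], fps=25.0))  # [0.12, 0.08]
```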

15 pages, 641 KiB  
Article
A Multi-Modal Egocentric Activity Recognition Approach towards Video Domain Generalization
by Antonios Papadakis and Evaggelos Spyrou
Sensors 2024, 24(8), 2491; https://doi.org/10.3390/s24082491 - 12 Apr 2024
Cited by 4 | Viewed by 2437
Abstract
Egocentric activity recognition is a prominent computer vision task that is based on the use of wearable cameras. Since egocentric videos are captured through the perspective of the person wearing the camera, her/his body motions severely complicate the video content, imposing several challenges. In this work we propose a novel approach for domain-generalized egocentric human activity recognition. Typical approaches use a large amount of training data, aiming to cover all possible variants of each action. Moreover, several recent approaches have attempted to handle discrepancies between domains with a variety of costly and mostly unsupervised domain adaptation methods. In our approach we show that through simple manipulation of available source domain data and with minor involvement from the target domain, we are able to produce robust models, able to adequately predict human activity in egocentric video sequences. To this end, we introduce a novel three-stream deep neural network architecture combining elements of vision transformers and residual neural networks which are trained using multi-modal data. We evaluate the proposed approach using a challenging, egocentric video dataset and demonstrate its superiority over recent, state-of-the-art research works. Full article
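The exact three-stream architecture is described in the paper; below is only a generic three-stream, late-fusion skeleton in PyTorch to illustrate the idea, with the stream backbones and modalities (RGB, optical flow, audio features) chosen as assumptions.

```python
# Generic three-stream, late-fusion skeleton, loosely in the spirit of the architecture
# described above (the paper's actual streams, modalities, and fusion differ).
import torch
import torch.nn as nn
from torchvision import models

class ThreeStreamClassifier(nn.Module):
    def __init__(self, num_classes, audio_dim=128):
        super().__init__()
        # Stream 1: RGB frames, ResNet-18 backbone without its classification head.
        self.rgb = models.resnet18(weights=None)
        self.rgb.fc = nn.Identity()
        # Stream 2: optical-flow frames (2-channel input), another ResNet-18.
        self.flow = models.resnet18(weights=None)
        self.flow.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.flow.fc = nn.Identity()
        # Stream 3: a small transformer encoder over a sequence of audio features.
        enc_layer = nn.TransformerEncoderLayer(d_model=audio_dim, nhead=4, batch_first=True)
        self.audio = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Late fusion: concatenate the three stream embeddings and classify.
        self.head = nn.Linear(512 + 512 + audio_dim, num_classes)

    def forward(self, rgb, flow, audio):
        f_rgb = self.rgb(rgb)                    # (B, 512)
        f_flow = self.flow(flow)                 # (B, 512)
        f_audio = self.audio(audio).mean(dim=1)  # (B, audio_dim), averaged over time
        return self.head(torch.cat([f_rgb, f_flow, f_audio], dim=1))

model = ThreeStreamClassifier(num_classes=8)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 2, 224, 224), torch.randn(2, 16, 128))
```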

21 pages, 179914 KiB  
Article
Integrating Egocentric and Robotic Vision for Object Identification Using Siamese Networks and Superquadric Estimations in Partial Occlusion Scenarios
by Elisabeth Menendez, Santiago Martínez, Fernando Díaz-de-María and Carlos Balaguer
Biomimetics 2024, 9(2), 100; https://doi.org/10.3390/biomimetics9020100 - 8 Feb 2024
Cited by 4 | Viewed by 2202
Abstract
This paper introduces a novel method that enables robots to identify objects based on user gaze, tracked via eye-tracking glasses. This is achieved without prior knowledge of the objects’ categories or their locations and without external markers. The method integrates a two-part system: a category-agnostic object shape and pose estimator using superquadrics and Siamese networks. The superquadrics-based component estimates the shapes and poses of all objects, while the Siamese network matches the object targeted by the user’s gaze with the robot’s viewpoint. Both components are effectively designed to function in scenarios with partial occlusions. A key feature of the system is the user’s ability to move freely around the scenario, allowing dynamic object selection via gaze from any position. The system is capable of handling significant viewpoint differences between the user and the robot and adapts easily to new objects. In tests under partial occlusion conditions, the Siamese networks demonstrated an 85.2% accuracy in aligning the user-selected object with the robot’s viewpoint. This gaze-based Human–Robot Interaction approach demonstrates its practicality and adaptability in real-world scenarios. Full article
(This article belongs to the Special Issue Intelligent Human-Robot Interaction: 2nd Edition)
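The Siamese matching step, embedding the gaze-selected crop from the user's view and every candidate crop from the robot's view with a shared encoder and picking the most similar candidate, can be sketched as follows; the backbone, embedding size, and cropping pipeline are assumptions, and the superquadric shape/pose estimation stage is omitted.

```python
# Minimal Siamese-matching sketch: a shared CNN embeds the gaze-selected crop and every
# candidate object crop from the robot's viewpoint; the most similar candidate wins.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SiameseEncoder(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, x):
        return F.normalize(self.backbone(x), dim=-1)   # unit-norm embeddings

def match_gazed_object(encoder, user_crop, robot_crops):
    """user_crop: (1, 3, H, W); robot_crops: (N, 3, H, W). Returns index of best match."""
    with torch.no_grad():
        q = encoder(user_crop)            # (1, D)
        c = encoder(robot_crops)          # (N, D)
    sims = (c @ q.T).squeeze(-1)          # cosine similarity (embeddings are unit norm)
    return int(sims.argmax())

encoder = SiameseEncoder()
best = match_gazed_object(encoder, torch.randn(1, 3, 224, 224), torch.randn(5, 3, 224, 224))
```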

30 pages, 3417 KiB  
Review
Modeling the Visual Landscape: A Review on Approaches, Methods and Techniques
by Loukas-Moysis Misthos, Vassilios Krassanakis, Nikolaos Merlemis and Anastasios L. Kesidis
Sensors 2023, 23(19), 8135; https://doi.org/10.3390/s23198135 - 28 Sep 2023
Cited by 19 | Viewed by 4630
Abstract
Modeling the perception and evaluation of landscapes from the human perspective is a desirable goal for several scientific domains and applications. Human vision is the dominant sense, and human eyes are the sensors for apperceiving the environmental stimuli of our surroundings. Therefore, exploring the experimental recording and measurement of the visual landscape can reveal crucial aspects about human visual perception responses while viewing natural or man-made landscapes. Landscape evaluation (or assessment) is another dimension that refers mainly to preferences of the visual landscape, involving human cognition as well, in ways that are often unpredictable. Yet, landscape can be approached from both egocentric (i.e., human view) and exocentric (i.e., bird’s eye view) perspectives. The overarching approach of this review article lies in systematically presenting the different ways of modeling and quantifying the two ‘modalities’ of human perception and evaluation, under the two geometric perspectives, suggesting integrative approaches to these two ‘diverging’ dualities. To this end, several pertinent traditions/approaches, sensor-based experimental methods and techniques (e.g., eye tracking, fMRI, and EEG), and metrics are adduced and described. Essentially, this review article acts as a ‘guide-map’ for the delineation of the different activities related to landscape experience and/or management and of the valid or potentially suitable types of stimuli, sensor techniques, and metrics for each activity. Throughout our work, two main research directions are identified: (1) one that attempts to transfer the visual landscape experience/management from one perspective to the other (and vice versa); (2) another that aims to anticipate the visual perception of different landscapes and establish connections between perceptual processes and landscape preferences. As it appears, research in the field is rapidly growing. In our opinion, it can be greatly advanced and enriched using integrative, interdisciplinary approaches in order to better understand the concepts and the mechanisms by which the visual landscape, as a complex set of stimuli, influences visual perception, potentially leading to more elaborate outcomes such as the anticipation of landscape preferences. In effect, such approaches can support a rigorous, evidence-based, and socially just framework for landscape management, protection, and decision making, based on a wide spectrum of well-suited and advanced sensor-based technologies. Full article
(This article belongs to the Section Sensing and Imaging)

20 pages, 2103 KiB  
Article
Fusion of Appearance and Motion Features for Daily Activity Recognition from Egocentric Perspective
by Mohd Haris Lye, Nouar AlDahoul and Hezerul Abdul Karim
Sensors 2023, 23(15), 6804; https://doi.org/10.3390/s23156804 - 30 Jul 2023
Cited by 2 | Viewed by 1382
Abstract
Videos from a first-person or egocentric perspective offer a promising tool for recognizing various activities related to daily living. In the egocentric perspective, the video is obtained from a wearable camera, which enables the capture of the person’s activities from a consistent viewpoint. Recognition of activity using a wearable sensor is challenging for various reasons, such as motion blur and large variations. The existing methods are based on extracting handcrafted features from video frames to represent the contents. These features are domain-dependent, where features that are suitable for a specific dataset may not be suitable for others. In this paper, we propose a novel solution to recognize daily living activities from a pre-segmented video clip. The pre-trained convolutional neural network (CNN) model VGG16 is used to extract visual features from sampled video frames, which are then aggregated by the proposed pooling scheme. The proposed solution combines appearance and motion features extracted from video frames and optical flow images, respectively. The methods of mean and max spatial pooling (MMSP) and max mean temporal pyramid (TPMM) pooling are proposed to compose the final video descriptor. The feature is applied to a linear support vector machine (SVM) to recognize the type of activity observed in the video clip. The evaluation of the proposed solution was performed on three public benchmark datasets. We performed studies to show the advantage of aggregating appearance and motion features for daily activity recognition. The results show that the proposed solution is promising for recognizing activities of daily living. Compared to several methods on three public datasets, the proposed MMSP–TPMM method produces higher classification performance in terms of accuracy (90.38% with the LENA dataset, 75.37% with the ADL dataset, 96.08% with the FPPA dataset) and average per-class precision (AP) (58.42% with the ADL dataset and 96.11% with the FPPA dataset). Full article
(This article belongs to the Special Issue Applications of Body Worn Sensors and Wearables)
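A rough sketch of the appearance branch follows: VGG16 convolutional features per sampled frame, a mean+max aggregation over frames (a simplified stand-in for the proposed MMSP/TPMM schemes), and a linear SVM on the resulting clip descriptor; the frame sampling and toy training data are placeholders.

```python
# Appearance branch sketch: VGG16 features per frame -> mean/max aggregation over the
# clip -> linear SVM on the clip descriptor. Not the paper's exact pooling schemes.
import numpy as np
import torch
from torchvision import models
from sklearn.svm import LinearSVC

vgg = models.vgg16(weights=None)   # in practice, ImageNet-pre-trained weights
feature_extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

def clip_descriptor(frames):
    """frames: tensor (T, 3, 224, 224) of sampled video frames -> 1D clip descriptor."""
    with torch.no_grad():
        feats = feature_extractor(frames)          # (T, 25088)
    mean_pool = feats.mean(dim=0)
    max_pool = feats.max(dim=0).values
    return torch.cat([mean_pool, max_pool]).numpy()

# Hypothetical training data: one pre-segmented clip per row, with activity labels.
X = np.stack([clip_descriptor(torch.randn(8, 3, 224, 224)) for _ in range(4)])
y = np.array([0, 1, 0, 1])
clf = LinearSVC().fit(X, y)
print(clf.predict(X[:1]))
```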

16 pages, 3400 KiB  
Article
Vision-Based Recognition of Human Motion Intent during Staircase Approaching
by Md Rafi Islam, Md Rejwanul Haque, Masudul H. Imtiaz, Xiangrong Shen and Edward Sazonov
Sensors 2023, 23(11), 5355; https://doi.org/10.3390/s23115355 - 5 Jun 2023
Cited by 6 | Viewed by 2664
Abstract
Walking in real-world environments involves constant decision-making; e.g., when approaching a staircase, an individual decides whether to engage (climb the stairs) or avoid it. For the control of assistive robots (e.g., robotic lower-limb prostheses), recognizing such motion intent is an important but challenging task, primarily due to the lack of available information. This paper presents a novel vision-based method to recognize an individual’s motion intent when approaching a staircase, before the potential transition of motion mode (walking to stair climbing) occurs. Leveraging the egocentric images from a head-mounted camera, the authors trained a YOLOv5 object detection model to detect staircases. Subsequently, AdaBoost and gradient boosting (GB) classifiers were developed to recognize the individual’s intention of engaging or avoiding the upcoming stairway. This novel method has been demonstrated to provide reliable (97.69%) recognition at least two steps before the potential mode transition, which is expected to provide ample time for the controller mode transition in an assistive robot in real-world use. Full article
(This article belongs to the Special Issue Human Activity Recognition in Smart Sensing Environment)
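The two-stage structure, staircase detection followed by an intent classifier, can be sketched roughly as below; the YOLOv5 hub model stands in for the fine-tuned staircase detector, and the geometric features and toy labels are assumptions rather than the paper's design.

```python
# Illustrative two-stage sketch: detect the staircase in each egocentric frame, then
# feed simple geometric features of the detection to a gradient-boosting classifier
# for engage-vs-avoid intent.
import numpy as np
import torch
from sklearn.ensemble import GradientBoostingClassifier

detector = torch.hub.load("ultralytics/yolov5", "yolov5s")  # stand-in for a staircase-tuned model

def stair_features(frame):
    """frame: HxWx3 numpy image. Returns [area_fraction, x_offset, y_center] of the
    highest-confidence detection, or zeros if nothing is detected."""
    det = detector(frame).xyxy[0]                 # (N, 6): x1, y1, x2, y2, conf, class
    if len(det) == 0:
        return np.zeros(3)
    x1, y1, x2, y2 = det[0, :4].tolist()
    h, w = frame.shape[:2]
    return np.array([(x2 - x1) * (y2 - y1) / (h * w),
                     ((x1 + x2) / 2 - w / 2) / w,
                     (y1 + y2) / 2 / h])

# Hypothetical training set: per-frame features with engage (1) / avoid (0) labels.
X_train = np.random.rand(100, 3)
y_train = np.random.randint(0, 2, 100)
intent_clf = GradientBoostingClassifier().fit(X_train, y_train)
```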

24 pages, 14566 KiB  
Article
YOLO Series for Human Hand Action Detection and Classification from Egocentric Videos
by Hung-Cuong Nguyen, Thi-Hao Nguyen, Rafał Scherer and Van-Hung Le
Sensors 2023, 23(6), 3255; https://doi.org/10.3390/s23063255 - 20 Mar 2023
Cited by 21 | Viewed by 8101
Abstract
Hand detection and classification is a very important pre-processing step in building applications based on three-dimensional (3D) hand pose estimation and hand activity recognition. To automatically limit the hand data area in egocentric vision (EV) datasets, and especially to examine the development and performance of the “You Only Look Once” (YOLO) network over the past seven years, we propose a study comparing the efficiency of hand detection and classification based on the YOLO family of networks. This study addresses the following problems: (1) systematizing all architectures, advantages, and disadvantages of the YOLO-family networks from version (v)1 to v7; (2) preparing ground-truth data for pre-trained models and evaluation models of hand detection and classification on EV datasets (FPHAB, HOI4D, RehabHand); (3) fine-tuning the hand detection and classification model based on the YOLO-family networks and evaluating hand detection and classification on the EV datasets. Hand detection and classification results of the YOLOv7 network and its variants were the best across all three datasets. The results of the YOLOv7-w6 network are as follows: P = 97% on FPHAB, P = 95% on HOI4D, and P > 95% on RehabHand, all at an IoU threshold of 0.5; the processing speed of YOLOv7-w6 is 60 fps at a resolution of 1280 × 1280 pixels, and that of YOLOv7 is 133 fps at a resolution of 640 × 640 pixels. Full article
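Below is a small helper illustrating the reported metric, detection precision at an IoU threshold of 0.5 (a prediction counts as correct if it overlaps an unmatched ground-truth hand box with IoU ≥ 0.5); greedy matching and ignoring class labels are simplifications of a full evaluation protocol.

```python
# Precision at IoU >= 0.5 for predicted vs. ground-truth boxes (greedy matching).
def iou(a, b):
    """a, b = (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def precision_at_iou(preds, gts, thresh=0.5):
    matched, tp = set(), 0
    for p in preds:
        for j, g in enumerate(gts):
            if j not in matched and iou(p, g) >= thresh:
                matched.add(j)
                tp += 1
                break
    return tp / max(len(preds), 1)

print(precision_at_iou([(10, 10, 50, 50), (60, 60, 90, 90)], [(12, 8, 52, 48)]))  # 0.5
```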

18 pages, 4981 KiB  
Article
A Systematic Approach for Developing a Robust Artwork Recognition Framework Using Smartphone Cameras
by Zenonas Theodosiou, Marios Thoma, Harris Partaourides and Andreas Lanitis
Algorithms 2022, 15(9), 305; https://doi.org/10.3390/a15090305 - 27 Aug 2022
Cited by 3 | Viewed by 2402
Abstract
The provision of information encourages people to visit cultural sites more often. Exploiting the great potential of smartphone cameras and egocentric vision, we describe the development of a robust artwork recognition algorithm to assist users when visiting an art space. The algorithm recognizes artworks under any physical museum conditions and camera viewpoint, making it suitable for different use scenarios towards an enhanced visiting experience. The algorithm was developed following a multiphase approach, including requirements gathering, experimentation in a virtual environment, development of the algorithm under real environment conditions, implementation of a demonstration smartphone app for artwork recognition and provision of assistive information, and its evaluation. During the algorithm development process, a convolutional neural network (CNN) model was trained for automatic artwork recognition using data collected in an art gallery, followed by extensive evaluations of the parameters that may affect recognition accuracy, while the optimized algorithm was also evaluated through a dedicated app by a group of volunteers, with promising results. The overall algorithm design and evaluation adopted for this work can also be applied in numerous applications, especially in cases where algorithm performance under varying conditions and end-user satisfaction are critical factors. Full article
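A bare-bones stand-in for the recognition component is sketched below: fine-tuning a small ImageNet backbone so its last layer predicts artwork identities. The backbone, the number of artworks, and the training step are assumptions; the paper's data collection, augmentation, and evaluation are not reproduced.

```python
# Transfer-learning stand-in for artwork recognition from smartphone photos.
import torch
import torch.nn as nn
from torchvision import models

num_artworks = 30                      # illustrative number of artworks in the gallery
model = models.mobilenet_v3_small(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, num_artworks)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224) gallery photos; labels: (B,) artwork ids."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```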

22 pages, 3116 KiB  
Article
Impact of a Vibrotactile Belt on Emotionally Challenging Everyday Situations of the Blind
by Charlotte Brandebusemeyer, Anna Ricarda Luther, Sabine U. König, Peter König and Silke M. Kärcher
Sensors 2021, 21(21), 7384; https://doi.org/10.3390/s21217384 - 6 Nov 2021
Cited by 6 | Viewed by 4319
Abstract
Spatial orientation and navigation depend primarily on vision. Blind people lack this critical source of information. To facilitate wayfinding and to increase the feeling of safety for these people, the “feelSpace belt” was developed. The belt signals magnetic north as a fixed reference frame via vibrotactile stimulation. This study investigates the effect of the belt on typical orientation and navigation tasks and evaluates the emotional impact. Eleven blind subjects wore the belt daily for seven weeks. Before, during and after the study period, they filled in questionnaires to document their experiences. A small sub-group of the subjects took part in behavioural experiments before and after four weeks of training, i.e., a straight-line walking task to evaluate the belt’s effect on keeping a straight heading, an angular rotation task to examine effects on egocentric orientation, and a triangle completion navigation task to test the ability to take shortcuts. The belt reduced subjective discomfort and increased confidence during navigation. Additionally, the participants felt safer wearing the belt in various outdoor situations. Furthermore, the behavioural tasks point towards an intuitive comprehension of the belt. Altogether, the blind participants benefited from the vibrotactile belt as an assistive technology in challenging everyday situations. Full article
(This article belongs to the Special Issue Intelligent Systems and Sensors for Assistive Technology)

15 pages, 2532 KiB  
Article
Braille Block Detection via Multi-Objective Optimization from an Egocentric Viewpoint
by Tsubasa Takano, Takumi Nakane, Takuya Akashi and Chao Zhang
Sensors 2021, 21(8), 2775; https://doi.org/10.3390/s21082775 - 14 Apr 2021
Cited by 4 | Viewed by 3347
Abstract
In this paper, we propose a method to detect Braille blocks from an egocentric viewpoint, which is a key part of many walking support devices for visually impaired people. Our main contribution is to cast this task as a multi-objective optimization problem and to exploit both geometric and appearance features for detection. Specifically, two objective functions were designed under an evolutionary optimization framework, with a line pair modeled as an individual (i.e., solution). Both objectives follow the basic characteristics of the Braille blocks and aim to clarify the boundaries and estimate the likelihood of the Braille block surface. Our proposed method was assessed on an originally collected and annotated dataset captured under real scenarios. Both quantitative and qualitative experimental results show that the proposed method can detect Braille blocks under various environments. We also provide a comprehensive comparison of the detection performance with respect to different multi-objective optimization algorithms. Full article
(This article belongs to the Section Intelligent Sensors)
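A highly simplified stand-in for the optimization framing follows: a line pair treated as an individual scored by two objectives, with a random population filtered to its Pareto (non-dominated) set. The two objective functions are placeholders, whereas the paper derives them from image edges and surface appearance and evolves the population rather than sampling it once.

```python
# Toy multi-objective setup: line-pair individuals, two objectives, Pareto filtering.
import random

def random_line_pair(width):
    """An individual: two boundary lines, each given by where it crosses the top and
    bottom image rows (x_top, x_bottom)."""
    xs = sorted(random.uniform(0, width) for _ in range(4))
    return (xs[0], xs[1]), (xs[2], xs[3])

def boundary_edge_score(pair):
    return random.random()   # placeholder: edge strength measured along both lines

def surface_score(pair):
    return random.random()   # placeholder: Braille-block appearance likelihood between the lines

def pareto_front(population, scores):
    """Keep individuals that no other individual dominates (higher is better on both)."""
    front = []
    for ind, s in zip(population, scores):
        dominated = any(o[0] >= s[0] and o[1] >= s[1] and (o[0] > s[0] or o[1] > s[1])
                        for o in scores)
        if not dominated:
            front.append(ind)
    return front

population = [random_line_pair(640) for _ in range(50)]
scores = [(boundary_edge_score(p), surface_score(p)) for p in population]
candidates = pareto_front(population, scores)
print(len(candidates))
```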

17 pages, 1790 KiB  
Article
Performance Boosting of Scale and Rotation Invariant Human Activity Recognition (HAR) with LSTM Networks Using Low Dimensional 3D Posture Data in Egocentric Coordinates
by Ibrahim Furkan Ince
Appl. Sci. 2020, 10(23), 8474; https://doi.org/10.3390/app10238474 - 27 Nov 2020
Cited by 8 | Viewed by 3471
Abstract
Human activity recognition (HAR) has been an active area in computer vision with a broad range of applications, such as education, security surveillance, and healthcare. HAR is a general time series classification problem. LSTMs are widely used for time series classification tasks. However, they work well with high-dimensional feature vectors, which reduce the processing speed of LSTM in real-time applications. Therefore, dimension reduction is required to create a low-dimensional feature space. As shown in a previous study, LSTM with dimension reduction yielded the worst performance among the compared classifiers, which are not deep learning methods. Therefore, in this paper, a novel scale- and rotation-invariant human activity recognition system, which can also work in a low-dimensional feature space, is presented. For this purpose, a Kinect depth sensor is employed to obtain skeleton joints. Since angles are used, the proposed system is already scale invariant. In order to provide rotation invariance, the body-relative direction in egocentric coordinates is calculated. The 3D vector between the right hip and the left hip is used as the horizontal axis, and its cross product with the vertical axis of the global coordinate system is taken as the depth axis of the proposed local coordinate system. Instead of using 3D joint angles, eight limbs and their corresponding 3D angles with the X, Y, and Z axes of the proposed coordinate system are compressed with several dimension-reduction methods, such as an averaging filter, the Haar wavelet transform (HWT), and the discrete cosine transform (DCT), and employed as the feature vector. Finally, the extracted features are trained and tested with a long short-term memory (LSTM) network, which is an artificial recurrent neural network (RNN) architecture. Experimental and benchmarking results indicate that the proposed framework boosts the performance of LSTM by approximately 30% accuracy in the low-dimensional feature space. Full article
(This article belongs to the Special Issue AI, Machine Learning and Deep Learning in Signal Processing)
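The rotation-invariant feature construction, a body-centred frame from the hip joints, limb angles against that frame's axes, and DCT compression, can be sketched as follows; the joint ordering, the global up axis, and the number of retained coefficients are assumptions, and the LSTM classifier is omitted.

```python
# Sketch of the egocentric-coordinate feature: hip-based local frame, limb angles with
# its axes, and DCT-based dimension reduction of the per-frame feature vector.
import numpy as np
from scipy.fft import dct

def egocentric_axes(right_hip, left_hip, global_up=np.array([0.0, 1.0, 0.0])):
    """Return unit X (hip-to-hip), Y (up), Z (depth) axes of the local frame."""
    x = left_hip - right_hip
    x = x / np.linalg.norm(x)
    z = np.cross(x, global_up)       # depth axis = cross product with the global vertical
    z = z / np.linalg.norm(z)
    y = np.cross(z, x)
    return x, y, z

def limb_angle_features(limb_vectors, axes):
    """Angles (rad) of each 3D limb vector with the local X, Y, Z axes."""
    feats = []
    for v in limb_vectors:                       # e.g., 8 limbs -> 24 angles
        v = v / np.linalg.norm(v)
        feats.extend(np.arccos(np.clip([v @ a for a in axes], -1.0, 1.0)))
    return np.array(feats)

def compress(features, keep=8):
    """Reduce dimensionality by keeping the first `keep` DCT coefficients."""
    return dct(features, norm="ortho")[:keep]

axes = egocentric_axes(np.array([0.1, 0.9, 0.0]), np.array([-0.1, 0.9, 0.0]))
frame_feats = limb_angle_features(np.random.randn(8, 3), axes)   # 24 angles per frame
print(compress(frame_feats).shape)                               # (8,) -> fed to the LSTM
```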

18 pages, 5300 KiB  
Article
Toward Joint Acquisition-Annotation of Images with Egocentric Devices for a Lower-Cost Machine Learning Application to Apple Detection
by Salma Samiei, Pejman Rasti, Paul Richard, Gilles Galopin and David Rousseau
Sensors 2020, 20(15), 4173; https://doi.org/10.3390/s20154173 - 27 Jul 2020
Cited by 8 | Viewed by 4086
Abstract
Since most computer vision approaches are now driven by machine learning, the current bottleneck is the annotation of images. This time-consuming task is usually performed manually after the acquisition of images. In this article, we assess the value of various egocentric vision approaches for performing joint acquisition and automatic image annotation rather than the conventional two-step process of acquisition followed by manual annotation. This approach is illustrated with apple detection in challenging field conditions. We demonstrate the possibility of high performance in automatic apple segmentation (Dice 0.85), apple counting (an 88% probability of good detection and a 0.09 true-negative rate), and apple localization (a shift error of less than 3 pixels) with eye-tracking systems. This is obtained by simply applying the areas of interest captured by the egocentric devices to standard, non-supervised image segmentation. We especially stress the time savings of using such eye-tracking devices on head-mounted systems to jointly perform image acquisition and automatic annotation. A gain in time of over 10-fold compared with classical image acquisition followed by manual image annotation is demonstrated. Full article
(This article belongs to the Special Issue Low-Cost Sensors and Vectors for Plant Phenotyping)
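A toy illustration of the joint acquisition-annotation idea follows: the gaze fixation recorded by the head-mounted eye tracker seeds an unsupervised segmentation (OpenCV flood fill here), so an apple mask is produced without manual labelling. The tolerance and the flood-fill choice are illustrative, not the paper's exact segmentation pipeline.

```python
# Gaze-seeded, non-supervised segmentation: the fixation point becomes the seed.
import cv2
import numpy as np

def gaze_seeded_mask(image_bgr, gaze_xy, tol=12):
    """image_bgr: HxWx3 uint8 frame; gaze_xy: (x, y) fixation in pixel coordinates."""
    h, w = image_bgr.shape[:2]
    seed = (int(gaze_xy[0]), int(gaze_xy[1]))
    mask = np.zeros((h + 2, w + 2), dtype=np.uint8)     # flood fill needs a padded mask
    flags = 4 | cv2.FLOODFILL_MASK_ONLY | (255 << 8)    # write 255 into mask, 4-connectivity
    cv2.floodFill(image_bgr.copy(), mask, seed, 0,
                  loDiff=(tol, tol, tol), upDiff=(tol, tol, tol), flags=flags)
    return mask[1:-1, 1:-1]                              # crop back to image size

img = np.full((240, 320, 3), 40, dtype=np.uint8)
cv2.circle(img, (160, 120), 30, (0, 0, 200), -1)         # synthetic "apple" blob
apple_mask = gaze_seeded_mask(img, (160, 120))
print(apple_mask.sum() / 255)                            # ~ blob area in pixels
```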

22 pages, 3171 KiB  
Article
Fusing Object Information and Inertial Data for Activity Recognition
by Alexander Diete and Heiner Stuckenschmidt
Sensors 2019, 19(19), 4119; https://doi.org/10.3390/s19194119 - 23 Sep 2019
Cited by 8 | Viewed by 3530
Abstract
In the field of pervasive computing, wearable devices have been widely used for recognizing human activities. One important area in this research is the recognition of activities of daily living, where inertial sensors and interaction sensors (like RFID tags with scanners) are especially popular choices as data sources. Using interaction sensors, however, has one drawback: they may not differentiate between proper interaction and simple touching of an object. A positive signal from an interaction sensor is not necessarily caused by a performed activity, e.g., when an object is only touched but no interaction occurs afterwards. There are, however, many scenarios, like medicine intake, that rely heavily on correctly recognized activities. In our work, we aim to address this limitation and present a multimodal egocentric-based activity recognition approach. Our solution relies on object detection that recognizes activity-critical objects in a frame. As it is infeasible to always expect a high-quality camera view, we enrich the vision features with inertial sensor data that monitors the users’ arm movement. This way we try to overcome the drawbacks of each respective sensor. We present our results of combining inertial and video features to recognize human activities on different types of scenarios, where we achieve an F1-measure of up to 79.6%. Full article
(This article belongs to the Special Issue Multi-Sensor Fusion in Body Sensor Networks)
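A simple late-fusion sketch in the spirit of the approach is shown below: per time window, a multi-hot vector of detected activity-critical objects is concatenated with summary statistics of the inertial signal, and a standard classifier is trained on the fused vector. The object vocabulary, window statistics, and classifier are assumptions, not the paper's exact features or model.

```python
# Fuse per-window object-detection evidence with inertial statistics, then classify.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

OBJECTS = ["cup", "pill_box", "phone", "kettle"]          # illustrative object vocabulary

def vision_features(detected_objects):
    """Multi-hot encoding of objects detected in the window's frames."""
    return np.array([1.0 if o in detected_objects else 0.0 for o in OBJECTS])

def inertial_features(accel_window):
    """accel_window: (T, 3) accelerometer samples -> mean and std per axis."""
    return np.concatenate([accel_window.mean(axis=0), accel_window.std(axis=0)])

def fuse(detected_objects, accel_window):
    return np.concatenate([vision_features(detected_objects), inertial_features(accel_window)])

# Hypothetical training windows with activity labels.
X = np.stack([fuse({"cup"}, np.random.randn(100, 3)),
              fuse({"pill_box"}, np.random.randn(100, 3)),
              fuse({"phone"}, np.random.randn(100, 3)),
              fuse({"cup", "kettle"}, np.random.randn(100, 3))])
y = np.array([0, 1, 2, 0])
clf = RandomForestClassifier(n_estimators=50).fit(X, y)
```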
