RGB-D Cameras and Brain–Computer Interfaces for Human Activity Recognition: An Overview
Abstract
1. Introduction
2. Background: Approaches for HAR
3. HAR Based on RGB-D Cameras
3.1. ML Techniques
Dataset | Data | Activities |
---|---|---|
MSRC-12 [47] | 6244 video samples with skeletal joint positions | 12 activities from 30 subjects |
NTU-RGB+ D60 [60] | 56,880 video samples, including RGB, infrared, depth, and skeleton data | 60 ADLs |
NTU-RGB+ D120 [61] | Expanded version of NTU-RGB+ D60, with 114,480 video samples | 120 ADLs |
MA-52 [63] | 22,422 video samples | 52 micro-actions from 205 subjects |
CAD-120 [65] | 120 video samples | 10 high-level activities, 10 sub-activities, and 12 object affordances from 4 subjects |
PKU-MMD [74] | 1076 video sequences, including RGB, depth, infrared, and skeleton data | 51 ADLs by 66 subjects |
Berkeley-MHAD [75] | 660 action sequences recorded from 4 cameras, 2 Kinect cameras, 6 accelerometers, and 4 microphones | 11 ADLs by 12 subjects with 5 repetitions for each ADL |
HWU-USP [76] | Recordings from ambient sensors, inertial units, and RGB videos for a total of 144 recording sessions | 9 ADLs by 16 subjects |
Northwestern-UCLA Multiview [77] | RGB, depth, and skeleton data recorded by 3 Kinect cameras | 10 ADLs by 10 subjects |
UTD-MHAD [78] | RGB, depth, skeleton data recorded from 1 Kinect camera, and inertial data from a single inertial sensor | 27 ADLs by 8 subjects, with 4 repetitions for each ADL |
Toyota Smarthome [79] | 16,115 video samples, including RGB, depth, and skeleton data recorded from 7 Kinect cameras | 31 ADLs from 18 subjects |
UCF101 [68] | 13,320 video samples | 101 ADLs |
KTH [69] | 2391 video samples from 25 people | 6 ADLs in four different contexts |
WEIZMANN [70] | 90 low resolution video sequences | 10 ADLs by 9 subjects |
IXMAS [71] | 2880 RGB video sequences | 15 ADLs by 5 subjects, with each ADL repeated 3 times |
Kinetics-700-2020 [62] | Collection of a series of datasets containing up to 700 video samples for each ADL | Up to 700 ADLs, depending on the specific dataset considered |
Kinetics-Skeleton [80] | Derived from the Kinetics dataset, using OpenPose to extract skeleton key points, and comprising 300,000 videos | 400 ADLs |
HMDB51 [81] | 6776 annotated video samples from various sources such as movies and YouTube | 51 ADLs |
STH-STH V1 [82] | 108,499 video samples | 174 ADLs from 1133 crowdsource workers |
UCF50 [83] | 6681 video samples | 50 ADLs |
YouTube Action [84] | 1600 video samples from YouTube | 11 ADLs |
MSR-3D [85] | 320 video samples, including skeleton data, RGB, and depth images | 16 ADLs |
JHMDB [86] | 928 RGB videos, for a total of 31,838 frames | 21 ADLs |
3.2. Uni-Modal vs. Multi-Modal Sensing
4. HAR Based on BCIs
Human-In-The-Loop Approaches
5. RGB-D and BCI Integration: Opportunities, Challenges, and Open Issues
5.1. RGB-D and BCI Fusion Pipeline
5.2. Opportunities of RGB-D and BCI Fusion
5.3. Challenges in RGB-D and BCI Integration
5.4. Open Issues in RGB-D and BCI Integration
- Lightweight multi-modal fusion algorithms: There is a pressing need for more efficient data fusion and ML/DL algorithms that can handle multi-stream RGB-D and EEG data in real time without excessive computational load (a minimal late-fusion sketch is given after this list). This includes designing optimized and efficient feature extraction methodologies [155] that account for the heterogeneous nature of the input data, i.e., images and biosignal time series. Adopting multi-domain features and advanced methodologies for feature selection and information fusion [156] are attractive options for streamlining the feature extraction pipeline and reducing the computational burden of this step. It is worth noting, however, that RGB-D images absorb most of the computational resources during data processing, so further gains can be obtained through techniques for compressing images and removing unnecessary or redundant information [99]. There is also a need to explore computational solutions, optimized neural networks, and advanced signal processing techniques that allow the system to run continuously in a home environment (potentially on embedded hardware) without sacrificing accuracy. This can be achieved by leveraging tiny learning models specifically designed for mobile or low-resource devices [157] and by relying on the growing computational power of newer devices. Another option is to offload part of the computational demand to the cloud or shared services, or to adopt edge computing solutions within an optimized framework [158].
- Robustness in unconstrained environments: To be practical in daily living, HAR systems must be resilient to the messy, unpredictable nature of real homes. Future studies should emphasize improving the robustness and accuracy of activity recognition under real-world conditions, such as varying lighting, background clutter, the presence of multiple subjects, or user movement, as well as coping with EEG noise from muscle activity or electrical interference. The latter can be addressed with recently proposed lightweight yet robust algorithmic solutions [159]. This may involve creating large and diverse datasets for training, using adaptive algorithms that can learn from a user's routine, or integrating additional context (e.g., time of day, habitual patterns) to reduce false detections. This point is one of the major obstacles to deploying image-based, hybrid HAR solutions in real environments. Indeed, as outlined in previous sections, in practice it is not possible to enforce the presence of a single user in a given environment, and different users are likely to perform different activities that must be recognized. To address the complexity of multi-person scenarios, using multiple cameras is a viable solution [103], whereas recognizing multiple activities performed at the same time requires tailored processing procedures [104]. Real environments, with their complex backgrounds, often lead to occlusion problems, which can be handled by relying on multiple-angle views of the same scene [160]. Variable lighting conditions, which can heavily affect HAR outcomes, can also be handled through specific post-processing procedures, nowadays also tailored to limited computational resources, thus reducing their impact on the demands of the overall recognition architecture [161].
- More user-friendly BCI devices: Engineering advances are needed to design BCIs that are comfortable, unobtrusive, and easy to operate for non-expert users. This could mean wireless, miniaturized EEG systems with dry electrodes (avoiding lengthy setup), longer battery life, and auto-calibration features. Improving signal quality through better sensors or artifact-filtering algorithms will also enhance reliability. By making the BCI hardware invisible and hassle-free, users are more likely to wear it regularly, which is crucial for continuous HAR in AAL. In particular, an attractive solution for the envisioned framework is the in-ear EEG-based BCI [162], which records brain activity through probes placed on the ear and in the ear canal. Even though the EEG acquired with this sensing technology does not fully match the characteristics of scalp EEG, in-ear technology is a suitable compromise for developing lightweight, unobtrusive, and comfortable BCI applications. Furthermore, in-ear EEG systems have already been integrated into larger body-area networks for health monitoring [163], making the combination of this technology with RGB-D systems a concrete candidate for RGB-D and BCI fusion.
- User acceptance and usability studies: Going forward, researchers should extensively study how target user groups (e.g., older adults, people with disabilities) perceive and interact with these integrated systems. Early involvement of end-users through participatory design can identify usability pain points and preferences, ensuring the solutions truly meet user needs. In fact, further investigations into user acceptance of BCI-augmented AAL systems are needed [164], as prior work has mainly examined acceptance of camera-based monitoring alone. Understanding the factors that influence trust, such as system transparency, feedback provided to the user, and perceived benefits, will be vital. Strategies to improve acceptance might include personalized adaptation (tuning system responses to individual comfort levels), training and onboarding programs to help users get comfortable with BCIs, and privacy-respecting options (for example, allowing users to easily pause or control data collection). User acceptance and usability could be further enhanced by developing user-independent architectures for BCI applications, avoiding the need to retrain the model for new, unseen individuals. Promising solutions for this objective are already available and also address the cross-session issue [165] (a simplified alignment sketch is given after this list). Applying this kind of processing to EEG data would also lower the computational demand and time needed for calibration, since data from other users or from different sessions can be used in the initialization phase, further improving the usability and acceptability of BCIs in real-life scenarios.
- Privacy protection and ethical frameworks: In tandem with technical improvements, there must be a concerted effort to establish comprehensive ethical and legal guidelines for deploying these technologies in private homes. Future work should engage ethicists, legal experts, and policymakers to develop standards for data handling that ensure strict privacy, security, and user consent. This includes determining how neural and video data should be stored and used, and setting limits to prevent any form of data abuse [153]. Researchers have noted that mitigating privacy concerns through personalized, transparent system design is essential to overcome user barriers and build long-term trust, and that properly structured frameworks will be crucial for responsible innovation in this domain [154]. By embedding ethical considerations into the design (such as on-device data processing to keep raw data private, or giving users clear control over their information), developers can foster greater user confidence and societal acceptance of these advanced AAL solutions [166]. It is worth noting that skeleton data extracted from RGB-D images are convenient for addressing privacy issues, since this representation discards user identity information and renders the background scene unidentifiable. However, as discussed in previous sections, multi-modal approaches that combine skeleton and RGB-D modalities provide significant advantages in terms of HAR results. Multi-modal schemes have also proven less vulnerable to adversarial attacks, yielding more robust applications in terms of safety and privacy [87,167]. Therefore, balancing ethical concerns and technical outcomes remains an open issue that should be evaluated depending on the settings, environments, target users, and desired results of each specific HAR application.
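To make the lightweight-fusion point above more concrete, the following is a minimal, hypothetical late-fusion sketch in PyTorch: two small branches project pre-extracted RGB-D features (e.g., skeleton or CNN embeddings) and EEG features (e.g., band powers) into a shared space before a joint classifier. All dimensions, layer sizes, and names are illustrative assumptions, not a reference implementation from the surveyed works.

```python
import torch
import torch.nn as nn

class LateFusionHAR(nn.Module):
    """Hypothetical late-fusion classifier for pre-extracted RGB-D and EEG features."""

    def __init__(self, rgbd_dim=256, eeg_dim=64, hidden=128, n_classes=10):
        super().__init__()
        # Small per-modality branches keep the fusion step lightweight.
        self.rgbd_branch = nn.Sequential(nn.Linear(rgbd_dim, hidden), nn.ReLU())
        self.eeg_branch = nn.Sequential(nn.Linear(eeg_dim, hidden), nn.ReLU())
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, rgbd_feat, eeg_feat):
        # Concatenate the two embeddings and classify the activity window.
        fused = torch.cat([self.rgbd_branch(rgbd_feat), self.eeg_branch(eeg_feat)], dim=-1)
        return self.classifier(fused)

# Example: a batch of 8 activity windows with pre-computed features per modality.
model = LateFusionHAR()
logits = model(torch.randn(8, 256), torch.randn(8, 64))  # shape: (8, 10)
```

Keeping the per-modality branches shallow and fusing fixed-length feature vectors, rather than raw video and EEG streams, is one way to bound the computational load on embedded or edge hardware.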
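Regarding user-independent and cross-session operation, covariance-based re-centering is one family of techniques in the spirit of the Riemannian transfer framework of [165]. The snippet below is a simplified sketch that whitens each session's EEG trials with the inverse square root of that session's mean covariance; the use of the arithmetic (Euclidean) mean and the assumed preprocessing are illustrative choices, whereas [165] re-centers with the Riemannian mean.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def align_session(trials: np.ndarray) -> np.ndarray:
    """Re-center the EEG trials of one session/user.

    trials: array of shape (n_trials, n_channels, n_samples),
            assumed band-pass filtered and epoch-wise mean-removed.
    """
    # Per-trial sample covariance matrices (channels x channels).
    covs = np.stack([t @ t.T / t.shape[1] for t in trials])
    # Session-level reference covariance (arithmetic mean for simplicity).
    ref = covs.mean(axis=0)
    ref_inv_sqrt = fractional_matrix_power(ref, -0.5)
    # Whitening-like transform makes sessions and users more comparable.
    return np.stack([ref_inv_sqrt @ t for t in trials])

# Example: 40 trials, 8 channels, 2 s epochs at 250 Hz.
aligned = align_session(np.random.randn(40, 8, 500))
```

Because the reference matrix is estimated from unlabeled trials, such alignment can be performed at deployment time, shortening per-user calibration before a pre-trained classifier is applied.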
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Colantonio, S.; Aleksic, S.; Calleja Agius, J.; Camilleri, K.P.; Čartolovni, A.; Climent-Pérez, P.; Cristina, S.; Despotovic, V.; Ekenel, H.K.; Erakin, M.E.; et al. A Historical View of Active Assisted Living. In Privacy-Aware Monitoring for Assisted Living: Ethical, Legal, and Technological Aspects of Audio- and Video-Based AAL Solutions; Salah, A.A., Colonna, L., Florez-Revuelta, F., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2025; pp. 3–44. [Google Scholar] [CrossRef]
- Eurostat. Population Structure and Ageing. 2025. Available online: https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Population_structure_and_ageing (accessed on 10 July 2025).
- Periša, M.; Teskera, P.; Cvitić, I.; Grgurević, I. Empowering People with Disabilities in Smart Homes Using Predictive Informing. Sensors 2025, 25, 284. [Google Scholar] [CrossRef]
- Cicirelli, G.; Marani, R.; Petitti, A.; Milella, A.; D’Orazio, T. Ambient Assisted Living: A Review of Technologies, Methodologies and Future Perspectives for Healthy Aging of Population. Sensors 2021, 21, 3549. [Google Scholar] [CrossRef] [PubMed]
- Choudhury, N.A.; Soni, B. In-depth analysis of design & development for sensor-based human activity recognition system. Multimed. Tools Appl. 2023, 83, 73233–73272. [Google Scholar] [CrossRef]
- Newaz, N.T.; Hanada, E. The Methods of Fall Detection: A Literature Review. Sensors 2023, 23, 5212. [Google Scholar] [CrossRef] [PubMed]
- Seo, K.J.; Lee, J.; Cho, J.E.; Kim, H.; Kim, J.H. Gait Environment Recognition Using Biomechanical and Physiological Signals with Feed-Forward Neural Network: A Pilot Study. Sensors 2025, 25, 4302. [Google Scholar] [CrossRef]
- Iadarola, G.; Mengarelli, A.; Spinsante, S. Classification of Physical Fatigue on Heart Rate by Wearable Devices. In Proceedings of the 2025 IEEE Medical Measurements & Applications (MeMeA), Chania, Greece, 28–30 May 2025; pp. 1–6. [Google Scholar] [CrossRef]
- Pavaiyarkarasi, R.; Paulraj, D. A Concept to Reality: New Horizons in Alzheimer’s Assistive Devices. In Proceedings of the 2025 Third International Conference on Augmented Intelligence and Sustainable Systems (ICAISS), Trichy, India, 21–23 May 2025; pp. 1264–1271. [Google Scholar] [CrossRef]
- Iadarola, G.; Mengarelli, A.; Crippa, P.; Fioretti, S.; Spinsante, S. A Review on Assisted Living Using Wearable Devices. Sensors 2024, 24, 7439. [Google Scholar] [CrossRef]
- Salem, Z.; Weiss, A.P. Improved Spatiotemporal Framework for Human Activity Recognition in Smart Environment. Sensors 2023, 23, 132. [Google Scholar] [CrossRef]
- Ceron, J.D.; López, D.M.; Kluge, F.; Eskofier, B.M. Framework for Simultaneous Indoor Localization, Mapping, and Human Activity Recognition in Ambient Assisted Living Scenarios. Sensors 2022, 22, 3364. [Google Scholar] [CrossRef]
- Chimamiwa, G.; Giaretta, A.; Alirezaie, M.; Pecora, F.; Loutfi, A. Are Smart Homes Adequate for Older Adults with Dementia? Sensors 2022, 22, 4254. [Google Scholar] [CrossRef]
- Vasylkiv, Y.; Neshati, A.; Sakamoto, Y.; Gomez, R.; Nakamura, K.; Irani, P. Smart home interactions for people with reduced hand mobility using subtle EMG-signal gestures. In Improving Usability, Safety and Patient Outcomes with Health Information Technology; IOS Press: London, UK, 2019; pp. 436–443. [Google Scholar] [CrossRef]
- De Venuto, D.; Annese, V.F.; Sangiovanni-Vincentelli, A.L. The ultimate IoT application: A cyber-physical system for ambient assisted living. In Proceedings of the 2016 IEEE International Symposium on Circuits and Systems (ISCAS), Montréal, QC, Canada, 22–25 May 2016; pp. 2042–2045. [Google Scholar] [CrossRef]
- Scattolini, M.; Tigrini, A.; Verdini, F.; Iadarola, G.; Spinsante, S.; Fioretti, S.; Burattini, L.; Mengarelli, A. Leveraging inertial information from a single IMU for human daily activity recognition. In Proceedings of the 2024 IEEE International Symposium on Medical Measurements and Applications (MeMeA), Eindhoven, The Netherlands, 26–28 June 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Iadarola, G.; Scoccia, C.; Spinsante, S.; Rossi, L.; Monteriù, A. An Overview on Current Technologies for Assisted Living. In Proceedings of the 2024 IEEE International Workshop on Metrology for Living Environment (MetroLivEnv), Chania, Greece, 12–14 June 2024; pp. 190–195. [Google Scholar] [CrossRef]
- Villarroel F, M.J.; Villarroel G, C.H. Wireless smart environment in Ambient Assisted Living for people that suffer from cognitive disabilities. Ingeniare. Rev. Chil. Ingeniería 2014, 22, 158–168. [Google Scholar] [CrossRef]
- Caiado, F.; Ukolov, A. The history, current state and future possibilities of the non-invasive brain computer interfaces. Med. Nov. Technol. Devices 2025, 25, 100353. [Google Scholar] [CrossRef]
- Kumar Gouda, S.; Choudhry, A.; Satpathy, S.P.; Shukla, K.M.; Dash, A.K.; Pasayat, A.K. Integration of EEG-based BCI technology in IoT enabled smart home environment: An in-depth comparative analysis on human-computer interaction techniques. Expert Syst. Appl. 2025, 294, 128730. [Google Scholar] [CrossRef]
- Iadarola, G.; Cosoli, G.; Scalise, L.; Spinsante, S. Unsupervised Learning of Physical Effort: Proposal of a simple metric for wearable devices. In Proceedings of the 2025 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Chemnitz, Germany, 19–22 May 2025; pp. 1–6. [Google Scholar] [CrossRef]
- Kang, H.; Lee, C.; Kang, S.J. A smart device for non-invasive ADL estimation through multi-environmental sensor fusion. Sci. Rep. 2023, 13, 17246. [Google Scholar] [CrossRef] [PubMed]
- Shaikh, M.B.; Chai, D. Rgb-d data-based action recognition: A review. Sensors 2021, 21, 4246. [Google Scholar] [CrossRef] [PubMed]
- Leelaarporn, P.; Wachiraphan, P.; Kaewlee, T.; Udsa, T.; Chaisaen, R.; Choksatchawathi, T.; Laosirirat, R.; Lakhan, P.; Natnithikarat, P.; Thanontip, K.; et al. Sensor-Driven Achieving of Smart Living: A Review. IEEE Sens. J. 2021, 21, 10369–10391. [Google Scholar] [CrossRef]
- Kristoffersson, A.; Lindén, M. A systematic review of wearable sensors for monitoring physical activity. Sensors 2022, 22, 573. [Google Scholar] [CrossRef]
- Oyibo, K.; Wang, K.; Morita, P.P. Using Smart Home Technologies to Promote Physical Activity Among the General and Aging Populations: Scoping Review. J. Med. Internet Res. 2023, 25, e41942. [Google Scholar] [CrossRef]
- Al Farid, F.; Bari, A.; Miah, A.S.M.; Mansor, S.; Uddin, J.; Kumaresan, S.P. A Structured and Methodological Review on Multi-View Human Activity Recognition for Ambient Assisted Living. J. Imaging 2025, 11, 182. [Google Scholar] [CrossRef]
- Qi, W.; Xu, X.; Qian, K.; Schuller, B.W.; Fortino, G.; Aliverti, A. A Review of AIoT-Based Human Activity Recognition: From Application to Technique. IEEE J. Biomed. Health Inform. 2025, 29, 2425–2438. [Google Scholar] [CrossRef]
- Karim, M.; Khalid, S.; Lee, S.; Almutairi, S.; Namoun, A.; Abohashrh, M. Next Generation Human Action Recognition: A Comprehensive Review of State-of-the-Art Signal Processing Techniques. IEEE Access 2025, 13, 135609–135633. [Google Scholar] [CrossRef]
- Wang, C.; Jiang, W.; Yang, K.; Yu, D.; Newn, J.; Sarsenbayeva, Z.; Goncalves, J.; Kostakos, V. Electronic monitoring systems for hand hygiene: Systematic review of technology. J. Med. Internet Res. 2021, 23, e27880. [Google Scholar] [CrossRef]
- Chun, K.S.; Sanders, A.B.; Adaimi, R.; Streeper, N.; Conroy, D.E.; Thomaz, E. Towards a generalizable method for detecting fluid intake with wrist-mounted sensors and adaptive segmentation. In Proceedings of the 24th International Conference on Intelligent User Interfaces, New York, NY, USA, 17–20 March 2019; pp. 80–85. [Google Scholar] [CrossRef]
- Sabry, F.; Eltaras, T.; Labda, W.; Hamza, F.; Alzoubi, K.; Malluhi, Q. Towards on-device dehydration monitoring using machine learning from wearable device’s data. Sensors 2022, 22, 1887. [Google Scholar] [CrossRef] [PubMed]
- Moccia, S.; Solbiati, S.; Khornegah, M.; Bossi, F.F.; Caiani, E.G. Automated classification of hand gestures using a wristband and machine learning for possible application in pill intake monitoring. Comput. Methods Programs Biomed. 2022, 219, 106753. [Google Scholar] [CrossRef] [PubMed]
- JCGM 100:2008; GUM 1995 with Minor Corrections. Evaluation of Measurement Data—Guide to the Expression of Uncertainty in Measurement. International Bureau of Weights and Measures (BIPM): Sèvres, France, 2008.
- Offermann, J.; Wilkowska, W.; Poli, A.; Spinsante, S.; Ziefle, M. Acceptance and Preferences of Using Ambient Sensor-Based Lifelogging Technologies in Home Environments. Sensors 2021, 21, 8297. [Google Scholar] [CrossRef] [PubMed]
- Ankalaki, S. Simple to Complex, Single to Concurrent Sensor-Based Human Activity Recognition: Perception and Open Challenges. IEEE Access 2024, 12, 93450–93486. [Google Scholar] [CrossRef]
- Vrigkas, M.; Nikou, C.; Kakadiaris, I.A. A review of human activity recognition methods. Front. Robot. AI 2015, 2, 28. [Google Scholar] [CrossRef]
- Yeung, L.F.; Yang, Z.; Cheng, K.C.; Du, D.; Tong, R.K. Effects of camera viewing angles on tracking kinematic gait patterns using Azure Kinect, Kinect v2 and Orbbec Astra Pro v2. Gait Posture 2021, 87, 19–26. [Google Scholar] [CrossRef]
- Carfagni, M.; Furferi, R.; Governi, L.; Santarelli, C.; Servi, M.; Uccheddu, F.; Volpe, Y. Metrological and Critical Characterization of the Intel D415 Stereo Depth Camera. Sensors 2019, 19, 489. [Google Scholar] [CrossRef]
- Tölgyessy, M.; Dekan, M.; Chovanec, Ľ.; Hubinský, P. Evaluation of the Azure Kinect and Its Comparison to Kinect V1 and Kinect V2. Sensors 2021, 21, 413. [Google Scholar] [CrossRef]
- Kurillo, G.; Hemingway, E.; Cheng, M.L.; Cheng, L. Evaluating the Accuracy of the Azure Kinect and Kinect v2. Sensors 2022, 22, 2469. [Google Scholar] [CrossRef]
- Burger, L.; Burger, L.; Sharan, L.; Karl, R.; Wang, C.; Karck, M.; De Simone, R.; Wolf, I.; Romano, G.; Engelhardt, S. Comparative evaluation of three commercially available markerless depth sensors for close-range use in surgical simulation. Int. J. Comput. Assist. Radiol. Surg. 2023, 18, 1109–1118. [Google Scholar] [CrossRef]
- Büker, L.; Quinten, V.; Hackbarth, M.; Hellmers, S.; Diekmann, R.; Hein, A. How the Processing Mode Influences Azure Kinect Body Tracking Results. Sensors 2023, 23, 878. [Google Scholar] [CrossRef]
- Gonzalez-Jorge, H.; Riveiro, B.; Vazquez-Fernandez, E.; Martínez-Sánchez, J.; Arias, P. Metrological evaluation of Microsoft Kinect and Asus Xtion sensors. Measurement 2013, 46, 1800–1806. [Google Scholar] [CrossRef]
- Haider, A.; Hel-Or, H. What Can We Learn from Depth Camera Sensor Noise? Sensors 2022, 22, 5448. [Google Scholar] [CrossRef]
- OpenPR—Worldwide Public Relations. Depth Camera Market Share, Trends Analysis 2031 by Key Vendors—Texas Instruments, STMicroelectronics, PMD Technologies, Infineon, PrimeSense (Apple). 2025. Available online: https://www.openpr.com/news/3903319/latest-size-depth-camera-market-share-trends-analysis-2031# (accessed on 10 July 2025).
- Park, S.; Park, J.; Al-Masni, M.A.; Al-Antari, M.A.; Uddin, M.Z.; Kim, T.S. A depth camera-based human activity recognition via deep learning recurrent neural network for health and social care services. Procedia Comput. Sci. 2016, 100, 78–84. [Google Scholar] [CrossRef]
- Raj, R.; Kos, A. An improved human activity recognition technique based on convolutional neural network. Sci. Rep. 2023, 13, 22581. [Google Scholar] [CrossRef] [PubMed]
- Himeur, Y.; Al-Maadeed, S.; Kheddar, H.; Al-Maadeed, N.; Abualsaud, K.; Mohamed, A.; Khattab, T. Video surveillance using deep transfer learning and deep domain adaptation: Towards better generalization. Eng. Appl. Artif. Intell. 2023, 119, 105698. [Google Scholar] [CrossRef]
- Huang, X.; Cai, Z. A review of video action recognition based on 3D convolution. Comput. Electr. Eng. 2023, 108, 108713. [Google Scholar] [CrossRef]
- Maqsood, R.; Bajwa, U.I.; Saleem, G.; Raza, R.H.; Anwar, M.W. Anomaly recognition from surveillance videos using 3D convolution neural network. Multimed. Tools Appl. 2021, 80, 18693–18716. [Google Scholar] [CrossRef]
- Muhammad, K.; Ullah, H.; Obaidat, M.S.; Ullah, A.; Munir, A.; Sajjad, M.; De Albuquerque, V.H.C. AI-driven salient soccer events recognition framework for next-generation IoT-enabled environments. IEEE Internet Things J. 2021, 10, 2202–2214. [Google Scholar] [CrossRef]
- Wu, H.; Ma, X.; Li, Y. Multi-level channel attention excitation network for human action recognition in videos. Signal Process. Image Commun. 2023, 114, 116940. [Google Scholar] [CrossRef]
- Zong, M.; Wang, R.; Ma, Y.; Ji, W. Spatial and temporal saliency based four-stream network with multi-task learning for action recognition. Appl. Soft Comput. 2023, 132, 109884. [Google Scholar] [CrossRef]
- Hussain, A.; Khan, S.U.; Khan, N.; Shabaz, M.; Baik, S.W. AI-driven behavior biometrics framework for robust human activity recognition in surveillance systems. Eng. Appl. Artif. Intell. 2024, 127, 107218. [Google Scholar] [CrossRef]
- Mitsuzumi, Y.; Irie, G.; Kimura, A.; Nakazawa, A. Phase randomization: A data augmentation for domain adaptation in human action recognition. Pattern Recognit. 2024, 146, 110051. [Google Scholar] [CrossRef]
- Yin, Y.; Yang, Z.; Hu, H.; Wu, X. Universal multi-source domain adaptation for image classification. Pattern Recognit. 2022, 121, 108238. [Google Scholar] [CrossRef]
- Karthika, S.; Jane, Y.N.; Nehemiah, H.K. Spatio temporal 3D skeleton kinematic joint point classification model for human activity recognition. J. Vis. Commun. Image Represent. 2025, 110, 104471. [Google Scholar] [CrossRef]
- Jo, B.; Kim, S. Comparative analysis of OpenPose, PoseNet, and MoveNet models for pose estimation in mobile devices. Trait. Signal 2022, 39, 119. [Google Scholar] [CrossRef]
- Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. NTU RGB+D: A large scale dataset for 3D human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar] [CrossRef]
- Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef]
- Smaira, L.; Carreira, J.; Noland, E.; Clancy, E.; Wu, A.; Zisserman, A. A short note on the kinetics-700-2020 human action dataset. arXiv 2020, arXiv:2010.10864. [Google Scholar] [CrossRef]
- Guo, D.; Li, K.; Hu, B.; Zhang, Y.; Wang, M. Benchmarking micro-action recognition: Dataset, methods, and applications. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6238–6252. [Google Scholar] [CrossRef]
- Li, Q.; Xie, X.; Zhang, J.; Shi, G. Recognizing human-object interactions in videos with the supervision of natural language. Neural Netw. 2025, 190, 107606. [Google Scholar] [CrossRef]
- Koppula, H.S.; Gupta, R.; Saxena, A. Learning human activities and object affordances from rgb-d videos. Int. J. Robot. Res. 2013, 32, 951–970. [Google Scholar] [CrossRef]
- Materzynska, J.; Xiao, T.; Herzig, R.; Xu, H.; Wang, X.; Darrell, T. Something-else: Compositional action recognition with spatial-temporal interaction networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1049–1059. [Google Scholar] [CrossRef]
- Elnady, M.; Abdelmunim, H.E. A novel YOLO LSTM approach for enhanced human action recognition in video sequences. Sci. Rep. 2025, 15, 17036. [Google Scholar] [CrossRef] [PubMed]
- Soomro, K.; Zamir, A.R.; Shah, M. A dataset of 101 human action classes from videos in the wild. arXiv 2012, arXiv:1212.0402. [Google Scholar] [CrossRef]
- Schuldt, C.; Laptev, I.; Caputo, B. Recognizing human actions: A local SVM approach. In Proceedings of the 17th International Conference on Pattern Recognition, ICPR 2004, Cambridge, UK, 26 August 2004; IEEE: Piscataway, NJ, USA, 2004; Volume 3, pp. 32–36. [Google Scholar] [CrossRef]
- Gorelick, L.; Blank, M.; Shechtman, E.; Irani, M.; Basri, R. Actions as space-time shapes. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 2247–2253. [Google Scholar] [CrossRef] [PubMed]
- Weinland, D.; Ronfard, R.; Boyer, E. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 2006, 104, 249–257. [Google Scholar] [CrossRef]
- Chen, D.; Chen, M.; Wu, P.; Wu, M.; Zhang, T.; Li, C. Two-stream spatio-temporal GCN-transformer networks for skeleton-based action recognition. Sci. Rep. 2025, 15, 4982. [Google Scholar] [CrossRef]
- Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.E.; Sheikh, Y. Openpose: Realtime multi-person 2d pose estimation using part affinity fields. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 172–186. [Google Scholar] [CrossRef]
- Liu, C.; Hu, Y.; Li, Y.; Song, S.; Liu, J. PKU-MMD: A large scale benchmark for skeleton-based human action understanding. In Proceedings of the Workshop on Visual Analysis in Smart and Connected Communities, Mountain View, CA, USA, 23 October 2017; pp. 1–8. [Google Scholar] [CrossRef]
- Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; Bajcsy, R. Berkeley MHAD: A comprehensive multimodal human action database. In Proceedings of the 2013 IEEE workshop on applications of computer vision (WACV), Clearwater Beach, FL, USA, 15–17 January 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 53–60. [Google Scholar] [CrossRef]
- Ranieri, C.M.; MacLeod, S.; Dragone, M.; Vargas, P.A.; Romero, R.A.F. Activity recognition for ambient assisted living with videos, inertial units and ambient sensors. Sensors 2021, 21, 768. [Google Scholar] [CrossRef]
- Wang, J.; Nie, X.; Xia, Y.; Wu, Y.; Zhu, S.C. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2649–2656. [Google Scholar] [CrossRef]
- Chen, C.; Jafari, R.; Kehtarnavaz, N. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor. In Proceedings of the 2015 IEEE International conference on image processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 168–172. [Google Scholar] [CrossRef]
- Das, S.; Dai, R.; Koperski, M.; Minciullo, L.; Garattoni, L.; Bremond, F.; Francesca, G. Toyota smarthome: Real-world activities of daily living. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 833–842. [Google Scholar] [CrossRef]
- Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar] [CrossRef]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 2556–2563. [Google Scholar] [CrossRef]
- Goyal, R.; Ebrahimi Kahou, S.; Michalski, V.; Materzynska, J.; Westphal, S.; Kim, H.; Haenel, V.; Fruend, I.; Yianilos, P.; Mueller-Freitag, M.; et al. The “something something” video database for learning and evaluating visual common sense. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5842–5850. [Google Scholar] [CrossRef]
- Reddy, K.K.; Shah, M. Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 2013, 24, 971–981. [Google Scholar] [CrossRef]
- Liu, J.; Luo, J.; Shah, M. Recognizing realistic actions from videos “in the wild”. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 1996–2003. [Google Scholar] [CrossRef]
- Wang, J.; Liu, Z.; Wu, Y.; Yuan, J. Mining actionlet ensemble for action recognition with depth cameras. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1290–1297. [Google Scholar] [CrossRef]
- Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3192–3199. [Google Scholar] [CrossRef]
- Yue, R.; Tian, Z.; Du, S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306. [Google Scholar] [CrossRef]
- Bruce, X.; Liu, Y.; Zhang, X.; Zhong, S.h.; Chan, K.C. Mmnet: A model-based multimodal network for human action recognition in rgb-d videos. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3522–3538. [Google Scholar] [CrossRef]
- Kumar, R.; Kumar, S. Multi-view multi-modal approach based on 5s-cnn and bilstm using skeleton, depth and rgb data for human activity recognition. Wirel. Pers. Commun. 2023, 130, 1141–1159. [Google Scholar] [CrossRef]
- Batool, M.; Alotaibi, M.; Alotaibi, S.R.; AlHammadi, D.A.; Jamal, M.A.; Jalal, A.; Lee, B. Multimodal Human Action Recognition Framework using an Improved CNNGRU Classifier. IEEE Access 2024, 12, 158388–158406. [Google Scholar] [CrossRef]
- Tian, Y.; Chen, W. MEMS-based human activity recognition using smartphone. In Proceedings of the 2016 35th Chinese Control Conference (CCC), Chengdu, China, 27–29 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3984–3989. [Google Scholar] [CrossRef]
- William, P.; Lanke, G.R.; Bordoloi, D.; Shrivastava, A.; Srivastavaa, A.P.; Deshmukh, S.V. Assessment of human activity recognition based on impact of feature extraction prediction accuracy. In Proceedings of the 2023 4th International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 9–11 May 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar] [CrossRef]
- Mekruksavanich, S.; Jantawong, P.; Jitpattanakul, A. Deep learning approaches for har of daily living activities using imu sensors in smart glasses. In Proceedings of the 2023 Joint International Conference on Digital Arts, Media and Technology with ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI DAMT & NCON), Phitsanulok, Thailand, 22–25 March 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 474–478. [Google Scholar] [CrossRef]
- Jantawong, P.; Hnoohom, N.; Jitpattanakul, A.; Mekruksavanich, S. A lightweight deep learning network for sensor-based human activity recognition using imu sensors of a low-power wearable device. In Proceedings of the 2021 25th International Computer Science and Engineering Conference (ICSEC), Chiang Rai, Thailand, 18–20 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 459–463. [Google Scholar] [CrossRef]
- Liu, D.; Meng, F.; Mi, J.; Ye, M.; Li, Q.; Zhang, J. SAM-Net: Semantic-assisted multimodal network for action recognition in RGB-D videos. Pattern Recognit. 2025, 168, 111725. [Google Scholar] [CrossRef]
- Song, S.; Liu, J.; Li, Y.; Guo, Z. Modality compensation network: Cross-modal adaptation for action recognition. IEEE Trans. Image Process. 2020, 29, 3957–3969. [Google Scholar] [CrossRef]
- Liu, D.; Meng, F.; Xia, Q.; Ma, Z.; Mi, J.; Gan, Y.; Ye, M.; Zhang, J. Temporal cues enhanced multimodal learning for action recognition in RGB-D videos. Neurocomputing 2024, 594, 127882. [Google Scholar] [CrossRef]
- Demir, U.; Rawat, Y.S.; Shah, M. Tinyvirat: Low-resolution video action recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 7387–7394. [Google Scholar] [CrossRef]
- Wu, C.Y.; Zaheer, M.; Hu, H.; Manmatha, R.; Smola, A.J.; Krähenbühl, P. Compressed video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6026–6035. [Google Scholar] [CrossRef]
- Fan, L.; Buch, S.; Wang, G.; Cao, R.; Zhu, Y.; Niebles, J.C.; Fei-Fei, L. Rubiksnet: Learnable 3d-shift for efficient video action recognition. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 505–521. [Google Scholar] [CrossRef]
- Liu, G.; Qian, J.; Wen, F.; Zhu, X.; Ying, R.; Liu, P. Action recognition based on 3d skeleton and rgb frame fusion. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 258–264. [Google Scholar] [CrossRef]
- Kim, S.; Yun, K.; Park, J.; Choi, J.Y. Skeleton-based action recognition of people handling objects. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 61–70. [Google Scholar] [CrossRef]
- Phang, J.T.S.; Lim, K.H. Real-time multi-camera multi-person action recognition using pose estimation. In Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, Da Lat, Vietnam, 25–28 January 2019; pp. 175–180. [Google Scholar] [CrossRef]
- Gilbert, A.; Illingworth, J.; Bowden, R. Fast realistic multi-action recognition using mined dense spatio-temporal features. In Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 925–931. [Google Scholar] [CrossRef]
- Angelini, F.; Fu, Z.; Long, Y.; Shao, L.; Naqvi, S.M. 2D pose-based real-time human action recognition with occlusion-handling. IEEE Trans. Multimed. 2019, 22, 1433–1446. [Google Scholar] [CrossRef]
- Papadopoulos, G.T.; Daras, P. Human action recognition using 3d reconstruction data. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 1807–1823. [Google Scholar] [CrossRef]
- Huynh-The, T.; Hua, C.H.; Kim, D.S. Encoding pose features to images with data augmentation for 3-D action recognition. IEEE Trans. Ind. Inform. 2019, 16, 3100–3111. [Google Scholar] [CrossRef]
- Su, K.; Liu, X.; Shlizerman, E. Predict & cluster: Unsupervised skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9631–9640. [Google Scholar] [CrossRef]
- Hochberg, L.; Bacher, D.; Jarosiewicz, B.; Masse, N.; Simeral, J.; Vogel, J.; Haddadin, S.; Liu, J.; Cash, S.; van der Smagt, P.; et al. Reach and grasp by people with tetraplegia using a neurally controlled robotic arm. Nature 2012, 485, 372–375. [Google Scholar] [CrossRef]
- Ang, K.; Chua, K.; Phua, K.S.; Wang, C.; Chin, Z.; Kuah, C.; Low, W.; Guan, C. A Randomized Controlled Trial of EEG-Based Motor Imagery Brain-Computer Interface Robotic Rehabilitation for Stroke. Clin. EEG Neurosci. Off. J. EEG Clin. Neurosci. Soc. (ENCS) 2015, 46, 310–320. [Google Scholar] [CrossRef] [PubMed]
- Nchekwube, D.; Iarlori, S.; Monteriù, A. An assistive robot in healthcare scenarios requiring monitoring and rules regulation: Exploring Pepper use case. In Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Istanbul, Turkey, 5–8 December 2023; IEEE: Piscataway, NJ, USA, 2023; Volume 12, pp. 4812–4819. [Google Scholar] [CrossRef]
- Wolpaw, J.R.; Birbaumer, N.; McFarland, D.J.; Pfurtscheller, G.; Vaughan, T.M. Brain–computer interfaces for communication and control. Clin. Neurophysiol. 2002, 113, 767–791. [Google Scholar] [CrossRef] [PubMed]
- Omer, K.; Vella, F.; Ferracuti, F.; Freddi, A.; Iarlori, S.; Monteriù, A. Mental Fatigue Evaluation for Passive and Active BCI Methods for Wheelchair-Robot During Human-in-the-Loop Control. In Proceedings of the 2023 IEEE International Conference on Metrology for eXtended Reality, Artificial Intelligence and Neural Engineering (MetroXRAINE), Milan, Italy, 25–27 December 2023; IEEE: Piscataway, NJ, USA, 2023; Volume 10, pp. 787–792. [Google Scholar] [CrossRef]
- Moaveninejad, S.; D’Onofrio, V.; Tecchio, F.; Ferracuti, F.; Iarlori, S.; Monteriù, A.; Porcaro, C. Fractal Dimension as a discriminative feature for high accuracy classification in motor imagery EEG-based brain-computer interface. Comput. Methods Programs Biomed. 2024, 244, 107944. [Google Scholar] [CrossRef] [PubMed]
- Wolpaw, J.; Wolpaw, E.W. Brain Computer Interfaces: Principles and Practice; Oxford University Press: Oxford, UK, 2012. [Google Scholar] [CrossRef]
- Streitz, N.A. From Human–Computer Interaction to Human–Environment Interaction: Ambient Intelligence and the Disappearing Computer. In Proceedings of the Universal Access in Ambient Intelligence Environments, Salzburg, Austria, 28 January 2007; Stephanidis, C., Pieper, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 3–13. [Google Scholar]
- Makeig, S.; Kothe, C.; Mullen, T.; Bigdely-Shamlo, N.; Zhang, Z.; Kreutz-Delgado, K. Evolving Signal Processing for Brain-Computer Interfaces. Proc. IEEE 2012, 100, 1567–1584. [Google Scholar] [CrossRef]
- Rao, R.P.N. Brain-Computer Interfacing: An Introduction; Cambridge University Press: Cambridge, UK, 2013. [Google Scholar]
- Netzer, E.; Geva, A. Human-in-the-loop active learning via brain computer interface. Ann. Math. Artif. Intell. 2020, 88, 1191–1205. [Google Scholar] [CrossRef]
- Ferracuti, F.; Freddi, A.; Iarlori, S.; Monteriù, A.; Omer, K.I.M.; Porcaro, C. A human-in-the-loop approach for enhancing mobile robot navigation in presence of obstacles not detected by the sensory set. Front. Robot. AI 2022, 9, 2022. [Google Scholar] [CrossRef]
- Gemborn Nilsson, M.; Tufvesson, P.; Heskebeck, F.; Johansson, M. An open-source human-in-the-loop BCI research framework: Method and design. Front. Hum. Neurosci. 2023, 17, 2023. [Google Scholar] [CrossRef]
- Venot, T.; Desbois, A.; Corsi, M.C.; Hugueville, L.; Saint-Bauzel, L.; De Vico Fallani, F. Intentional binding for noninvasive BCI control. J. Neural Eng. 2024, 21, 046026. [Google Scholar] [CrossRef]
- Ji, Z.; Liu, Q.; Xu, W.; Yao, B.; Liu, J.; Zhou, Z. A Closed-Loop Brain-Computer Interface with Augmented Reality Feedback for Industrial Human-Robot Collaboration. Int. J. Adv. Manuf. Technol. 2023, 124, 3083–3098. [Google Scholar] [CrossRef]
- Aydarkhanov, R.; Ušćumlić, M.; Chavarriaga, R.; Gheorghe, L.; del R Millán, J. Closed-loop EEG study on visual recognition during driving. J. Neural Eng. 2021, 18, 026010. [Google Scholar] [CrossRef]
- Gao, H.; Luo, L.; Pi, M.; Li, Z.; Li, Q.; Zhao, K.; Huang, J. EEG-Based Volitional Control of Prosthetic Legs for Walking in Different Terrains. IEEE Trans. Autom. Sci. Eng. 2021, 18, 530–540. [Google Scholar] [CrossRef]
- Omer, K.; Ferracuti, F.; Freddi, A.; Iarlori, S.; Vella, F.; Monteriù, A. Real-Time Mobile Robot Obstacles Detection and Avoidance Through EEG Signals. Brain Sci. 2025, 15, 359. [Google Scholar] [CrossRef] [PubMed]
- Xu, B.; Li, W.; Liu, D.; Zhang, K.; Miao, M.; Xu, G.; Song, A. Continuous Hybrid BCI Control for Robotic Arm Using Noninvasive Electroencephalogram, Computer Vision, and Eye Tracking. Mathematics 2022, 10, 618. [Google Scholar] [CrossRef]
- Diraco, G.; Rescio, G.; Siciliano, P.; Leone, A. Review on Human Action Recognition in Smart Living: Sensing Technology, Multimodality, Real-Time Processing, Interoperability, and Resource-Constrained Processing. Sensors 2023, 23, 5281. [Google Scholar] [CrossRef]
- Pereira, R.; Cruz, A.; Garrote, L.; Pires, G.; Lopes, A.; Nunes, U.J. Dynamic environment-based visual user interface system for intuitive navigation target selection for brain-actuated wheelchairs. In Proceedings of the 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Napoli, Italy, 29 August–2 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 198–204. [Google Scholar]
- Sun, H.; Li, C.; Zhang, H. Image Segmentation-P300 Selector: A Brain–Computer Interface System for Target Selection. Comput. Mater. Contin. 2024, 79, 2505. [Google Scholar] [CrossRef]
- Mezzina, G.; De Venuto, D. Smart Sensors HW/SW Interface based on Brain-actuated Personal Care Robot for Ambient Assisted Living. In Proceedings of the 2020 IEEE SENSORS, Rotterdam, The Netherlands, 25–28 October 2020; pp. 1–4. [Google Scholar] [CrossRef]
- Ban, N.; Xie, S.; Qu, C.; Chen, X.; Pan, J. Multifunctional robot based on multimodal brain-machine interface. Biomed. Signal Process. Control 2024, 91, 106063. [Google Scholar] [CrossRef]
- Muñoz, J.E.; Chavarriaga, R.; Villada, J.F.; SebastianLopez, D. BCI and motion capture technologies for rehabilitation based on videogames. In Proceedings of the IEEE Global Humanitarian Technology Conference (GHTC 2014), San Jose, CA, USA, 10–13 October 2014; pp. 396–401. [Google Scholar] [CrossRef]
- Chen, C.; Jafari, R.; Kehtarnavaz, N. A Real-Time Human Action Recognition System Using Depth and Inertial Sensor Fusion. IEEE Sens. J. 2016, 16, 773–781. [Google Scholar] [CrossRef]
- Feng, X.; Weng, Y.; Li, W.; Chen, P.; Zheng, H. DAMUN: A Domain Adaptive Human Activity Recognition Network Based on Multimodal Feature Fusion. IEEE Sens. J. 2023, 23, 22019–22030. [Google Scholar] [CrossRef]
- Abbasi, H.F.; Ahmed Abbasi, M.; Jianbo, S.; Liping, X.; Yu, X. TriNet: A Hybrid Feature Integration Approach for Motor Imagery Classification in Brain-Computer Interface. IEEE Access 2025, 13, 115406–115418. [Google Scholar] [CrossRef]
- Mir, A.A.; Khalid, A.S.; Musa, S.; Faizal Ahmad Fauzi, M.; Norfiza Abdul Razak, N.; Boon Tang, T. Machine Learning in Ambient Assisted Living for Enhanced Elderly Healthcare: A Systematic Literature Review. IEEE Access 2025, 13, 110508–110527. [Google Scholar] [CrossRef]
- Tidoni, E.; Gergondet, P.; Fusco, G.; Kheddar, A.; Aglioti, S.M. The Role of Audio-Visual Feedback in a Thought-Based Control of a Humanoid Robot: A BCI Study in Healthy and Spinal Cord Injured People. IEEE Trans. Neural Syst. Rehabil. Eng. 2017, 25, 772–781. [Google Scholar] [CrossRef] [PubMed]
- Zhang, X.; Zhang, T.; Jiang, Y.; Zhang, W.; Lu, Z.; Wang, Y.; Tao, Q. A novel brain-controlled prosthetic hand method integrating AR-SSVEP augmentation, asynchronous control, and machine vision assistance. Heliyon 2024, 10, e26521. [Google Scholar] [CrossRef]
- Bellicha, A.; Struber, L.; Pasteau, F.; Juillard, V.; Devigne, L.; Karakas, S.; Chabardes, S.; Babel, M.; Charvet, G. Depth-sensor-based shared control assistance for mobility and object manipulation: Toward long-term home-use of BCI-controlled assistive robotic devices. J. Neural Eng. 2025, 22, 016045. [Google Scholar] [CrossRef]
- Li, S.; Wang, H.; Chen, X.; Wu, D. Multimodal Brain-Computer Interfaces: AI-powered Decoding Methodologies. arXiv 2025, arXiv:2502.02830. [Google Scholar] [CrossRef]
- Zakka, V.G.; Dai, Z.; Manso, L.J. Action Recognition in Real-World Ambient Assisted Living Environment. Big Data Min. Anal. 2025, 8, 914–932. [Google Scholar] [CrossRef]
- Caroleo, G.; Albini, A.; Maiolino, P. Soft Robot Localization Using Distributed Miniaturized Time-of-Flight Sensors. In Proceedings of the 2025 IEEE 8th International Conference on Soft Robotics (RoboSoft), Lausanne, Switzerland, 22–26 April 2025; pp. 1–6. [Google Scholar] [CrossRef]
- Ding, R.; Hovine, C.; Callemeyn, P.; Kraft, M.; Bertrand, A. A Wireless, Scalable, and Modular EEG Sensor Network Platform for Unobtrusive Brain Recordings. IEEE Sens. J. 2025, 25, 22580–22590. [Google Scholar] [CrossRef]
- Dickey, J. The Rise of Neurotech and the Risks for Our Brain Data: Privacy and Security Challenges—Future Security. March 2025. Available online: https://www.newamerica.org/future-security/reports/the-rise-of-neurotech-and-the-risks-for-our-brain-data/privacy-and-security-challenges/ (accessed on 12 July 2025).
- Xia, K.; Duch, W.; Sun, Y.; Xu, K.; Fang, W.; Luo, H.; Zhang, Y.; Sang, D.; Xu, X.; Wang, F.Y.; et al. Privacy-Preserving Brain–Computer Interfaces: A Systematic Review. IEEE Trans. Comput. Soc. Syst. 2023, 10, 2312–2324. [Google Scholar] [CrossRef]
- Wu, H.; Ma, Z.; Guo, Z.; Wu, Y.; Zhang, J.; Zhou, G.; Long, J. Online Privacy-Preserving EEG Classification by Source-Free Transfer Learning. IEEE Trans. Neural Syst. Rehabil. Eng. 2024, 32, 3059–3070. [Google Scholar] [CrossRef]
- Bechtold, U.; Stauder, N.; Fieder, M. Attitudes towards Technology: Insights on Rarely Discussed Influences on Older Adults’ Willingness to Adopt Active Assisted Living (AAL). Int. J. Environ. Res. Public Health 2024, 21, 628. [Google Scholar] [CrossRef]
- Botchway, B.; Ghansah, F.A.; Edwards, D.J.; Kumi-Amoah, E.; Amo-Larbi, J. Critical Smart Functions for Smart Living Based on User Perspectives. Buildings 2025, 15, 2727. [Google Scholar] [CrossRef]
- Bastardo, R.; Martins, A.I.; Pavão, J.; Silva, A.G.; Rocha, N.P. Methodological Quality of User-Centered Usability Evaluation of Ambient Assisted Living Solutions: A Systematic Literature Review. Int. J. Environ. Res. Public Health 2021, 18, 11507. [Google Scholar] [CrossRef] [PubMed]
- Padfield, N.; Anastasi, A.A.; Camilleri, T.; Fabri, S.; Bugeja, M.; Camilleri, K. BCI-controlled wheelchairs: End-users’ perceptions, needs, and expectations, an interview-based study. Disabil. Rehabil. Assist. Technol. 2024, 19, 1539–1551. [Google Scholar] [CrossRef] [PubMed]
- Kristen Mathews, T.S. Unlocking Neural Privacy: The Legal and Ethical Frontiers of Neural Data. March 2025. Available online: https://www.cooley.com/news/insight/2025/2025-03-13-unlocking-neural-privacy-the-legal-and-ethical-frontiers-of-neural-data (accessed on 12 July 2025).
- Yang, H.; Jiang, L. Regulating neural data processing in the age of BCIs: Ethical concerns and legal approaches. Digit. Health 2025, 11, 20552076251326123. [Google Scholar] [CrossRef]
- Ochang, P.; Eke, D.; Stahl, B.C. Perceptions on the Ethical and Legal Principles that Influence Global Brain Data Governance. Neuroethics 2024, 17, 23. [Google Scholar] [CrossRef]
- Sudhakaran, S.; Escalera, S.; Lanz, O. Gate-shift-fuse for video action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10913–10928. [Google Scholar] [CrossRef]
- Maruotto, I.; Ciliberti, F.K.; Gargiulo, P.; Recenti, M. Feature Selection in Healthcare Datasets: Towards a Generalizable Solution. Comput. Biol. Med. 2025, 196, 110812. [Google Scholar] [CrossRef]
- Wu, J.; Leng, C.; Wang, Y.; Hu, Q.; Cheng, J. Quantized convolutional neural networks for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4820–4828. [Google Scholar] [CrossRef]
- Qiao, D.; Wang, Z.; Liu, J.; Chen, X.; Zhang, D.; Zhang, M. EECF: An Edge-End Collaborative Framework with Optimized Lightweight Model. Expert Syst. Appl. 2025, 297, 129319. [Google Scholar] [CrossRef]
- Hajhassani, D.; Barthélemy, Q.; Mattout, J.; Congedo, M. Improved Riemannian potato field: An Automatic Artifact Rejection Method for EEG. Biomed. Signal Process. Control 2026, 112, 108505. [Google Scholar] [CrossRef]
- Iosifidis, A.; Tefas, A.; Pitas, I. Multi-view human action recognition under occlusion based on fuzzy distances and neural networks. In Proceedings of the 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 1129–1133. [Google Scholar]
- Bai, G.; Yan, H.; Liu, W.; Deng, Y.; Dong, E. Towards Lightest Low-Light Image Enhancement Architecture for Mobile Devices. Expert Syst. Appl. 2025, 296, 129125. [Google Scholar] [CrossRef]
- Kaveh, R.; Doong, J.; Zhou, A.; Schwendeman, C.; Gopalan, K.; Burghardt, F.L.; Arias, A.C.; Maharbiz, M.M.; Muller, R. Wireless user-generic ear EEG. IEEE Trans. Biomed. Circuits Syst. 2020, 14, 727–737. [Google Scholar] [CrossRef]
- Paul, A.; Lee, M.S.; Xu, Y.; Deiss, S.R.; Cauwenberghs, G. A versatile in-ear biosensing system and body-area network for unobtrusive continuous health monitoring. IEEE Trans. Biomed. Circuits Syst. 2023, 17, 483–494. [Google Scholar] [CrossRef] [PubMed]
- Lombardi, I.; Buono, M.; Giugliano, G.; Senese, V.P.; Capece, S. Usability and Acceptance Analysis of Wearable BCI Devices. Appl. Sci. 2025, 15, 3512. [Google Scholar] [CrossRef]
- Zanini, P.; Congedo, M.; Jutten, C.; Said, S.; Berthoumieu, Y. Transfer learning: A Riemannian geometry framework with applications to brain–computer interfaces. IEEE Trans. Biomed. Eng. 2017, 65, 1107–1116. [Google Scholar] [CrossRef] [PubMed]
- Carboni, A.; Russo, D.; Moroni, D.; Barsocchi, P. Privacy by design in systems for assisted living, personalised care, and wellbeing: A stakeholder analysis. Front. Digit. Health 2023, 4, 2022. [Google Scholar] [CrossRef]
- Kumar, D.; Kumar, C.; Seah, C.W.; Xia, S.; Shao, M. Finding Achilles’ Heel: Adversarial Attack on Multi-modal Action Recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 3829–3837. [Google Scholar] [CrossRef]
Activity | Vision-Based Devices | Wearable Devices |
---|---|---|
Body movements | RGB-D cameras, video cameras, thermal cameras | IMUs, electromyography sensors, photoplethysmogram sensors, skin conductance sensors |
Hand gestures (including eating and drinking) | RGB-D cameras, video cameras, thermal cameras | IMUs, electromyography sensors |
Sleeping | RGB-D cameras | photoplethysmogram sensors, skin conductance sensors |
Standing/falling | RGB-D cameras, video cameras | IMUs, electromyography sensors |
Sitting with mental occupation (including studying and working) | RGB-D cameras, video cameras | BCIs, photoplethysmogram sensors, skin conductance sensors |
Advantages | Disadvantages |
---|---|
reduced size | sensitivity to thermal noise and lighting variations |
non-invasive monitoring | high storage resources for onboard data processing |
costs and performance trade-off depending on sensing principle | wide occupied bandwidth for remote data processing |
Manufacturer | Commercial Name | Depth-Sensing Principle |
---|---|---|
Intel RealSense | D400 series | stereo vision
Intel RealSense | L515 | LiDAR technology
Microsoft | Kinect v1 | structured light (IR projection)
Microsoft | Kinect v2 | ToF
Orbbec | Astra Pro | ToF (nIR)
Others | Structure IO | structured light (IR projection)
Others | Asus Xtion Pro | structured light (IR projection)
Work | Architecture | Input Data | Outcomes | Key Novelty Point |
---|---|---|---|---|
Park et al. [47] | LSTM | Skeleton data | 99.5% accuracy on MSRC-12 [47] | Leveraging time sequential encoding of activity features |
Mitsuzumi et al. [56] | GCN | Skeleton data | 67.4% accuracy on NTU-RGB+ D60 [60], 57.7% on NTU-RGB+ D120 [61] | Introducing a subject-agnostic domain adaptation, randomizing the frequency phase of motion data, leaving amplitude unchanged |
Karthika et al. [58] | A stacked ensemble model made by SBGTGCN, 2DCNN+2P-LSTM, and 3DCNN+XGBoost | Skeleton data | 97.9% accuracy on NTU-RGB+ D60 [60], 97.2% on NTU-RGB+ D120 [61], 97.5% on Kinetics-700-2020 [62], 95.2% on MA-52 [63] | Deriving 3D skeletal points by Gaussian RBF, and designing a stacked ensemble model that integrates multiple base learners and a meta-learner |
Chen et al. [72] | GCN and Transformer model | Skeleton data | 92.7% accuracy on NTU-RGB+ D60 [60], 86.8% on NTU-RGB+ D120 [61], 39.0% on Kinetics-Skeleton [80] | Design of parallel GCN and Transformer streams for extracting local and global features as topological structures and inter-joint connections
Wu et al. [53] | 2D CNN | Video frames | 91.9% accuracy on NTU-RGB+ D60 [60], 92.2% on Kinetics-700-2020 [62], 96.9% on UCF101 [68], 73.7% on HMDB51 [81], 77.9% on STH-STH [82] | Design of a multi-level channel attention excitation module to retrieve highly discriminative video feature representation
Zong et al. [54] | Hybrid CNN-LSTM | Video frames | 94.7% accuracy on UCF101 [68], 67.2% on HMDB51 [81], 68.7% on Kinetics-700-2020 [62] | Design of a four-stream network, based on spatial and temporal saliency detection
Hussain et al. [55] | Hybrid CNN-LSTM | Video frames | 98.7% accuracy on UCF101 [68], 80.3% on HMDB51 [81], 98.9% on UCF50 [83], 98.9% on YouTube Action [84] | Design of a dynamic attention fusion unit and a temporal-spatial fusion network to extract human-centric features from temporal, spatial, and behavioral dependencies
Li et al. [64] | Spatial-temporal mixed module MLP | Video frames | 93.6% accuracy on CAD-120 [65], 86.1% on STH-STH V1 [82] | Introduction of text features for human-object interaction and of supervised natural language learning for augmentation of visual feature representation |
Elnady and Abdelmunim [67] | LSTM | Video frames | 96.0% accuracy on UCF101 [68], 99.0% on KTH [69], 98.0% on IXMAS [71], 100% on WEIZMANN [70] | Combining YOLO and LSTM to integrate highly discriminative features from individual frames and sequential temporal dynamics of motion
Work | Architecture | Input Data | Outcomes | Key Novelty Point |
---|---|---|---|---|
Bruce et al. [88] | GCN and attention mechanism | Skeleton and RGB data | 93.9% accuracy on NTU-RGB+ D60 [60], 90.5% on NTU-RGB+ D120 [61], 96.3% on PKU-MMD [74], 77.5% on Toyota Smarthome [79], 93.5% on Northwestern-UCLA Multiview [77] | Multi-modal network that fuses skeleton and RGB complementary information, using GCN to learn attention weights from skeleton that are then shared with the RGB network |
Kumar and Kumar [89] | Hybrid CNN-BiLSTM | Depth, RGB, and skeleton data | 96.2% accuracy on UTD-MHAD [78] | Design of a multi-view, multi-modal framework, where RGB, depth, and skeleton data are separately processed by 5S-CNN and BiLSTM networks, and the outputs are fused by a weighted product model |
Batool et al. [90] | CNN-GRU | RGB, depth, and inertial data | 97.9% accuracy on HWU-USP [76], 97.9% on Berkeley-MHAD [75], 96.6% on NTU-RGB+ D60 [60], 95.9% on NTU-RGB+ D120 [61], 97.9% on UTD-MHAD [78] | Introducing novel features extracted from RGB, depth, and inertial data, where redundant information is filtered out by a genetic algorithm
Liu et al. [95] | Hybrid GCN-CNN | Skeleton and RGB-D video | 94.8% accuracy on NTU-RGB+ D60 [60], 97.0% on PKU-MMD [74], 93.7% on Northwestern-UCLA Multiview [77] | Introducing a semantic-assisted framework where a text modality is added to RGB and skeleton data, with a visual-language module based on contrastive language-image pretraining
Song et al. [96] | Hybrid CNN-LSTM | Skeleton and RGB-D video | 93.8% accuracy on NTU-RGB+ D60 [60], 76.9% on MSR-3D [85], 85.7% on UCF101 [68], 64.8% on JHMDB [86] | Introducing a modality compensation network for fusing multiple representation modalities, including adaptation schemes for narrowing the distance between the distributions of different modalities
Liu et al. [97] | Hybrid GCN-CNN-RNN | Skeleton and RGB-D video | 94.3% accuracy on NTU-RGB+ D60 [60], 96.8% on PKU-MMD [74], 93.9% on Northwestern-UCLA Multiview [77] | Design of a temporal cues enhancement module for improving temporal modeling from the RGB modality
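Most multi-modal pipelines in the table above keep one backbone per modality and merge the per-modality predictions at a late stage (e.g., the weighted product fusion of Kumar and Kumar [89]). The sketch below shows a generic late-fusion step on pre-computed class probabilities; the weights, class count, and function name are illustrative assumptions and are not taken from any cited work.

```python
import numpy as np

def late_fusion(prob_rgb, prob_depth, prob_skeleton, weights=(0.4, 0.3, 0.3)):
    """Weighted-product fusion of per-modality class probabilities (illustrative)."""
    stacked = np.stack([prob_rgb, prob_depth, prob_skeleton])          # (3, num_classes)
    fused = np.prod(stacked ** np.asarray(weights)[:, None], axis=0)   # geometric-style weighting
    return fused / fused.sum()                                         # renormalize to a distribution

# Example with 5 activity classes and dummy per-modality softmax outputs.
rng = np.random.default_rng(0)
probs = [rng.dirichlet(np.ones(5)) for _ in range(3)]
print(late_fusion(*probs).round(3))
```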
Company/Product | Brief Description | Applications |
---|---|---|
Emotiv (EPOC X, MN8 EEG) | Wireless EEG headsets | Neuroscience research, neurofeedback, gaming, cognitive wellness monitoring |
OpenBCI (Ultracortex Mark IV, Cyton, Galea) | Open-source BCI hardware and software | Research, development, hobbyist projects, gaming, AR/VR/XR with integrated biosensing |
g.tec (Unicorn Brain Interface, g.HIamp PRO) | Complete BCI systems and components (amplifiers, EEG headsets) | Research, rehabilitation, clinical applications |
NeuroMaker BCI | BCI kit for educational purposes | Learning about neuroscience, visualizing brainwaves, mind-controlled games |
PiEEG | Low-cost EEG devices and BCI kits | Learning, research, development |
Neurable | BCI technology for controlling digital objects | Gaming, augmented/virtual reality, thought control |
NextMind (by Snap) | BCI technology for decoding neural activity | Controlling digital objects (acquired by Snap, focus on AR/VR) |
Advantages | Disadvantages |
---|---|
customizable properties for users | training phase required for users
invasive/non-invasive modes depending on the acquired signals | sensitivity to non-linearities and noise
signals monitored over multiple channels | non-stationary acquisition process
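BCI-side processing must cope with the noise and non-stationarity listed above; a common first step is band-pass filtering followed by band-power feature extraction. The sketch below shows one such step with SciPy on a synthetic single-channel signal; the sampling rate, band edges, and filter order are illustrative choices only.

```python
import numpy as np
from scipy.signal import butter, filtfilt, welch

FS = 250                      # assumed EEG sampling rate (Hz)
t = np.arange(0, 10, 1 / FS)  # 10 s of synthetic single-channel data
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)  # 10 Hz "alpha" + noise

# Band-pass filter to the 8-13 Hz alpha band (4th-order Butterworth, zero-phase).
b, a = butter(4, [8, 13], btype="bandpass", fs=FS)
alpha = filtfilt(b, a, eeg)

# Band power via Welch's periodogram, summed over the alpha band.
f, psd = welch(alpha, fs=FS, nperseg=FS * 2)
band = (f >= 8) & (f <= 13)
alpha_power = np.sum(psd[band]) * (f[1] - f[0])   # approximate band power (rectangle rule)
print(f"Alpha-band power: {alpha_power:.3f} (arbitrary units)")
```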
Study | Objective | Method | Population | Results |
---|---|---|---|---|
Pereira et al. [129] | A dynamic visual interface to navigate the indoor environment, which exploits RGB-D perception to improve BCI-based actuation of a wheelchair | RGB-D images and BCI signals, separately processed, are merged in a dynamic visual interface. User intent gathered by the P300 device is matched to bounding boxes detected on RGB-D images | 5 participants (23–31 years old, 3 males, 2 females) | 89.9% and 86.4% average accuracy for non-self-paced and self-paced selection, respectively, of 30 predefined target events, with an average effective symbols per minute (eSPM) rate of 4.8 and 4.7
Mezzina et al. [131] | A smart sensor system to implement a BCI-controlled human–robot interface for AAL | Direct communication path between the human brain and external actuator. BCI decodes user intention by a fast classifier; heterogeneous sensors (RGB-D cameras, sonar, and IR sensors) onboard a personal care robot are exploited to ensure correct actuation of user intent | 4 participants (26 ± 1 years old) | 84% accuracy in user intent classification—75% success rate in correct actuation execution |
Ban et al. [132] | A multifunctional (10 actions) NAO v6 robot control system based on a multi-modal brain–machine interface (BMI) that fuses three signals: steady-state visual evoked potential (SSVEP), electrooculography (EOG), and gyroscope data | Hybrid convolutional neural network–bidirectional long short-term memory (CNN-Bi-LSTM) architecture based on attention modules, to extract temporal information from sequence data and enable classification | 16 participants (19–25 years old, 8 males, 8 females) | 93.78% accuracy in complex-task completion across all participants; average response time of 3
Muñoz et al. [133] | A cost-effective rehabilitation system (named brain–kinect interface, BKI) based on videogames and multi-modal recordings of physiological signals (by a consumer-level EEG device + Kinect) | A gesture interaction module (using Kinect) and a BCI dynamically monitor physiological variables of patients while they play selected exergames for rehabilitation | Not specified for the validation of the proposed system; up to 700 patients involved in the exergames only | Technical figures not reported; improvements in postural balance (+15% balance time) and range of motion (up to +18%)
Tidoni et al. [138] | Visual information and auditory feedback fused to improve the BCI-based remote control of a humanoid surrogate (HRP-2 robot) by people with sensorimotor disorders | An SSVEP BCI classifier decodes user intent, which is associated with visually recognized objects captured by the embedded robot camera. Objects are paired with offline pre-defined tasks, triggered by the user SSVEP | 14 healthy participants (25.8 ± 6.0 years old, 6 females) and 3 subjects who had suffered traumatic spinal cord injury (22–31 years old, 3 males) | Action-related feedback may improve subject information processing and decisions about when to start an action; no significant differences in task completion time or object-placement accuracy
Zhang et al. [139] | To improve the interaction and practical application of a prosthetic hand with a BCI system, by integrating augmented reality (AR) technology | An asynchronous pattern recognition algorithm, combining center extended canonical correlation analysis and support vector machine (Center-ECCA-SVM), is proposed, together with a deep learning object detection algorithm (YOLOv4) to improve the level of user interaction in 8 pre-defined control modes | 12 participants | 96.7% average stimulus pattern recognition accuracy; 96.4% confidence in real-time prosthetic hand detection by the YOLOv4-tiny model
Bellicha et al. [140] | An assist-as-needed, sensor-based shared control (SC) method relying on the blending of BCI and depth-sensor-based control, targeting mobility and manipulation needs in home settings | Control shared between the user command (retrieved from a control interface) and sensor-based control driven by features detected by robot-embedded sensors | A quadriplegic patient | Task completion time and the number of changes in mental tasks were reduced, and unwanted actions were avoided
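Several of these studies follow the same integration pattern: the BCI decodes a discrete user intent, while the RGB-D (or robot-mounted) vision pipeline supplies candidate objects or targets, and a simple matching step binds the two (e.g., Pereira et al. [129], Tidoni et al. [138], Zhang et al. [139]). The sketch below illustrates only that binding step with placeholder data structures; the class labels, confidence fields, and decoded intent are hypothetical and not taken from the cited systems.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str          # object class predicted by the vision pipeline
    confidence: float   # detector confidence score
    bbox: tuple         # (x, y, w, h) in image coordinates

def bind_intent_to_target(decoded_intent, detections):
    """Return the most confident detection whose label matches the BCI-decoded intent."""
    candidates = [d for d in detections if d.label == decoded_intent]
    return max(candidates, key=lambda d: d.confidence) if candidates else None

# Hypothetical example: the BCI decodes "cup"; the RGB-D detector offers three candidates.
detections = [Detection("cup", 0.91, (120, 80, 40, 60)),
              Detection("book", 0.88, (300, 150, 70, 50)),
              Detection("cup", 0.64, (400, 90, 35, 55))]
target = bind_intent_to_target("cup", detections)
print(target)  # Detection(label='cup', confidence=0.91, bbox=(120, 80, 40, 60))
```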