Sensors
  • This is an early access version; the complete PDF, HTML, and XML versions will be available soon.
  • Article
  • Open Access

29 January 2026

A Lightweight Radar–Camera Fusion Deep Learning Model for Human Activity Recognition

Department of Information and Communication Engineering, Korea University of Technology and Education (KOREATECH), Cheonan-si 31253, Republic of Korea
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue AI-Based Computer Vision Sensors & Systems—2nd Edition

Abstract

Human activity recognition in privacy-sensitive indoor environments requires sensing modalities that remain robust under illumination variation and background clutter while preserving user anonymity. To this end, this study proposes a lightweight radar–camera fusion deep learning model that integrates motion signatures from FMCW radar with coarse spatial cues from ultra-low-resolution camera frames. The radar stream is processed as a Range–Doppler–Time cube, in which each frame is flattened and sequentially encoded by a Transformer-based temporal model to capture fine-grained micro-Doppler patterns. The visual stream employs a privacy-preserving 4×5-pixel camera input, from which a temporal sequence of difference frames is extracted and modeled with a dedicated camera Transformer encoder. The two modality-specific feature vectors, each representing the temporal dynamics of motion, are concatenated and passed through a lightweight fully connected classifier to predict human activity categories. A multimodal dataset of synchronized radar cubes and ultra-low-resolution camera sequences spanning 15 activity classes was constructed for evaluation. Experimental results show that the proposed fusion model achieves 98.74% classification accuracy, significantly outperforming the single-radar and single-camera baselines. Despite this high accuracy, the entire model requires only 11 million floating-point operations (11 MFLOPs), making it well suited for deployment on embedded or edge devices.
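To make the described pipeline concrete, the following is a minimal PyTorch sketch of a dual-Transformer fusion architecture of this kind; it is not the authors' implementation. The 4×5-pixel (20-value) difference frames, the 15 output classes, and the concatenation-plus-fully-connected-classifier structure follow the abstract, while the embedding width, head count, layer depth, sequence length, and the flattened Range–Doppler frame size (here 512 values) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ModalityTransformer(nn.Module):
    """Encodes a sequence of flattened per-frame features with a Transformer
    and pools over time to produce one feature vector per sequence."""
    def __init__(self, in_dim, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):            # x: (batch, time, in_dim)
        h = self.encoder(self.proj(x))
        return h.mean(dim=1)         # temporal average pooling -> (batch, d_model)

class RadarCameraFusion(nn.Module):
    """Two modality-specific Transformer encoders whose outputs are
    concatenated and fed to a lightweight fully connected classifier."""
    def __init__(self, radar_dim, cam_dim=20, n_classes=15, d_model=64):
        super().__init__()
        self.radar_enc = ModalityTransformer(radar_dim, d_model)
        self.cam_enc = ModalityTransformer(cam_dim, d_model)
        self.classifier = nn.Sequential(
            nn.Linear(2 * d_model, 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, radar_seq, cam_seq):
        fused = torch.cat([self.radar_enc(radar_seq),
                           self.cam_enc(cam_seq)], dim=-1)
        return self.classifier(fused)

# Hypothetical shapes: 32 time steps; each radar step is a flattened
# Range-Doppler map (assumed 512 values); each camera step is a flattened
# 4x5 difference frame (20 values, per the abstract).
model = RadarCameraFusion(radar_dim=512)
radar = torch.randn(8, 32, 512)    # (batch, time, range*doppler)
camera = torch.randn(8, 32, 20)    # (batch, time, 4*5 pixels)
logits = model(radar, camera)      # (batch, 15 activity classes)
```

Under these assumptions, late fusion by concatenation keeps the two encoders independent, so each modality can be ablated (single-radar or single-camera baseline) simply by classifying one encoder's output alone.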
