Article
Peer-Review Record

Event-Based Machine Vision for Edge AI Computing

by Paul K. J. Park 1,2, Junseok Kim 1, Juhyun Ko 1 and Yeoungjin Chang 2,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Sensors 2026, 26(3), 935; https://doi.org/10.3390/s26030935 (registering DOI)
Submission received: 18 December 2025 / Revised: 26 January 2026 / Accepted: 30 January 2026 / Published: 1 February 2026
(This article belongs to the Special Issue Next-Generation Edge AI in Wearable Devices)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript entitled “Event-Based Machine Vision for Edge AI Computing” addresses an important and timely topic, namely the use of event-based vision sensors (DVS) for low-latency and energy-efficient edge AI applications. The paper is well-written, technically coherent, and supported by extensive experimental results across multiple tasks (object detection, pose estimation, and hand posture recognition).

However, despite these strengths, the manuscript suffers from critical scientific limitations related to novelty, experimental rigor, comparative evaluation, and generalizability, which substantially weaken its contribution as an academic research article. In its current form, the work resembles an engineering integration and optimization study rather than a contribution that advances the state of the art in AI or computer vision.

1. Limited Novelty and Overlap with Prior Work (Critical Issue): The paper repeatedly claims efficiency and speedup advantages of event-based vision; however, most core ideas have already been well established in the literature. For example, the Introduction states “event-based sensors provide motion-centric information by reporting asynchronous brightness changes at pixels… attractive for edge AI computing” (p.2, lines 68–72). This observation is well-known and extensively covered in prior surveys and benchmarks (e.g., event-camera surveys and neuromorphic vision literature). Similarly, the claimed advantages of sparsity, reduced bandwidth, and low latency are incremental confirmations rather than new findings.

The proposed timestamp-based image generation (Section 3) is presented as a key contribution, yet the formulation “each pixel intensity represents the recency of activity rather than the number of events” (page 5) is conceptually close to previously reported time-surface and recency-based encodings widely used in event-based vision. The manuscript does not clearly articulate how this encoding is fundamentally different from existing time-surface or decay-based representations, nor does it provide a formal comparison against them.

Recommendation: The authors must explicitly clarify what is fundamentally new compared to prior time-surface and decay-based encodings, and why the proposed encoding constitutes a novel methodological contribution rather than an implementation variant.

2. Insufficient Baseline Comparisons (Major Methodological Weakness): Across all experimental sections, the evaluation lacks strong, modern baselines. For instance, in the action recognition validation (Section 3), performance is compared only against the “previous technique (temporal accumulation)” (page 6, Table 1) with a marginal improvement (0.908 vs. 0.896). This comparison is insufficient to support claims of superiority, as no comparison is made with alternative event representations, event-based deep learning models, or spiking neural networks and hybrid CNN-SNN approaches.

Similarly, in object detection and pose estimation, the manuscript reports impressive speedups (e.g., “more than 11 times speed-up”, page 8), yet comparisons are made only against conventional frame-based pipelines, not against state-of-the-art event-based detection or pose estimation frameworks.

Recommendation: Include quantitative comparisons with recent event-based AI methods (2023–2025), not only frame-based CNN baselines.

3. Questionable Generalizability of Experimental Results: The experiments are conducted under highly controlled and task-specific conditions, primarily focusing on indoor environments, human motion-centric scenarios, and proprietary or internally collected datasets. For example, the object detection dataset is described only as “we utilized 19.8 M DVS images for the training” (page 9), yet no information is provided regarding dataset availability, reproducibility, or cross-dataset validation.

Moreover, performance metrics such as recall (>95%) and FAR (<2%) are reported without confidence intervals, statistical significance analysis, or cross-validation, raising concerns about robustness.

Recommendation: The authors should provide statistical validation (variance, confidence intervals) and discuss limitations regarding dataset bias and deployment conditions.

4. Engineering Optimization vs. AI Contribution: Large parts of Sections 4–6 focus on network pruning, layer reduction, mixed-bit quantization, and stride manipulation, for example, “we aggressively quarter the kernel numbers… remove convolutional layers… roll back to the previous structure” (page 8). While these optimizations are valuable from an engineering standpoint, they do not constitute novel AI algorithms. Similar optimization strategies are widely used in edge AI deployment.

Recommendation: The manuscript should either be reframed clearly as a systems/engineering paper or introduce algorithmic novelty beyond architectural compression.

5. Terminology Precision: The term “bio-inspired computing method” (page 1) is used repeatedly without a clear definition or formal mapping to biological computation models.

6. Clarity of Figures: Figures 5 and 7 illustrate qualitative results but lack quantitative overlays (e.g., confidence scores, failure cases).

7. English Style: The language is generally clear, but some sections contain long descriptive paragraphs that could be condensed for clarity (e.g., Sections 4 and 6).

 

Author Response

Summary

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections.

Reviewer’s comment. The manuscript entitled “Event-Based Machine Vision for Edge AI Computing” addresses an important and timely topic, namely the use of event-based vision sensors (DVS) for low-latency and energy-efficient edge AI applications. The paper is well-written, technically coherent, and supported by extensive experimental results across multiple tasks (object detection, pose estimation, and hand posture recognition). However, despite these strengths, the manuscript suffers from critical scientific limitations related to novelty, experimental rigor, comparative evaluation, and generalizability, which substantially weaken its contribution as an academic research article. In its current form, the work resembles an engineering integration and optimization study rather than a contribution that advances the state of the art in AI or computer vision.

Response: Thank you for the detailed assessment and for acknowledging the relevance and technical coherence of our work. We agree that the earlier version required clearer positioning and stronger scientific framing with respect to novelty, rigor, comparative evaluation, and generalizability. In the revised manuscript, we have addressed these concerns by (i) explicitly clarifying the methodological novelty of the proposed encoding and its design rationale relative to prior time-surface/decay representations, (ii) strengthening experimental rigor and reporting (including clearer protocols, definitions, and statistical/robustness discussions where applicable), (iii) expanding quantitative comparisons to representative recent event-based baselines and reporting system-level efficiency metrics, and (iv) explicitly discussing the application scope and generalizability limits for the intended smart-home occupancy sensing scenario. Finally, we reframed the contribution more transparently as a deployment-oriented, edge-AI engineering study with concrete methodological components and system-level evidence, which we believe better matches the manuscript’s intent and strengthens its academic contribution.

Comments 1: Limited Novelty and Overlap with Prior Work (Critical Issue): The paper repeatedly claims efficiency and speedup advantages of event-based vision; however, most core ideas have already been well established in the literature. For example, the Introduction states “event-based sensors provide motion-centric information by reporting asynchronous brightness changes at pixels… attractive for edge AI computing” (p.2, lines 68–72). This observation is well-known and extensively covered in prior surveys and benchmarks (e.g., event-camera surveys and neuromorphic vision literature). Similarly, the claimed advantages of sparsity, reduced bandwidth, and low latency are incremental confirmations rather than new findings. The proposed timestamp-based image generation (Section 3) is presented as a key contribution, yet the formulation “each pixel intensity represents the recency of activity rather than the number of events” (page 5) is conceptually close to previously reported time-surface and recency-based encodings widely used in event-based vision. The manuscript does not clearly articulate how this encoding is fundamentally different from existing time-surface or decay-based representations, nor does it provide a formal comparison against them.

Recommendation: The authors must explicitly clarify what is fundamentally new compared to prior time-surface and decay-based encodings, and why the proposed encoding constitutes a novel methodological contribution rather than an implementation variant.

Response 1: We thank the reviewer for this important observation. We agree that the general advantages of event cameras (sparsity, low latency, reduced bandwidth) and the broad class of recency/time-surface encodings are well established. In the revision, we therefore (i) cite the relevant surveys and representation benchmarks and (ii) clarify the specific methodological contribution of Section 3 relative to SAE/time-surface methods (i.e., polarity-conditioned time surfaces). Our contribution is not to claim that “recency encoding” is new per se, but to introduce a deployment-oriented recency readout that is polarity-agnostic and edge-shape faithful for edge-intensity recognition tasks (object detection and posture/pose recognition) in intermittent-motion indoor scenes. Prior SAE/time-surface methods are frequently instantiated with separate surfaces per polarity (or with polarity-conditioned responses), which is useful when signed contrast and motion-direction cues are required. In our target tasks, however, the objective is to recognize the presence and geometry of edges rather than signed contrast. We therefore update a single per-pixel timestamp memory with any ON/OFF event and generate a single-channel recency image. In practical indoor recordings, ON/OFF events may be unbalanced for a given edge due to local texture, lighting, and motion. When polarity channels are handled separately, this can yield fragmented contours or visually weak edge evidence in one channel. Polarity-agnostic fusion mitigates this failure mode and yields more stable edge shapes. Thus, we added an explicit “relation to SAE and time-surface representations” discussion in Section 3, clarifying that our decay mapping is within the time-surface family but differs in its polarity-agnostic update tailored to object/posture recognition. In addition, we presented a recency-based event-image representation with an empirical Ts selection guide (20–100 ms depending on motion speed), preserving moving-edge structure while reducing redundant data for downstream processing.

→ (Section 3) To use a DVS for vision tasks, the asynchronous event stream must be converted to an image-like input. A common baseline is temporal accumulation, where events are summed within a fixed time window to produce a frame image [12]. While simple, accumulation can be unreliable in edge AI settings with frequent low-motion intervals (e.g., indoor monitoring), because sparse events may yield weak contours and the representation does not explicitly encode how recently a pixel was activated. Recency-based representations address this by storing the latest event timestamp at each pixel (often referred to as a Surface of Active Events, SAE [13]) and mapping the elapsed time to an intensity value. In time-surface methods [14–16], this mapping is frequently implemented as an exponential decay and is commonly polarity-conditioned. Such polarity separation is useful when one explicitly models signed contrast changes or directional motion cues, but it is not strictly required for edge-shape recognition problems where the goal is to detect the presence and geometry of edges. Thus, in this work, we introduce a polarity-agnostic global recency encoding—defined by a timestamp-based intensity mapping—that is explicitly tailored to edge-intensity tasks (object detection, human pose estimation, and hand posture recognition). Instead of maintaining separate recency maps for ON and OFF events, we update a single per-pixel timestamp memory with any event regardless of polarity. This avoids contour fragmentation when ON/OFF events are imbalanced (e.g., edges that predominantly generate only one polarity under certain motion/lighting conditions) and reduces memory/compute by eliminating multi-channel polarity handling.

→ (Section 3) In our indoor setup, we set Ts = 100 ms and Imax = 255 for 8-bit grayscale scaling. Empirically, Ts acts as a motion-dependent temporal sensitivity knob. We found that Ts ≈ 20 ms is suitable for near-field, fast hand-gesture motion (~1 m), whereas Ts ≈ 100 ms provides the best trade-off for typical indoor human motion at longer range (~5 m), preserving moving-edge structure while attenuating spurious/noisy events. Accordingly, we use Ts = 20 ms for hand posture recognition and Ts = 100 ms for human detection and pose estimation in this work.

- reference 13: Mueggler, E.; Bartolozzi, C.; Scaramuzza, D. Fast event-based corner detection. In Proceedings of British Machine Vision Conference, London, UK, 4–7 September 2017.

- reference 14: Gehrig, D.; Loquercio, A.; Derpanis, K.; Scaramuzza, D. End-to-end learning of representations for asynchronous event-based data. In Proceedings of IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019.

- reference 15: Lagorce, X.; Orchard, G.; Galluppi, F.; Shi, B.E.; Benosman, R.B. HOTS: a hierarchy of event-based time-surfaces for pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 1346–1359.

- reference 16: Sironi, A.; Brambilla, M.; Bourdis, N.; Lagorce, X.; Benosman, R. HATS: histograms of averaged time surfaces for robust event-based object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
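[Editor's illustration] For concreteness, a minimal sketch of the polarity-agnostic recency encoding described in the revised Section 3 is given below (Python/NumPy). The event-tuple format, the sensor resolution, and the linear decay ramp are illustrative assumptions; the manuscript does not fix the exact decay form, and an exponential mapping would fit the same time-surface family.

import numpy as np

def update_timestamps(ts_map, events):
    # ts_map: (H, W) float array holding the latest event time (in seconds) per pixel.
    # events: iterable of (t, x, y, polarity) tuples; polarity is ignored (polarity-agnostic update).
    for t, x, y, _polarity in events:
        ts_map[y, x] = t  # any ON or OFF event refreshes the pixel's recency
    return ts_map

def recency_image(ts_map, t_now, Ts=0.1, Imax=255):
    # Map elapsed time since the last event to an 8-bit intensity: a pixel updated just now
    # maps to Imax, and a pixel older than Ts maps to 0. A linear ramp is used here for
    # illustration; an exponential decay exp(-(t_now - ts)/Ts) would be an equally valid
    # member of the time-surface family.
    age = np.clip(t_now - ts_map, 0.0, Ts)
    return (Imax * (1.0 - age / Ts)).astype(np.uint8)

# Example with Ts = 100 ms (indoor human motion) on an assumed 480x640 sensor.
ts_map = np.full((480, 640), -np.inf)  # no events yet -> intensity 0 everywhere
ts_map = update_timestamps(ts_map, [(0.42, 100, 200, 1), (0.47, 101, 200, 0)])
frame = recency_image(ts_map, t_now=0.50, Ts=0.1, Imax=255)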

Comments 2: Insufficient Baseline Comparisons (Major Methodological Weakness): Across all experimental sections, the evaluation lacks strong, modern baselines. For instance, in the action recognition validation (Section 3), performance is compared only against the “previous technique (temporal accumulation)” (page 6, Table 1) with a marginal improvement (0.908 vs. 0.896). This comparison is insufficient to support claims of superiority, as no comparison is made with alternative event representations, event-based deep learning models, or spiking neural networks and hybrid CNN-SNN approaches. Similarly, in object detection and pose estimation, the manuscript reports impressive speedups (e.g., “more than 11 times speed-up”, page 8), yet comparisons are made only against conventional frame-based pipelines, not against state-of-the-art event-based detection or pose estimation frameworks. Recommendation: Include quantitative comparisons with recent event-based AI methods (2023–2025), not only frame-based CNN baselines.

Response 2: We thank the reviewer for the thorough and constructive feedback. We agree that the initial submission did not include sufficiently strong and modern baselines across all experiments, and that comparing only against temporal accumulation (Table 1) and only against frame-based pipelines (object detection/pose) is insufficient to support broad claims. Our primary contribution is not a new end-to-end action recognition / detection / pose network. Instead, we introduce an edge-oriented event-to-image representation (polarity-agnostic recency / timestamp-based image) that is designed to preserve edge-shape evidence relevant to object detection and posture/pose recognition. Therefore, the most direct and fair evaluation of the proposed method is a representation-level controlled comparison, i.e., fixing the downstream network and training protocol and changing only the event representation. At the same time, we agree with the reviewer that we should additionally provide modern (2023–2025) reference baselines to contextualize performance and compute relative to state-of-the-art event-based AI. We explicitly emphasize in the revised manuscript that many 2023–2025 state-of-the-art methods improve accuracy by scaling up model size and the number of networks (e.g., larger backbones, transformer-based architectures, dual-network setups). While this direction is effective for maximizing benchmark accuracy, it can be misaligned with edge deployment constraints (memory footprint, bandwidth, latency, and power). We hope these revisions address the reviewer’s methodological concerns and provide a clearer, fairer, and more modern baseline landscape aligned with the paper’s core objective: edge-realistic accuracy with minimal compute and latency.

→ (Section 3) To validate the proposed technique, we performed the action recognition task using the human activity data set and DVS event simulator [17]. The public NTU RGB+D 120 human activity dataset includes a large-scale benchmark containing 120 action classes and 114,480 samples captured from 106 subjects [18]. The dataset provides synchronized multi-modal streams including RGB, depth, infrared (IR), and 3D skeletons (25 body joints) recorded with three Microsoft Kinect v2 cameras. On NTU RGB+D 120, Table 1 reports a representation-level ablation in which the downstream network and training protocol are fixed and only the event-to-image encoding is changed. The proposed polarity-agnostic recency image improves top-1 accuracy from 89.6% (temporal accumulation) to 90.8% (+1.2 percentage points), indicating that recency-aware edge encoding provides more stable cues than pure event counts without altering the downstream architecture. The improvement indicates that the timestamp-based encoding provides a more informative and robust representation than temporal accumulation, especially when the motion level is low or intermittent. This is meaningful for real edge environments because the input stream is often sparse and non-stationary, and a stable representation can directly improve downstream recognition reliability without increasing the data volume. For context, recent NTU120 action-recognition literature (2023–2025) typically reports accuracies around 90–92% for strong skeleton-based methods, while multimodal approaches may reach into the low-to-mid 90% range with additional modalities and higher compute [19–24]. Because our Table 1 is designed to isolate the impact of encoding under a fixed downstream network (and uses event-simulated inputs), these numbers serve as a contextual benchmark level rather than a direct SOTA comparison.

- reference 17: Radomski, A.; Georgiou, A.; Debrunner, T.; Li, C.; Longinotti, L.; Seo, M.; Kwak, M.; Shin, C.; Park, P.; Ryu, H.; et al. Enhanced frame and event-based simulator and event-based video interpolation network, arXiv 2021, arXiv:2112.09379.

- reference 18: Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A. C. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020, 42, 2684–2701.

- reference 19: Lee, J.; Lee, M.; Cho, S.; Woo, S.; Jang, S.; Lee, S. Leveraging spatio-temporal dependency for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.

- reference 20: Zhou, Y.; Yan, X.; Cheng, Z.-Q.; Yan, Y.; Dai, Q.; Hua, X.-S. BlockGCN: redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.

- reference 21: Reilly, D.; Das, S. Just add π! pose induced video transformers for understanding activities of daily living. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.

- reference 22: Yang, Y. Skeleton-based action recognition with non-linear dependency modeling and hilbert-schmidt independence criterion. arXiv 2024, arXiv:2412.18780.

- reference 23: Cheng, Q.; Cheng, J.; Liu, Z.; Ren, Z.; Liu, J. A dense-sparse complementary network for human action recognition based on RGB and skeleton modalities. Expert Systems with Applications 2024, 244, 123061.

- reference 24: Abdelkawy, A.; Ali, A.; Farag, A. EPAM-Net: an efficient pose-driven attention-guided multimodal network for video action recognition. Neurocomputing 2025, 633, 129781.

→ (Section 4) To contextualize compute and accuracy on modern event-based detection benchmarks, we reference RVT [31] and its recent sparse-attention variant SAST [32], which are representative state-of-the-art event-based detectors on Prophesee automotive datasets (e.g., Gen1 and 1Mp). In these benchmarks, the event sensor is mounted on a moving vehicle, so the input stream contains strong ego-motion and dense background events (roads, buildings, trees, signs), making foreground separation substantially more challenging than static-camera indoor monitoring. SAST further reports backbone compute in terms of FLOPs and “A-FLOPs” (attention-related FLOPs) averaged over test samples, showing that RVT and SAST typically operate in the 0.8–2.2 G A-FLOPs regime (and higher when counting full backbone FLOPs), reflecting the need for large transformer-style backbones to achieve high mAP under automotive ego-motion. The much smaller compute of our method (81 MFLOPs) reflects a different problem setting and design objective. First, our target scenario assumes a static DVS in indoor monitoring, where background is largely suppressed and the stream is dominated by edges induced by moving subjects; therefore, the detector primarily needs to recognize edge-shape presence (e.g., person vs. non-person or presence counting) rather than solve full-scene, ego-motion-compensated multi-object detection. Second, we intentionally design an ultra-lightweight backbone (e.g., aggressive channel reduction, removing layers, and using larger early strides) to minimize edge compute, which directly reduces FLOPs by an order of magnitude compared to SOTA transformer backbones. Third, in our static-camera setting the DVS naturally suppresses background, which reduces the number of candidate regions/proposals and contributes to runtime reduction beyond what FLOPs alone predicts (we observe large end-to-end latency reductions together with compute reductions). In contrast, RVT/SAST are designed for automotive ego-motion benchmarks and aim at high mAP under dense background events, which requires substantially larger capacity and attention computation even after sparsity optimization.

- reference 31: Gehrig, M.; Scaramuzza, D. Recurrent vision transformers for object detection with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.

- reference 32: Peng, Y.; Li, H.; Zhang, Y.; Sun, X.; Wu, F. Scene adaptive sparse transformer for event-based object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.

→ (Section 5) Recent frame-based 2D human pose estimation commonly adopts HRNet-W48 as a strong baseline, and performance improvements are often achieved by either increasing model capacity or using multiple networks during training/inference. In particular, DE-HRNet [37] enhances HRNet-style high-resolution features by introducing detail-enhancement components and reports a modest gain on COCO test-dev (384×288): compared with HRNet-W48 (AP 0.755 with 63.6M parameters), DE-HRNet-W48 reports AP 0.757 while increasing the backbone size and compute to 74.8M parameters (which corresponds to an FP16 checkpoint size increase from roughly 127 MB to 150 MB). In a different direction, Boosting Semi-Supervised 2D HPE [38] improves accuracy primarily via stronger semi-supervised training and, for higher performance, adopts a dual-network setting (i.e., two identical yet independent networks). In their COCO test-dev comparison, the dual-network configuration effectively doubles model capacity relative to a single HRNet-W48 backbone and achieves improved accuracy (e.g., AP 0.772 in a dual setting). In contrast, our work targets event-based edge deployments (e.g., static-camera indoor monitoring) and focuses on improving the event-to-image representation (polarity-agnostic recency encoding) so that competitive accuracy and real-time operation can be achieved without scaling up the downstream network. In other words, while these benchmarks exemplify the common strategy of improving COCO pose accuracy by increasing backbone capacity or the number of networks, our approach emphasizes representation-level efficiency tailored to DVS streams and edge constraints.

- reference 37: Liu, Y.; Zhou, G.; He, W.; Zhu, H.; Cui, Y. DE-HRNet: detail enhanced high-resolution network for human pose estimation. PLoS One 2025, 20, 0325540.

- reference 38: Zhou, H.; Luo, M.; Jiang, F.; Ding, Y.; Lu, H.; Jia, K. Boosting semi-supervised 2D human pose estimation by revisiting data augmentation and consistency training. arXiv 2024, arXiv:2402.11566.

Comments 3: Questionable Generalizability of Experimental Results: The experiments are conducted under highly controlled and task-specific conditions, primarily focusing on indoor environments, human motion-centric scenarios, and proprietary or internally collected datasets. For example, the object detection dataset is described only as “we utilized 19.8 M DVS images for the training” (page 9), yet no information is provided regarding dataset availability, reproducibility, or cross-dataset validation. Moreover, performance metrics such as recall (>95%) and FAR (<2%) are reported without confidence intervals, statistical significance analysis, or cross-validation, raising concerns about robustness.

Recommendation: The authors should provide statistical validation (variance, confidence intervals) and discuss limitations regarding dataset bias and deployment conditions.

Response 3: We agree that our experiments focus on indoor, human motion–centric scenarios under a static-camera setting. This is intentional and aligned with the target application of the proposed algorithm. Our method is specifically designed for home surveillance occupancy detection, i.e., determining whether a person is present or absent in a home environment (residential indoor scenes) under typical deployment constraints (always-on operation, low latency, and limited edge compute). Therefore, the appropriate notion of “generalizability” for our work is robustness across home environments and household conditions, rather than universality across fundamentally different domains such as outdoor driving/ego-motion benchmarks or multi-category detection in unconstrained scenes. The dataset used for Section 4 was collected internally in real home-surveillance settings and contains company-sensitive information and privacy-related content (e.g., residential environments). For these reasons, the full dataset cannot be publicly released. Although we cannot release raw data, we have revised the manuscript to improve reproducibility in the following ways: We clarified the exact meaning of “19.8M DVS images” as the number of recency frames (event-to-image samples) generated from the DVS stream at a fixed readout/windowing protocol, rather than conventional RGB frames. We clearly defined the train/validation/test split rules used in our experiments, including temporal/environmental separation. These changes aim to maximize methodological reproducibility while respecting confidentiality constraints. To address concerns about dataset bias and controlled conditions, we additionally validated the method on a separate dataset collected under different home conditions (e.g., different rooms and/or different time periods) that was not used for training. In addition, we emphasized that these results are deployment-condition dependent: illumination and subject–sensor distance affect event sparsity and edge contrast, which in turn changes both mean performance and variability.

→ (Section 4) The proposed algorithm is primarily designed for home surveillance occupancy detection, i.e., determining whether a person is present or absent in residential indoor environments. Accordingly, our experiments focus on static-camera, indoor, human motion–centric scenarios, which reflect the intended deployment conditions (always-on operation, low latency, and edge compute constraints). In this context, for data generalization, we recorded DVS images across 10 home conditions (e.g., different rooms, illumination changes, daily activity patterns, and household layouts). Figure 7 shows representative sample images from the training data set, which were collected to reflect realistic indoor conditions such as cluttered backgrounds, different distances to the sensor, and diverse human motions. The positive human data set includes various ages, heights, genders, and clothing styles with diverse actions such as walking, jumping, duck-walking, crawling, and overlapping. This diversity is important because DVS images often contain sparse edge-like patterns rather than textured appearance; therefore, robust detection requires the model to learn motion-driven body contours that may change significantly with posture, speed, and partial occlusions. In particular, overlapping and crawling cases are challenging for home surveillance because the visible edge structures can be fragmented and the scale/aspect ratio of the human region can vary rapidly. In addition, the negative data set includes typical indoor objects and distractors such as chairs, curtains, TVs, dogs, cats, dolls, fans, and robotic vacuum cleaners. These negatives are intentionally included because many household objects can generate event responses (e.g., moving fan blades, robotic vacuum motion, or pet movement), which may otherwise cause false alarms. By training with such hard-negative examples, the detector can better distinguish true human motion patterns from non-human motion and background dynamics in a practical deployment setting. In total, we utilized 19.8 M DVS images for the training. Specifically, we generated 19.8 M recency frames at 10 Hz from approximately 550 hours of recordings collected across 10 rooms, three days, and 8 participants. In addition, we constructed independent validation and test datasets (2 M DVS images each) collected under different home conditions (e.g., different rooms and/or different time periods) that were not used for training. While the proposed approach is effective for static-camera indoor occupancy detection, its performance may degrade under conditions that violate the deployment assumptions, such as strong camera ego-motion, outdoor scenes with dense background events (e.g., wind-driven foliage, rain, strong illumination flicker), or tasks requiring fine-grained multi-class recognition across many object categories. Our method is therefore best interpreted as an edge-oriented representation and lightweight inference strategy for home surveillance. As a result, the recall accuracy and False Acceptance Rate (FAR) were measured to be >95% and <2%, respectively, indicating that the constructed dataset and training strategy are effective for reliable human detection under realistic smart-home scenarios. Under identical evaluation settings, we observe that the recall std is typically around ~0.5%, and the false alarm rate (FAR) std is typically around ~0.1%. However, these statistical values are condition-dependent and should not be interpreted as universal constants.
In home surveillance, the DVS event stream varies with ambient illumination and subject–sensor distance. As a result, both the mean performance and its dispersion change across environmental buckets. For example, when the illumination is ≥10 lux and the subject is within 5 m, recall is approximately 98%; under the same illumination, recall decreases to approximately 96% when the distance increases to 5–7 m. Under dimmer illumination (5–10 lux), recall is approximately 96% within 5 m, and further drops to approximately 92% at 5–7 m. Importantly, the std also differs across these buckets, reflecting different levels of event sparsity and edge contrast under varying illumination and range.
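[Editor's illustration] As a quick consistency check on the quoted dataset size (simple arithmetic over the figures stated above, not additional data):

# Consistency check for the dataset size quoted above: ~550 hours of recordings
# read out as recency frames at 10 Hz.
hours, frames_per_second = 550, 10
total_frames = hours * 3600 * frames_per_second
print(total_frames)  # 19800000, i.e. the 19.8 M DVS images used for training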

Comments 4: Engineering Optimization vs. AI Contribution: Large parts of Sections 4–6 focus on network pruning, layer reduction, mixed-bit quantization, and stride manipulation, for example, “we aggressively quarter the kernel numbers… remove convolutional layers… roll back to the previous structure” (page 8). While these optimizations are valuable from an engineering standpoint, they do not constitute novel AI algorithms. Similar optimization strategies are widely used in edge AI deployment.

Recommendation: The manuscript should either be reframed clearly as a systems/engineering paper or introduce algorithmic novelty beyond architectural compression.

Response 4: We thank the reviewer for the thoughtful comment. We agree that pruning, layer reduction, mixed-bit quantization, and stride manipulation are widely used engineering strategies for edge AI deployment and should not be interpreted as standalone algorithmic novelty. In the revised manuscript, we keep the structure of Sections 4–6 unchanged, but we explicitly clarify their role by adding a short statement at the beginning of each section emphasizing that these parts focus on Optimization for Edge AI / systems engineering considerations required for real-time deployment. We also explicitly state that we do not claim these optimizations as novel AI algorithms; they are included to document reproducible implementation choices and to meet strict latency/compute constraints. To address the reviewer’s concern about AI contribution, we strengthened the paper’s positioning at the end of the Introduction by clearly summarizing our contributions. Specifically, the core methodological (AI) contribution of this work is the polarity-agnostic recency (timestamp) encoding (Section 3), which is designed to preserve motion-induced edge shape while avoiding polarity-dependent failure cases relevant to home-surveillance occupancy-oriented recognition. We further emphasize that the recency-encoding representation is evaluated under controlled settings with fixed downstream networks, isolating the effect of event-to-image encoding from architectural compression. Sections 4–6 then provide the system-level evaluation and edge deployment context, demonstrating that the overall pipeline can operate in real time under edge constraints and reporting latency/compute/accuracy trade-offs. We believe these revisions address the reviewer’s concern by making the manuscript’s scope explicit: the paper contributes (i) an event representation method (algorithmic component) and (ii) a deployment-oriented engineering evaluation showing how standard edge optimizations enable practical, real-time operation in the targeted home-surveillance setting.

→ (Introduction) The contributions of this paper are summarized as follows:

  • Polarity-agnostic recency encoding for edge-centric perception. We introduce a polarity-agnostic timestamp/recency image representation that emphasizes motion-induced edge shape while avoiding polarity-dependent failure cases, making it suitable for occupancy-oriented home surveillance tasks.
  • Controlled evaluation with fixed downstream networks. We validate the proposed representation under fixed downstream network and training protocols, isolating the effect of event-to-image encoding from architectural changes, and demonstrate that competitive recognition performance can be achieved with lightweight models.
  • Edge deployment optimization and system-level evaluation (Sections 4–6). We present a practical edge pipeline and report end-to-end latency/compute/accuracy trade-offs. The presented pruning/layer reduction, mixed-bit quantization, and stride choices are standard engineering optimizations documented for reproducibility, showing that the proposed method can operate in real time under strict edge constraints.

→ (Section 4) This section focuses on system-level engineering optimizations required to meet real-time constraints in edge home-surveillance deployments. We emphasize that the following techniques—such as channel/parameter reduction, pruning-related simplifications, stride manipulation, and quantization-aware choices—are standard practices in edge AI deployment and are not claimed as standalone algorithmic novelty. Rather, they serve two purposes: (i) to demonstrate that the proposed event representation can be executed within strict latency/compute budgets, and (ii) to provide reproducible implementation details for practitioners targeting similar edge devices and always-on monitoring scenarios.

→ (Section 5) Section 5 continues the deployment-oriented evaluation under edge constraints, reporting accuracy–latency trade-offs achievable with lightweight inference. The design choices presented here reflect practical engineering considerations (memory footprint, throughput, and real-time responsiveness) in home surveillance.

Comments 5: Terminology Precision: The term “bio-inspired computing method” (page 1) is used repeatedly without a clear definition or formal mapping to biological computation models.

Response 5: We thank the reviewer for highlighting the terminology issue. We agree that the phrase “bio-inspired computing” was used too broadly and could imply a formal mapping to biological computation models, which was not our intention. In the revised manuscript, we replace this term with event-based computing.

Comments 6: Clarity of Figures: Figures 5 and 7 illustrate qualitative results but lack quantitative overlays (e.g., confidence scores, failure cases).

Response 6: We agree that the original Figures 5 and 7 were primarily qualitative. In the revised manuscript, we added quantitative overlays to Fig. 5 by explicitly reporting the RPN proposal count, which directly reflects the computational burden of proposal-based detectors. Specifically, Fig. 5(a) now indicates that the CIS-based Faster R-CNN baseline visualizes N = 300 RPN proposals per image, following the standard setting reported in the Faster R-CNN reference. Fig. 5(b) uses the DVS-specific detector described in the latter part of Section 4 and shows that the average retained proposal count is ~9, i.e., reduced by a few tens of times. Since the primary motivation for using DVS in this work is edge deployment and model light-weighting, the proposal-count reduction provides a clear quantitative explanation of why DVS is beneficial in our system. We also added a clear quantitative explanation of Fig. 7 in the revised manuscript.

→ (Figure 5 in Section 4) Object detection is required for home surveillance. Here, we employed DenseNet and Darknet architectures as feature extractors for human detection. The Faster R-CNN (FRCNN) structure gives the probabilistic location (region proposal) of humans. Figure 5 shows the region proposal results based on CIS and DVS images. The CIS image requires a huge amount of computation because it produces many proposals due to background information, while the DVS image includes only moving foreground objects, which in turn reduces the computational cost dramatically [27]. In FRCNN [28], the Region Proposal Network (RPN) generates candidate regions, and the downstream RoI classification/regression cost scales with the number of retained proposals. Following the standard Faster R-CNN setting reported in the original paper (i.e., using N = 300 proposals per image while maintaining strong detection accuracy), we visualize the top-300 RPN proposals for the CIS-based baseline in Figure 5a. Figure 5b uses the DVS-specific detector described in the latter part of Section 4 and shows that the average retained proposal count is ~9, i.e., reduced by a few tens of times. Since the primary motivation for using DVS in this work is edge deployment and model light-weighting, the proposal-count reduction provides a clear quantitative explanation of why DVS is beneficial in our system.

- reference 28: Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 1137–1149.

→ (Figure 7 in Section 4) As a result, the recall accuracy and False Acceptance Rate (FAR) were measured to be >95% and <2%, respectively, indicating that the constructed dataset and training strategy are effective for reliable human detection under realistic smart-home scenarios. Under identical evaluation settings, we observe that the recall std is typically around ~0.5%, and the false alarm rate (FAR) std is typically around ~0.1%. However, these statistical values are condition-dependent and should not be interpreted as universal constants. In home surveillance, the DVS event stream varies with ambient illumination and subject–sensor distance. As a result, both the mean performance and its dispersion change across environmental buckets. For example, when the illumination is ≥10 lux and the subject is within 5 m, recall is approximately 98%; under the same illumination, recall decreases to approximately 96% when the distance increases to 5–7 m. Under dimmer illumination (5–10 lux), recall is approximately 96% within 5 m, and further drops to approximately 92% at 5–7 m. Importantly, the std also differs across these buckets, reflecting different levels of event sparsity and edge contrast under varying illumination and range.

Comments 7: The language is generally clear, but some sections contain long descriptive paragraphs that could be condensed for clarity (e.g., Sections 4 and 6).

Response 7: We thank the reviewer for the suggestion regarding English style and readability. We agree that the original manuscript contained overly long descriptive paragraphs, particularly in Sections 4 and 6. In the revised manuscript, we condensed and rewrote these paragraphs while preserving the technical content.

→ (Section 4) Specifically, we reduced channel widths (up to quartering kernels) and removed convolutional layers, reverting to the previous configuration whenever performance degraded. To further lower computation, we used larger strides in the first two layers, which reduced feature-map resolutions in later stages and thus decreased overall cost. This early downsampling also encourages a compact representation of the sparse event input, avoiding unnecessary computation on noisy fine details.

→ (Section 6) Leveraging the sparsity and binary nature of DVS images, we designed a low-latency classifier with five convolutional layers and two fully connected layers (Figure 8). The convolutional stack extracts hierarchical edge/motion features from the sparse inputs, and the fully connected layers perform compact classification. We intentionally kept the model small to reduce memory access and execution depth, enabling real-time inference on low-end processors.
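[Editor's illustration] A network of the scale described above could be sketched as follows in PyTorch; channel widths, strides, the input resolution, and the class count are assumptions for the sketch rather than the exact configuration of Figure 8.

import torch
import torch.nn as nn

class HandPostureNet(nn.Module):
    # Minimal five-conv / two-FC classifier for single-channel DVS recency images.
    # Channel widths, strides, the 128x128 input size, and the class count are
    # illustrative assumptions; the network in Figure 8 may differ in these details.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),  # 8 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = HandPostureNet(num_classes=10)
logits = model(torch.zeros(1, 1, 128, 128))  # one recency-image crop around the hand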

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper proposes a novel event-based machine vision framework tailored for resource-constrained edge AI devices. Instead of conventional frame-based imagery, the authors utilize a Dynamic Vision Sensor (DVS) that captures asynchronous events (pixel-level brightness changes) to enable low-latency, energy-efficient vision processing. A timestamp-based encoding method converts raw event streams into frame-like representations, and the neural network models are streamlined for edge inference via layer reduction, enlarged convolution strides, pruning, and quantization.

Three representative vision tasks are demonstrated: (1) human counting/object detection for smart home surveillance, (2) human pose estimation, and (3) hand posture (gesture) recognition for human–machine interaction. For each task, the paper outlines how event data is fed into adapted CNN-based algorithms.

  • For detection/human counting, the event frames are processed by a CNN backbone and a region proposal-based head to localize humans, with architecture simplifications (fewer layers, larger strides) and mixed-precision quantization to reduce computation.
  • For pose estimation, a high-resolution network (HRNet) is pruned to remove redundant computations on sparse DVS inputs, maintaining accuracy with a much smaller model.
  • For hand posture recognition, a lightweight classifier is used.

The experimental evaluation shows that the event-based approach achieves dramatic efficiency gains, e.g., over 11× speed-up in object detection (processing time reduced from 172 ms to 15 ms, Table 2) by using the sparse event input and optimized model. Importantly, these latency and data savings come with minimal impact on accuracy: the pruned pose-estimation model, for instance, retains an mAP of 0.94 vs. 0.95 for the full model. The human counting task achieved >95% recall with <2% false alarm rate, and hand gesture recognition achieved 99.19% recall with only 0.09% false positives and ~14 ms inference on a low-end CPU.

Overall, the paper convincingly demonstrates that event-based vision can “reduce the data footprint... enable efficient representations... and unlock significant latency and compute savings” for edge AI applications.

 

Several points that could be improved to strengthen the paper

  • Limited Handling of Static Scenes: A fundamental limitation (acknowledged by the authors) is that event-based cameras excel only with motion. The paper specifically targets “motion-centric perception” and notes that conventional frame cameras are better suited for precise, static tasks like face recognition. Consequently, if a target stops moving, a DVS provides no new information – the system might fail to detect a completely stationary object or person. The work does not extensively discuss how to handle scenarios with long static periods or very slow movements (aside from the timestamp encoding which helps to an extent). In a real surveillance application, this could be an issue – e.g., an intruder who remains motionless might evade detection until they move. This inherent trade-off between motion sensitivity and static detail is not explored in depth. Some discussion or experiments on this limitation (perhaps using a hybrid approach or fallback to a frame sensor) would strengthen the validity of the approach for all conditions.
  • Reproducibility and Data Transparency: While the paper describes the experiments clearly, reproducing the results externally may be difficult. The custom dataset used for training and testing (particularly for human detection in smart-home scenes) is not publicly provided – it is only mentioned as “available on request”. This limits other researchers’ ability to verify the results or compare algorithms on the same benchmark. Similarly, there is no mention of releasing the code or models. Key implementation details (e.g. exact network architectures after pruning, parameter settings for the encoding and training, etc.) are only partially described in the text.
  • Comparative Evaluation: The paper could have stronger comparisons to other approaches. It primarily contrasts the proposed system against a generic “conventional frame-based” baseline in terms of speed or qualitative behavior, but we see little quantitative comparison in accuracy against frame-based models or existing event-based methods. For example, the detection task reports high recall/FAR for the event method, but it’s unclear how a comparable frame-based detector would perform on the same task (perhaps with higher accuracy but much slower runtime – this trade-off is implied but not explicitly shown in a figure or table). Likewise, for pose estimation, the paper shows the pruned event-based model matches the accuracy of an original HRNet, but we don’t know how that compares to a frame-based pose estimation on the same dataset. Since event-based vision is an active field, it would have been good to see comparisons with prior event-based algorithms to underscore what new benefits are brought by the authors’ approach.


Author Response

Summary

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections.

Comments 1: Limited Handling of Static Scenes: A fundamental limitation (acknowledged by the authors) is that event-based cameras excel only with motion. The paper specifically targets “motion-centric perception” and notes that conventional frame cameras are better suited for precise, static tasks like face recognition. Consequently, if a target stops moving, a DVS provides no new information – the system might fail to detect a completely stationary object or person. The work does not extensively discuss how to handle scenarios with long static periods or very slow movements (aside from the timestamp encoding which helps to an extent). In a real surveillance application, this could be an issue – e.g., an intruder who remains motionless might evade detection until they move. This inherent trade-off between motion sensitivity and static detail is not explored in depth. Some discussion or experiments on this limitation (perhaps using a hybrid approach or fallback to a frame sensor) would strengthen the validity of the approach for all conditions.

Response 1: We thank the reviewer for raising this important point. We agree that event-based cameras fundamentally respond to temporal changes, and therefore a perfectly stationary person/object may generate few or no new events. This motion–static trade-off is inherent to DVS sensing and is relevant to surveillance scenarios with long static periods. In the revised manuscript, we added a discussion at the end of the Discussion section to explicitly address this limitation in the context of our target application: home-surveillance occupancy detection. We clarify that our goal is to determine whether a person is present/absent in the home environment, and we describe a practical mitigation strategy: integrating object detection with a lightweight tracking module that operates on detected bounding boxes. With detection–tracking integration, the system can distinguish (i) a person leaving the camera view, (ii) a person entering, and (iii) a person remaining in the scene without motion, based on track continuity and motion cues. This allows the occupancy state to be maintained even when event activity becomes sparse during long static intervals. We also mention that a simple application-layer hold-time policy can further stabilize the occupancy decision during short motion pauses. We believe this added discussion clarifies how the proposed motion-centric event pipeline can be used in a realistic surveillance system that must also handle prolonged stillness, thereby strengthening the validity of the approach for practical deployments.

→ (Section 7) A known limitation of event-based sensing is that perfectly stationary objects may generate few or no new events, which can reduce instantaneous evidence for purely event-driven detection during long static periods. This motion–static trade-off is inherent to DVS sensing and is particularly relevant to surveillance scenarios where an intruder might remain motionless. However, our target application is home-surveillance occupancy detection (person present/absent) rather than fine-grained static recognition. In such a system, robustness to long static periods can be strengthened by combining object detection with a lightweight tracking module [46] operating on the detected bounding boxes. Specifically, once a person is detected, a tracker can maintain and update the target state over time and distinguish among three practically important cases: (i) the target leaves the field of view (track termination near image boundaries or consistent outward motion), (ii) the target enters the field of view (track initialization with inward motion), and (iii) the target remains in the scene with little or no motion (a persistent track with minimal displacement and low event rate). These considerations highlight that, while DVS sensing is intrinsically motion-driven, a practical occupancy-detection system can explicitly handle static intervals through tracking-based state maintenance without sacrificing the low-latency, edge-efficient nature of the proposed technique. As future work, we plan to build and evaluate a complete end-to-end home-surveillance system that integrates the proposed DVS-based detection with a lightweight bounding-box tracking module, enabling more robust state reasoning (enter/leave/static) under long static periods and slow-motion scenarios.

- reference 46: Li, J.; Shi, F.; Liu, W.; Zou, D.; Wang, Q.; Park, P.K.J.; Ryu, H.E. Adaptive temporal pooling for object detection using dynamic vision sensor. In Proceedings of British Machine Vision Conference, London, UK, 4–7 September 2017.
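[Editor's illustration] The tracking-based state maintenance outlined above could take, for example, the following form; the thresholds, field names, and hold-time value are assumptions for the sketch, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class Track:
    last_seen_s: float      # time of the most recent confident detection on this track
    displacement_px: float  # recent mean displacement of the track's bounding box
    near_boundary: bool     # whether the box currently touches the image border

def occupancy_state(tracks, t_now_s, hold_time_s=5.0, static_disp_px=2.0):
    # A persistent track with minimal displacement is treated as "present (static)",
    # covering the case where a person stops moving and event activity becomes sparse.
    # A track that ends near the image boundary is treated as the target leaving the view.
    # Short detection gaps are bridged by the hold-time policy.
    for trk in tracks:
        if t_now_s - trk.last_seen_s <= hold_time_s:
            if trk.displacement_px <= static_disp_px:
                return "present_static"
            return "present_moving"
        if trk.near_boundary:
            return "absent"  # track terminated at the border: target left the field of view
    return "absent"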

Comments 2: Reproducibility and Data Transparency: While the paper describes the experiments clearly, reproducing the results externally may be difficult. The custom dataset used for training and testing (particularly for human detection in smart-home scenes) is not publicly provided – it is only mentioned as “available on request”. This limits other researchers’ ability to verify the results or compare algorithms on the same benchmark. Similarly, there is no mention of releasing the code or models. Key implementation details (e.g. exact network architectures after pruning, parameter settings for the encoding and training, etc.) are only partially described in the text.

Response 2: We thank the reviewer for the important comment regarding reproducibility and data transparency. The human-detection dataset was collected in real residential smart-home environments and contains privacy-sensitive and company-confidential information. It also constitutes third-party intellectual property acquired through a commercial provider. To increase reproducibility, we described in Section 4 the raw data acquisition process in detail, and we also provided detailed specifications of the model architecture, encoding parameters, and training/evaluation settings in the revised manuscript.

→ (Section 4) The proposed algorithm is primarily designed for home surveillance occupancy detection, i.e., determining whether a person is present or absent in residential indoor environments. Accordingly, our experiments focus on static-camera, indoor, human motion–centric scenarios, which reflect the intended deployment conditions (always-on operation, low latency, and edge compute constraints). In this context, for data generalization, we recorded DVS images across 10 home conditions (e.g., different rooms, illumination changes, daily activity patterns, and household layouts). Figure 7 shows representative sample images from the training data set, which were collected to reflect realistic indoor conditions such as cluttered backgrounds, different distances to the sensor, and diverse human motions. The positive human data set includes various ages, heights, genders, and clothing styles with diverse actions such as walking, jumping, duck-walking, crawling, and overlapping. This diversity is important because DVS images often contain sparse edge-like patterns rather than textured appearance; therefore, robust detection requires the model to learn motion-driven body contours that may change significantly with posture, speed, and partial occlusions. In particular, overlapping and crawling cases are challenging for home surveillance because the visible edge structures can be fragmented and the scale/aspect ratio of the human region can vary rapidly. In addition, the negative data set includes typical indoor objects and distractors such as chairs, curtains, TVs, dogs, cats, dolls, fans, and robotic vacuum cleaners. These negatives are intentionally included because many household objects can generate event responses (e.g., moving fan blades, robotic vacuum motion, or pet movement), which may otherwise cause false alarms. By training with such hard-negative examples, the detector can better distinguish true human motion patterns from non-human motion and background dynamics in a practical deployment setting. In total, we utilized 19.8 M DVS images for the training. Specifically, we generated 19.8 M recency frames at 10 Hz from approximately 550 hours of recordings collected across 10 rooms, three days, and 8 participants. In addition, we constructed independent validation and test datasets (2 M DVS images each) collected under different home conditions (e.g., different rooms and/or different time periods) that were not used for training. While the proposed approach is effective for static-camera indoor occupancy detection, its performance may degrade under conditions that violate the deployment assumptions, such as strong camera ego-motion, outdoor scenes with dense background events (e.g., wind-driven foliage, rain, strong illumination flicker), or tasks requiring fine-grained multi-class recognition across many object categories. Our method is therefore best interpreted as an edge-oriented representation and lightweight inference strategy for home surveillance. As a result, the recall accuracy and False Acceptance Rate (FAR) were measured to be >95% and <2%, respectively, indicating that the constructed dataset and training strategy are effective for reliable human detection under realistic smart-home scenarios. Under identical evaluation settings, we observe that the recall std is typically around ~0.5%, and the false alarm rate (FAR) std is typically around ~0.1%. However, these statistical values are condition-dependent and should not be interpreted as universal constants.
In home surveillance, the DVS event stream varies with ambient illumination and subject–sensor distance. As a result, both the mean performance and its dispersion change across environmental buckets. For example, when the illumination is ≥10 lux and the subject is within 5 m, recall is approximately 98%; under the same illumination, recall decreases to approximately 96% when the distance increases to 5–7 m. Under dimmer illumination (5–10 lux), recall is approximately 96% within 5 m, and further drops to approximately 92% at 5–7 m. Importantly, the std also differs across these buckets, reflecting different levels of event sparsity and edge contrast under varying illumination and range.

• (Section 3) To validate the proposed technique, we performed the action recognition task using a human activity data set and a DVS event simulator [17]. The public NTU RGB+D 120 human activity dataset is a large-scale benchmark containing 120 action classes and 114,480 samples captured from 106 subjects [18]. The dataset provides synchronized multi-modal streams including RGB, depth, infrared (IR), and 3D skeletons (25 body joints) recorded with three Microsoft Kinect v2 cameras. On NTU RGB+D 120, Table 1 reports a representation-level ablation in which the downstream network and training protocol are fixed and only the event-to-image encoding is changed. The proposed polarity-agnostic recency image improves top-1 accuracy from 89.6% (temporal accumulation) to 90.8% (+1.2 percentage points), indicating that recency-aware edge encoding provides more stable cues than pure event counts without altering the downstream architecture. The improvement indicates that the timestamp-based encoding provides a more informative and robust representation than temporal accumulation, especially when the motion level is low or intermittent. This is meaningful for real edge environments because the input stream is often sparse and non-stationary, and a stable representation can directly improve downstream recognition reliability without increasing the data volume. For context, recent NTU120 action-recognition literature (2023–2025) typically reports accuracies around 90–92% for strong skeleton-based methods, while multimodal approaches may reach into the low-to-mid 90% range with additional modalities and higher compute [19–24]. Because our Table 1 is designed to isolate the impact of encoding under a fixed downstream network (and uses event-simulated inputs), these numbers serve as a contextual benchmark level rather than a direct SOTA comparison.

- reference 19: Lee, J.; Lee, M.; Cho, S.; Woo, S.; Jang, S.; Lee, S. Leveraging spatio-temporal dependency for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.

- reference 20: Zhou, Y.; Yan, X.; Cheng, Z.-Q.; Yan, Y.; Dai, Q.; Hua, X.-S. BlockGCN: redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.

- reference 21: Reilly, D.; Das, S. Just add π! pose induced video transformers for understanding activities of daily living. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.

- reference 22: Yang, Y. Skeleton-based action recognition with non-linear dependency modeling and hilbert-schmidt independence criterion. arXiv 2024, arXiv:2412.18780.

- reference 23: Cheng, Q.; Cheng, J.; Liu, Z.; Ren, Z.; Liu, J. A dense-sparse complementary network for human action recognition based on RGB and skeleton modalities. Expert Systems with Applications 2024, 244, 123061.

- reference 24: Abdelkawy, A.; Ali, A.; Farag, A. EPAM-Net: an efficient pose-driven attention-guided multimodal network for video action recognition. Neurocomputing 2025, 633, 129781.

Comments 3: Comparative Evaluation: The paper could have stronger comparisons to other approaches. It primarily contrasts the proposed system against a generic “conventional frame-based” baseline in terms of speed or qualitative behavior, but we see little quantitative comparison in accuracy against frame-based models or existing event-based methods. For example, the detection task reports high recall/FAR for the event method, but it’s unclear how a comparable frame-based detector would perform on the same task (perhaps with higher accuracy but much slower runtime – this trade-off is implied but not explicitly shown in a figure or table). Likewise, for pose estimation, the paper shows the pruned event-based model matches the accuracy of an original HRNet, but we don’t know how that compares to a frame-based pose estimation on the same dataset. Since event-based vision is an active field, it would have been good to see comparisons with prior event-based algorithms to underscore what new benefits are brought by the authors’ approach.

Response 3: We thank the reviewer for the valuable suggestion. We agree that the initial submission could have presented stronger and more explicit comparisons to both frame-based baselines and prior event-based approaches. In the revised manuscript, we make the comparison to a frame-based detector explicit in Figure 5, where we provide a quantitative analysis using the RPN proposal count as an objective indicator of computational burden in proposal-based detectors. Specifically, Figure 5(a) visualizes the frame-based Faster R-CNN baseline using the standard top-300 RPN proposals per image, while Figure 5(b) uses our DVS-specific detector and shows that the average retained proposal count is approximately 9, i.e., a reduction of roughly 30×. Since one key motivation for using DVS in our system is lightweight edge deployment, this proposal-count reduction provides a clear quantitative explanation of the efficiency benefit, complementing the detection accuracy metrics reported in Section 4. We also strengthened comparisons to representative recent event-based approaches in both Section 4 (object detection) and Section 5 (posture-related evaluation). In these sections, we include benchmark-style comparisons and discussions of modern event-based frameworks (2023–2025) to contextualize our results, and we explicitly position our contribution as an edge-oriented representation and deployment approach.

• (Figure 5 in Section 4) Object detection is required for home surveillance. Here, we employed DenseNet and Darknet architectures as feature extractors for human detection. The Faster R-CNN (FRCNN) structure provides the probabilistic locations (region proposals) of humans. Figure 5 shows the region-proposal results based on CIS and DVS images. The CIS image requires a large amount of computation because it yields many proposals driven by background information, whereas the DVS image contains only moving foreground objects, which in turn reduces the computational cost dramatically [27]. In FRCNN [28], the Region Proposal Network (RPN) generates candidate regions, and the downstream RoI classification/regression cost scales with the number of retained proposals. Following the standard Faster R-CNN setting reported in the original paper (i.e., using N = 300 proposals per image while maintaining strong detection accuracy), we visualize the top-300 RPN proposals for the CIS-based baseline in Figure 5a. Figure 5b uses the DVS-specific detector described in the latter part of Section 4 and shows that the average retained proposal count is approximately 9, i.e., a reduction of roughly 30×. Since the primary motivation for using DVS in this work is edge deployment and model lightweighting, the proposal-count reduction provides a clear quantitative explanation of why DVS is beneficial in our system.

- reference 28: Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017, 39, 1137–1149.
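
For illustration, the proposal-count argument above can be sketched in a few lines. This is a minimal, hypothetical cost model (not the authors' implementation): it only assumes that the RoI classification/regression head runs once per retained proposal, so its cost scales roughly linearly with the proposal count; the per-RoI cost constant is a placeholder.

```python
# Hypothetical cost model: RoI-head cost scales linearly with retained proposals.
def roi_head_cost(num_proposals: int, cost_per_roi_mflops: float) -> float:
    """Approximate RoI classification/regression cost (arbitrary MFLOPs units)."""
    return num_proposals * cost_per_roi_mflops

COST_PER_ROI = 1.0  # placeholder per-RoI cost; depends on the actual head design

cis_cost = roi_head_cost(300, COST_PER_ROI)  # standard top-300 RPN proposals (CIS)
dvs_cost = roi_head_cost(9, COST_PER_ROI)    # ~9 proposals observed on DVS input

print(f"CIS RoI-head cost (relative): {cis_cost:.0f}")
print(f"DVS RoI-head cost (relative): {dvs_cost:.0f}")
print(f"Reduction factor            : {cis_cost / dvs_cost:.1f}x")  # ~33x
```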

• (Section 3) To validate the proposed technique, we performed the action recognition task using a human activity data set and a DVS event simulator [17]. The public NTU RGB+D 120 human activity dataset is a large-scale benchmark containing 120 action classes and 114,480 samples captured from 106 subjects [18]. The dataset provides synchronized multi-modal streams including RGB, depth, infrared (IR), and 3D skeletons (25 body joints) recorded with three Microsoft Kinect v2 cameras. On NTU RGB+D 120, Table 1 reports a representation-level ablation in which the downstream network and training protocol are fixed and only the event-to-image encoding is changed. The proposed polarity-agnostic recency image improves top-1 accuracy from 89.6% (temporal accumulation) to 90.8% (+1.2 percentage points), indicating that recency-aware edge encoding provides more stable cues than pure event counts without altering the downstream architecture. The improvement indicates that the timestamp-based encoding provides a more informative and robust representation than temporal accumulation, especially when the motion level is low or intermittent. This is meaningful for real edge environments because the input stream is often sparse and non-stationary, and a stable representation can directly improve downstream recognition reliability without increasing the data volume. For context, recent NTU120 action-recognition literature (2023–2025) typically reports accuracies around 90–92% for strong skeleton-based methods, while multimodal approaches may reach into the low-to-mid 90% range with additional modalities and higher compute [19–24]. Because our Table 1 is designed to isolate the impact of encoding under a fixed downstream network (and uses event-simulated inputs), these numbers serve as a contextual benchmark level rather than a direct SOTA comparison.

- reference 19: Lee, J.; Lee, M.; Cho, S.; Woo, S.; Jang, S.; Lee, S. Leveraging spatio-temporal dependency for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023.

- reference 20: Zhou, Y.; Yan, X.; Cheng, Z.-Q.; Yan, Y.; Dai, Q.; Hua, X.-S. BlockGCN: redefine topology awareness for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024.

- reference 21: Reilly, D.; Das, S. Just add π! pose induced video transformers for understanding activities of daily living. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.

- reference 22: Yang, Y. Skeleton-based action recognition with non-linear dependency modeling and hilbert-schmidt independence criterion. arXiv 2024, arXiv:2412.18780.

- reference 23: Cheng, Q.; Cheng, J.; Liu, Z.; Ren, Z.; Liu, J. A dense-sparse complementary network for human action recognition based on RGB and skeleton modalities. Expert Systems with Applications 2024, 244, 123061.

- reference 24: Abdelkawy, A.; Ali, A.; Farag, A. EPAM-Net: an efficient pose-driven attention-guided multimodal network for video action recognition. Neurocomputing 2025, 633, 129781.

• (Section 4) To contextualize compute and accuracy on modern event-based detection benchmarks, we reference RVT [31] and its recent sparse-attention variant SAST [32], which are representative state-of-the-art event-based detectors on Prophesee automotive datasets (e.g., Gen1 and 1Mp). In these benchmarks, the event sensor is mounted on a moving vehicle, so the input stream contains strong ego-motion and dense background events (roads, buildings, trees, signs), making foreground separation substantially more challenging than static-camera indoor monitoring. SAST further reports backbone compute in terms of FLOPs and "A-FLOPs" (attention-related FLOPs) averaged over test samples, showing that RVT and SAST typically operate in the 0.8–2.2G A-FLOPs regime (and higher when counting full backbone FLOPs), reflecting the need for large transformer-style backbones to achieve high mAP under automotive ego-motion. The much smaller compute of our method (81 MFLOPs) reflects a different problem setting and design objective. First, our target scenario assumes a static DVS in indoor monitoring, where the background is largely suppressed and the stream is dominated by edges induced by moving subjects; therefore, the detector primarily needs to recognize the presence of edge shapes (e.g., person vs. non-person, or presence counting) rather than solve full-scene, ego-motion-compensated multi-object detection. Second, we intentionally design an ultra-lightweight backbone (e.g., aggressive channel reduction, removing layers, and using larger early strides) to minimize edge compute, which directly reduces FLOPs by an order of magnitude compared to SOTA transformer backbones. Third, in our static-camera setting the DVS naturally suppresses background, which reduces the number of candidate regions/proposals and contributes to runtime reduction beyond what FLOPs alone predicts (we observe large end-to-end latency reductions together with compute reductions). In contrast, RVT/SAST are designed for automotive ego-motion benchmarks and aim at high mAP under dense background events, which requires substantially larger capacity and attention computation even after sparsity optimization.

- reference 31: Gehrig, M.; Scaramuzza, D. Recurrent vision transformers for object detection with event cameras. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023.

- reference 32: Peng, Y.; Li, H.; Zhang, Y.; Sun, X.; Wu, F. Scene adaptive sparse transformer for event-based object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024.
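
To make the FLOPs argument above concrete, the sketch below computes multiply-accumulate counts for a toy convolutional stack using the standard per-layer formula MACs = H_out × W_out × C_in × C_out × k², showing how aggressive channel reduction and larger early strides drive the total toward tens of MFLOPs. The layer list, input resolution, and single-channel input are illustrative assumptions, not the authors' actual backbone.

```python
# Toy FLOPs estimate for a reduced convolutional backbone (illustrative only).
def conv_macs(h, w, c_in, c_out, k=3, stride=1):
    h_out, w_out = h // stride, w // stride
    return h_out * w_out * c_in * c_out * k * k, (h_out, w_out)

def backbone_macs(h, w, layers, c_in=1):  # single-channel DVS recency image (assumption)
    total = 0
    for c_out, stride in layers:
        macs, (h, w) = conv_macs(h, w, c_in, c_out, stride=stride)
        total += macs
        c_in = c_out
    return total

# Hypothetical "reduced" stack: few channels, aggressive early striding.
reduced_stack = [(8, 4), (16, 2), (32, 2), (32, 2)]  # (out_channels, stride) per layer
macs = backbone_macs(480, 640, reduced_stack)
print(f"~{macs / 1e6:.0f} MMACs (~{2 * macs / 1e6:.0f} MFLOPs) for the toy stack")
```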

• (Section 5) Recent frame-based 2D human pose estimation commonly adopts HRNet-W48 as a strong baseline, and performance improvements are often achieved by either increasing model capacity or using multiple networks during training/inference. In particular, DE-HRNet [37] enhances HRNet-style high-resolution features by introducing detail-enhancement components and reports a modest gain on COCO test-dev (384×288): compared with HRNet-W48 (AP 0.755 with 63.6M parameters), DE-HRNet-W48 reports AP 0.757 while increasing the backbone size and compute to 74.8M parameters (which corresponds to an FP16 checkpoint size increase from roughly 127 MB to 150 MB). In a different direction, Boosting Semi-Supervised 2D HPE [38] improves accuracy primarily via stronger semi-supervised training and, for higher performance, adopts a dual-network setting (i.e., two identical yet independent networks). In their COCO test-dev comparison, the dual-network configuration effectively doubles model capacity relative to a single HRNet-W48 backbone and achieves improved accuracy (e.g., AP 0.772 in the dual setting). In contrast, our work targets event-based edge deployments (e.g., static-camera indoor monitoring) and focuses on improving the event-to-image representation (polarity-agnostic recency encoding) so that competitive accuracy and real-time operation can be achieved without scaling up the downstream network. In other words, while these benchmarks exemplify the common strategy of improving COCO pose accuracy by increasing backbone capacity or the number of networks, our approach emphasizes representation-level efficiency tailored to DVS streams and edge constraints.

- reference 37: Liu, Y.; Zhou, G.; He, W.; Zhu, H.; Cui, Y. DE-HRNet: detail enhanced high-resolution network for human pose estimation. PLoS One 2025, 20, 0325540.

- reference 38: Zhou, H.; Luo, M.; Jiang, F.; Ding, Y.; Lu, H.; Jia, K. Boosting semi-supervised 2D human pose estimation by revisiting data augmentation and consistency training. arXiv 2024, arXiv:2402.11566.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

the submitted manuscript addresses a timely and relevant problem in the field of event-based perception and edge-oriented intelligent systems. The work stands out for its clearly articulated motivation, well-structured presentation, and its attempt to propose a comprehensive end-to-end framework that encompasses both the representation of event-based data and their effective use in realistic tasks such as object detection, human pose estimation, and gesture recognition. I am particularly impressed by the practical orientation of the proposed approach and by the demonstrated benefits in terms of computational complexity, latency, and scalability, which, in my view, make the study interesting and potentially useful for a broad readership. At the same time, although the core idea and the proposed methodology are conceptually convincing, I find that in several places the presentation could be substantially strengthened. Some of the key concepts on which the argumentation relies (for example, those related to “good-enough perception”) remain insufficiently formalized. In addition, the description of the experimental protocols and the configurations of the employed models is not always sufficiently detailed, which makes full reproducibility and precise interpretation of the reported results more difficult. Some of the general conclusions, particularly those related to energy efficiency and robustness of the approach, would be more convincing if they were supported by additional quantitative analyses or by more clearly defined experimental scenarios.

Although, overall, the manuscript leaves the impression of a well-executed piece of work, I unfortunately also identify a number of non-negligible shortcomings. I therefore have several comments and questions that I would like to discuss; these are described in detail in the attached file.

Comments for author File: Comments.pdf

Author Response

Summary

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections.

Reviewer’s comment. I read with interest your paper devoted to the use of event-based machine vision for edge AI applications, and I appreciate your efforts to present a comprehensive, practically oriented framework that combines a Dynamic Vision Sensor, an appropriate representation of event data, and optimized neural networks for real-world scenarios with constraints in terms of latency, energy, and resources. Your work clearly demonstrates engineering maturity and a coherent concept of “pragmatic perception” aimed at always-on edge systems. In this context, the proposed timestamp-based representation of event data and the systematic analysis of gains in terms of computational complexity and data volume are particularly impressive. In my view, among the strong points of the paper are the clear formulation of its application driven focus, the convincing demonstration of an end-to-end pipeline for several different tasks (detection, pose estimation, and gesture recognition), as well as the original engineering contribution related to recency-based event-to-image encoding and the adaptation of architectural and numerical optimizations to the specific characteristics of DVS input. I believe that this approach represents a meaningful and well-argued extension of existing ideas in the field, with the novelty being primarily systemic and practical in nature. With regard to the individual review criteria, I can note that the introduction and abstract clearly and consistently formulate the goal of the study and the scope of the paper. The significance of the proposed work relative to prior research is well outlined through comparison with classical frame-based approaches and through an emphasis on the systemic advantages of event-based sensing. The study design and the methods employed are appropriate to the stated objectives and logically aligned with the constraints of the edge environment, while the text is clear, well structured, and easy to follow. The analysis and argumentation are consistent and generally well balanced, with the authors openly discussing the trade-offs between accuracy and efficiency. I find the cited literature to be relevant and well selected with respect to the topic of the study. I did not identify indications of plagiarism, as the text and results exhibit a distinct style and a clear authorial presence. Overall, the paper is well prepared, but it could become more readable, balanced, and better substantiated with some additional improvements and revisions. In the list below, I present my recommendations and questions.

  • Thank you very much for your kind and encouraging comments. We sincerely appreciate your careful reading and constructive recommendations, which we believe will significantly improve the clarity and technical strength of the manuscript. We have revised the paper accordingly to address your questions and to better highlight the key contributions and strengths of our work.

Comments 1: “Good-enough perception” – do you have any unified criterion or definition for this notion? Yes, indeed, you report metrics such as recall and mAP, but you do not provide an explicit definition of the above concept.

Response 1: Thank you for pointing out that the notion of “good-enough perception” was not explicitly defined. We agree that the term may appear ambiguous without an application-driven criterion. In practical occupancy-aware smart-home surveillance, there is no universally established target metric that defines “good-enough” perception, and even consensus-based standardized evaluation protocols for occupancy sensors are limited. To make this explicit, we revised the manuscript to define “good-enough perception” relative to the de-facto baseline used in current residential automation, i.e., Passive-Infra-Red (PIR)-based motion/occupancy sensors commonly deployed for entrance lighting and presence-triggered services. Long-term residential field testing of commercial occupancy-presence sensing reports 83.8% overall accuracy and 12.8% false-positive rate (FPR), and highlights failure modes under long static periods, reinforcing why “good-enough” must be defined in deployment context rather than a single universal threshold. Based on this context, we define “good-enough perception” as achieving detection reliability that is at least comparable to and preferably better than PIR-based baselines, while meeting strict edge constraints (latency/compute/memory). We added the above definition and supporting references in the revised manuscript.

  • (Section 4) To benchmark the occupancy-detection capability claimed in this paper, we use the performance of conventional Passive-Infra-Red-based (PIR-based) motion/occupancy sensors as a practical baseline for comparison. Importantly, PIR reliability is highly dependent on installation and environmental conditions; for example, reported presence-detection accuracy can be as low as ~60% under typical ceiling placement and can improve to ~84% under more favorable placement, highlighting substantial variability in real deployments [33]. Moreover, long-term field testing in a single-family home reports an overall accuracy of 83.8% with a 12.8% false-positive rate (FPR) for commercial occupancy-presence sensing, and explicitly identifies failure modes during prolonged static periods (e.g., sleep) [34]. Against this background, the high-recall and low-false-alarm operating points reported in this work (while satisfying strict edge compute/latency constraints) indicate that the proposed approach is a meaningful step beyond PIR-based baselines for occupancy-aware smart-home services.

- reference 33: Azizi, S.; Rabiee, R.; Nair, G.; Olofsson, T. Effects of positioning of multi-sensor devices on occupancy and indoor environmental monitoring in single-occupant offices. Energies 2021, 14(19), 6296.

- reference 34: Pang, Z.; Guo, M.; O’Neill, Z.; Smith-Cortez, B.; Yang, Z.; Liu, M.; Dong, B. Long-term field testing of the accuracy and HVAC energy savings potential of occupancy presence sensors in a single-family home. Energy and Buildings 2025, 328, 115161.

Comments 2: The text repeatedly emphasizes that DVS data are quantized/sparse, which leads to a lower system load (storage/throughput), and that mixed-bit quantization is also used in the networks (binary first layer, 8-bit for the remaining layers). This is excellent. However, the paper does not demonstrate that edge tasks necessarily require low resolution, low-bit, or non-color sensors; rather, this is presented more as an engineering-motivation-type claim. Personally, and likely from the readers’ perspective as well, I find myself wondering what the trade-off would be if a color and/or higher resolution sensor were used in combination with appropriate optimizations.

Response 2: Thank you for this comment. We have clarified that we do not claim edge tasks universally require low resolution or non-color sensing. Instead, for living-room occupancy sensing, privacy and user acceptance are key requirements: prior work on video-based in-home monitoring reports limited acceptance of RGB camera deployments, with privacy/intrusiveness being dominant barriers, especially for intimate situations in private spaces. We therefore motivate the use of a DVS as a privacy-aware sensing modality in addition to its efficiency: DVS produces sparse, motion-triggered edge activity (without color/texture details), and event-based sensing has been discussed as a viable direction for privacy-preserving surveillance because it mainly encodes moving boundaries while discarding redundant appearance information.

  • (Section 4) Privacy and user acceptance are primary constraints in in-home sensing. Prior studies on video-based in-home/assisted-living monitoring report that the acceptance of conventional RGB cameras can be limited, with privacy concerns and perceived intrusiveness being major barriers particularly for intimate situations that may occur in private spaces [25]. Accordingly, we employ a DVS not only for computational efficiency but also as a privacy-aware sensing modality. Event representations have been discussed as a viable direction for privacy-preserving surveillance because they mainly encode moving boundaries of the subject while discarding much of the redundant visual content [26].

- reference 25: Mujirishvili, T.; Maidhof, C.; Florez-Revuelta, F.; Ziefle, M.; Richart-Martinez, M.; Cabrero-García, J. Acceptance and privacy perceptions toward video-based active and assisted living technologies: scoping review. Journal of Medical Internet Research 2023, 25, 45297.

- reference 26: Ahmad, S.; Scarpellini, G.; Morerio, P.; Del Bue, A. Event-driven re-Id: a new benchmark and method towards privacy-preserving person re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, Waikoloa, HI, USA, 4–8 January 2022.

Comments 3: There is an important weakness that you yourselves acknowledge indirectly: during periods of little motion, simple accumulation of events yields ambiguous images—hence your proposal of timestamp-based encoding. Could you include a few sentences discussing other limitations of DVS?

Response 3: We think that, beyond the low-motion ambiguity addressed by our timestamp/recency encoding, DVS sensors have additional limitations:

- Sensitivity to illumination and high-frequency flicker: event generation depends on local contrast changes; challenging lighting (low lux), flickering sources (e.g., LED/fluorescent flicker), and rapid illumination variations can increase noise-like events or distort activity patterns.

- Hardware non-idealities: per-pixel threshold mismatch, background activity (leak events), and temporal noise can vary across sensors and operating conditions, requiring careful thresholding and filtering.

Comments 4: At least in the version of the manuscript I have seen, there are no actual measurements of energy or power consumption (e.g., W, mJ per inference), yet you claim “energy-efficient.” Could you provide some clarification for the readers on this point?

Response 4: We claim energy efficiency in the sense of reduced compute and memory access on the computing device (and hence an expected reduction in energy consumption), not on the basis of calibrated power measurements such as W or mJ per inference.

Comments 5: In Section 2, Event-Based Sensing, you present a 24-hour experiment comparing the data volume generated by DVS and CIS at the same spatial resolution, illustrating the potential of event-based sensing for long-term monitoring. Although this scenario is intuitively motivated and relevant to smart home applications, in my opinion its description remains too general. More specifically, information is missing regarding the frequency and intensity of motion, the diversity of scenes and lighting conditions, and the presence of other dynamic sources (e.g., household appliances or pets). Could you clarify whether the reported results are based on a single recording or represent an average over multiple experiments?

Response 5: We thank the reviewer for the helpful request for clarification. The 24-hour experiment reported in Section 2 was based on a single continuous 24-hour recording acquired under a representative smart-home setting at the same spatial resolution for DVS and CIS. Importantly, although the figure in Section 2 visualizes one 24-hour trace for clarity, we performed additional long-duration recordings under diverse real-home conditions as described in Section 3, including scenes containing household appliances (e.g., TV/robot vacuum/fans) and pets, as well as varying indoor lighting conditions. Across these additional recordings, we observed consistent trends and obtained similar data-volume characteristics on average. For brevity and to keep the paper focused, we presented one representative 24-hour example in Section 2, while Section 3 summarizes the broader recording conditions and confirms that the overall observations generalize across typical smart-home dynamics.

Comments 6: In the same section (Section 2), it is stated that even after compression the volume of DVS data remains approximately five times smaller than that of CIS, and important system-level conclusions related to reduced CAPEX and OPEX are drawn on this basis. Please clarify the compression method used. Is the compression lossless or lossy, what parameters were applied, and whether the same algorithm and settings were used for both CIS and DVS data.

Response 6: CIS produces dense color frames, for which H.264/AVC is a de-facto standard video codec widely used in practical camera and surveillance pipelines. H.264 typically operates in a lossy mode (with inter/intra prediction and transform coding) to achieve high compression efficiency for natural video content. In our setup, CIS frames were encoded using a standard H.264 configuration under a fixed quality setting (constant-quality / rate-control configuration kept identical across recordings) so that the resulting file size reflects a realistic compressed-stream workload. Our DVS images used for the storage comparison are non-color and low-bit (bi-level/binary) representations dominated by sparse edge activity. For such bi-level images, JBIG is a standard bit-preserving (lossless) compression method designed specifically for binary image planes. Therefore, DVS images were compressed using lossless JBIG, again with a fixed encoder configuration. We think that using the same codec and settings for CIS and DVS is generally not a fair or meaningful comparison because (i) H.264 is optimized for dense natural video with temporal prediction, while (ii) JBIG is optimized for bi-level images and exploits sparse binary structure. Our goal in Fig. 3 was to quantify system-relevant compressed storage/throughput under realistic pipelines for each modality: H.264 for color CIS video, and lossless JBIG for bi-level DVS images. Importantly, even after compression under these standard choices, DVS remains substantially smaller (≈5× in our reported example), supporting the system-level conclusion that event-driven sensing reduces storage and bandwidth demands for long-term monitoring.

  • (Figure 3 Caption) Compressed data volume comparison between CIS and DVS for long-term monitoring (24 h). CIS frames (dense color video) were compressed using the standard H.264/AVC video codec (typically lossy), reflecting realistic surveillance video storage [10]. DVS outputs were converted to non-color, low-bit (bi-level/binary) event images and compressed using JBIG (lossless bi-level image compression) with a fixed encoder configuration [11]. We use modality-appropriate standard codecs because CIS and DVS data have fundamentally different structures (dense color video vs. sparse binary edge activity). Even after compression, the DVS stream remains approximately 5× smaller than the CIS stream in this representative 24-hour recording.

- reference 10: Information Technology—Coding of Audio-Visual Objects—Part 10: Advanced Video Coding. Available online: https://www.iso.org/obp/ui/#iso:std:iso-iec:14496:-10:ed-9:v1:en (accessed on 14 January 2026).

- reference 11: Progressive Bi-level Image Compression. Available online: https://www.itu.int/rec/T-REC-T.82-199303-I/en (accessed on 14 January 2026).

Comments 7: In Section 3, Timestamp-Based Image Generation, you introduce the scaling parameters Ts and Imax, which determine, respectively, the temporal sensitivity and the intensity amplitude of the proposed timestamp-based representation. However, I did not find information in the text regarding the specific values of these parameters, how they were selected, or any analysis of the sensitivity of the results to their tuning. Could you also clarify whether the same parameters are used across different tasks and scenarios, or whether they are adapted to the specific application?

Response 7: Thank you for pointing out that the manuscript did not explicitly report the parameter values and their selection rationale. In the revised manuscript, we clarify both parameters and their roles. In our timestamp-based representation, Ts controls the temporal sensitivity (recency window) and Imax sets the maximum intensity amplitude of the generated image. We set Imax = 255 to match the 8-bit grayscale representation used as the downstream network input. We set Ts = 100 ms based on a practical, deployment-driven criterion: for a typical indoor distance of ~5 m and a normal human walking speed, Ts = 100 ms provides the most visually faithful and stable depiction of motion-induced edges in the generated images, which is beneficial for the downstream detection/recognition tasks. We use the same Ts and Imax across tasks in this study to avoid per-scenario tuning and to maintain reproducibility.

  • (Figure 4 Caption) The proposed polarity-agnostic timestamp-based image generation technique. Each pixel maintains the most recent event timestamp, and the timestamp difference T is mapped to an intensity I(T) using scaling parameters Ts and Imax, so that recently active edges become stronger while stale pixels fade. In this figure, we use Ts = 100 ms (selected to best represent the event image under a typical indoor ~5 m setup and normal human walking speed) and Imax = 255 for 8-bit grayscale intensity scaling.
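
As a concrete illustration of the caption above, the following minimal sketch maps a per-pixel last-event timestamp map to an 8-bit recency image with Ts = 100 ms and Imax = 255, assuming the exponential mapping I(T) = Imax·exp(−T/Ts) discussed in Section 3 and in Response 11 below; the array shapes and helper names are illustrative, not the authors' implementation.

```python
import numpy as np

TS_SEC = 0.100  # temporal sensitivity Ts = 100 ms
I_MAX = 255.0   # 8-bit grayscale amplitude Imax

def recency_image(last_event_ts: np.ndarray, sample_time: float) -> np.ndarray:
    """Map per-pixel last-event timestamps (s) to an 8-bit recency image."""
    dt = sample_time - last_event_ts           # timestamp difference T >= 0
    intensity = I_MAX * np.exp(-dt / TS_SEC)   # recent activity bright, stale pixels fade
    return np.clip(intensity, 0, 255).astype(np.uint8)

# Toy 4x4 sensor: one pixel fired 10 ms ago, another 300 ms ago, the rest never (-inf).
ts_map = np.full((4, 4), -np.inf)
ts_map[1, 1] = 0.990
ts_map[2, 3] = 0.700
print(recency_image(ts_map, sample_time=1.0))  # bright at (1,1), faint at (2,3), 0 elsewhere
```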

Comments 8: This is the moment to express my admiration for your work. Congratulations on not limiting yourselves to standard “temporal accumulation,” but instead introducing a clearly defined, formal encoding that satisfies condition (1). Sincere congratulations, because this is a concrete mathematical construction rather than merely a conceptual idea. It encodes the temporal recency of events, not just their count, and I am genuinely impressed. Well done!

Response 8: We sincerely thank the reviewer for the encouraging and thoughtful feedback.

Comments 9: How do you determine the sampling time for image generation? In the text you state “at a chosen sampling time,” but how is this moment actually selected?

Response 9: Thank you for the question. In our implementation, the chosen sampling time refers to the image-generation timestamp t_k at which we render a recency image from the incoming event stream. We select t_k using a fixed-rate sampling schedule to ensure deterministic latency and reproducibility: the encoder generates one image every Δt seconds, i.e., t_k = t_0 + kΔt, where Δt is the frame period of the downstream network (e.g., Δt = 100 ms, corresponding to 10 fps). The same Δt is used across experiments unless explicitly stated.
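
A minimal sketch of this fixed-rate schedule is shown below: frames are emitted at t_k = t_0 + kΔt regardless of event density; `render_recency_image` is an illustrative stand-in for the encoder (e.g., the sketch after the Figure 4 caption), not the actual pipeline.

```python
def sample_frames(events, t0, dt, render_recency_image):
    """Emit one recency frame every dt seconds (deterministic latency).

    `events` is an iterable of (timestamp_sec, x, y) tuples in time order.
    """
    next_sample = t0 + dt
    pending = []  # events accumulated so far (a real encoder would prune stale ones)
    for t, x, y in events:
        while t >= next_sample:  # crossed one or more sampling boundaries
            yield next_sample, render_recency_image(pending, next_sample)
            next_sample += dt
        pending.append((t, x, y))
```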

Comments 10: The proposed timestamp-based representation introduces the parameter Ts, which controls the temporal sensitivity of the intensity encoding and thus can potentially affect the robustness of the method to noise and spurious events in the event stream. From my perspective, and presumably also from that of future readers, a discussion of the trade-off between temporal resolution and noise robustness is missing, as well as an analysis of the behavior of the proposed encoding in the presence of noisy or sporadic events.

Response 10: Thank you for this important comment. We agree that the parameter Ts, which controls the temporal sensitivity of the timestamp/recency encoding, also affects robustness to noise and sporadic events in the event stream. In general, as Ts increases, motion-induced edges become more clearly connected because recent activity persists longer; however, the same persistence also increases afterimage/ghosting and can make the representation more susceptible to background activity and spurious events, since noise-like events remain visible for a longer duration. Conversely, a smaller Ts reduces persistence (thus suppressing isolated spurious events more quickly) but may fragment edges for slower motions. In our experiments, we selected Ts based on qualitative validation during system development. For fast applications such as gesture/action recognition, where motion dynamics are rapid, a shorter temporal sensitivity (e.g., around Ts ≈ 20 ms) yields a sharper, less persistent representation. For human detection/occupancy sensing, typical human walking speed is substantially slower than hand motion; we found that Ts = 100 ms provided the best practical balance and the strongest performance in our indoor setting by producing stable edge shapes while maintaining acceptable robustness. We acknowledge that we did not include a dedicated quantitative ablation study of Ts in the current manuscript; rather, Ts was chosen through qualitative experimentation to best match the motion characteristics of the target application.

Comments 11: Again with respect to the same section (Section 3): the exponential function is introduced directly in equation (1), but the choice of this function is not justified. What is the motivation for selecting it—biological inspiration, optimality considerations, or empirical selection? It might also be useful to include comparisons with other mapping functions (linear, power-law, piecewise, etc.).

Response 11: Thank you for the constructive comment. Our intention was not to claim that the exponential is the only valid choice, but to adopt a standard, interpretable, and implementation-friendly recency mapping. First, exponentially decaying time-surface/recency-surface representations can be used in event-based vision to encode temporal recency, including hierarchical time-surface methods and related time-surface formulations [13]. From a modeling perspective, the exponential decay corresponds to a simple first-order leaky integrator with a single time constant Ts, providing a smooth, monotonic fade that avoids discontinuities and offers a clear physical interpretation of temporal sensitivity. We also acknowledge that other monotonic mappings (linear, power-law, piecewise functions, etc.) are possible. In our work, the exponential mapping was selected primarily for its interpretability and empirical stability in representing motion-induced edges.

- reference 13: Mueggler, E.; Bartolozzi, C.; Scaramuzza, D. Fast event-based corner detection. In Proceedings of British Machine Vision Conference, London, UK, 4–7 September 2017.
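
For completeness, the leaky-integrator reading mentioned above can be written out explicitly; this is our own shorthand under the assumption that Eq. (1) takes the exponential form with time constant Ts, not a quotation of the manuscript's equation.

```latex
% Per-pixel intensity as a first-order leaky integrator that is reset to I_max
% at the most recent event and then decays freely:
%   dI/dt = -I / T_s,   I(0) = I_max   (at the last event)
% so that, after an elapsed time T = t_s - t_last,
\[
  I(T) = I_{\max}\, e^{-T / T_s}.
\]
```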

Comments 12: In Section 4, Object Detection, you state that the DenseNet and Darknet architectures were used for feature extraction, but I could not find any motivation for this choice. Could you specify which particular variants or configurations of these architectures were employed? Did you consider alternative backbone networks that are more commonly used in edge scenarios (e.g., MobileNet or other lightweight CNN architectures)?

Response 12: Thank you for the comment. Our paper does not claim novelty in the backbone itself; rather, the contribution is an edge-oriented event representation and a practical, compressed detection procedure. We selected DenseNet/Darknet-style convolutional blocks as a practical starting point for the event-based detector because they are composed of regular convolutional layers that are particularly amenable to aggressive channel/layer reduction and mixed-bit quantization. We agree that MobileNet-family backbones are widely used in edge scenarios. Our approach is backbone-agnostic: the proposed edge-oriented optimizations can be applied to MobileNet backbones as well (e.g., MobileNet as the feature extractor within a detector head).

Comments 13: You present examples (Figure 5) and claim that CIS produces many proposals due to background clutter, but it is not specified whether the pipeline and parameters are strictly comparable. Is the CIS vs. DVS comparison for region proposals fair (i.e., are the settings and conditions identical, how is the input normalized, and is the same detector used)? This may well be the case, but please clarify it explicitly.

Response 13: Thank you for the comment. We clarify that Figure 5 is intended as an illustrative example to explain the qualitative difference in region-proposal behavior between CIS and DVS inputs. In this figure, the capture conditions for CIS and DVS images are not identical, but importantly, the same detector basis (same RPN-based proposal mechanism) is used for both inputs. In Faster R-CNN detectors, the RPN generates on the order of hundreds of candidate regions (commonly ~300 proposals, many of which are driven by background texture/clutter). In contrast, the DVS input inherently suppresses static background appearance and mainly contains motion-induced edge activity; therefore, the RPN produces far fewer proposals. In the example shown in Figure 5, the DVS-based input yields approximately 9 proposals, illustrating how background removal in event-based imagery can substantially reduce unnecessary proposal processing and thereby support edge-efficient deployment.

  • (Section 4) The Faster R-CNN (FRCNN) structure provides the probabilistic locations (region proposals) of humans. Figure 5 shows the region-proposal results based on CIS and DVS images. The CIS image requires a large amount of computation because it yields many proposals driven by background information, whereas the DVS image contains only moving foreground objects, which in turn reduces the computational cost dramatically [27]. In FRCNN [28], the Region Proposal Network (RPN) generates candidate regions, and the downstream RoI classification/regression cost scales with the number of retained proposals. Following the standard Faster R-CNN setting reported in the original paper (i.e., using N = 300 proposals per image while maintaining strong detection accuracy), we visualize the top-300 RPN proposals for the CIS-based baseline in Figure 5a. Figure 5b uses the DVS-specific detector described in the latter part of Section 4 and shows that the average retained proposal count is approximately 9, i.e., a reduction of roughly 30×. Since the primary motivation for using DVS in this work is edge deployment and model lightweighting, the proposal-count reduction provides a clear quantitative explanation of why DVS is beneficial in our system.

Comments 14: How exactly was the experimental evaluation organized (dataset split, test scenarios, definitions of the recall/FAR metrics, thresholds, confidence/IoU, mAP, etc.)? Admittedly, there is a qualitative description of the training data (positives/negatives, examples) and quantitative statements such as 19.8M DVS images used for training, recall >95%, FAR <2%. However, key details are missing: what constitutes the test set, how FAR is measured (definition/threshold), under what conditions the reported percentages were obtained, whether cross-validation was used, how many scenes/homes were involved, and so on. You understand what I mean—the methodology remains insufficiently specified.

Response 14: Thank you for the helpful comment. We agree that the evaluation methodology required clearer specification. In the revised manuscript, we added a concise “Experimental Protocol” description in the main text and also clarified key settings in the caption of Figure 7. Briefly, the dataset is organized into three categories (human, animal, and others). We report accuracy on the test set as the probability of correctly predicting the human class. We define FAR using negative frames (frames labeled as non-human, i.e., animal or others) as the proportion of negatives incorrectly predicted as human.

  • (Section 4) The proposed algorithm is primarily designed for home surveillance occupancy detection, i.e., determining whether a person is present or absent in residential indoor environments. Accordingly, our experiments focus on static-camera, indoor, human motion–centric scenarios, which reflect the intended deployment conditions (always-on operation, low latency, and edge compute constraints). In this context, for data generalization, we recorded DVS images across 10 home conditions (e.g., different rooms, illumination changes, daily activity patterns, and household layouts). Figure 7 shows representative sample images from the training data set, which were collected to reflect realistic indoor conditions such as cluttered backgrounds, different distances to the sensor, and diverse human motions. The positive human data set includes various ages, heights, genders, and clothing styles, with diverse actions such as walking, jumping, duck-walking, crawling, and overlapping. This diversity is important because DVS images often contain sparse edge-like patterns rather than textured appearance; therefore, robust detection requires the model to learn motion-driven body contours that may change significantly with posture, speed, and partial occlusion. In particular, overlapping and crawling cases are challenging for home surveillance because the visible edge structures can be fragmented and the scale/aspect ratio of the human region can vary rapidly. In addition, the negative data set includes typical indoor objects and distractors such as chairs, curtains, TVs, dogs, cats, dolls, fans, and robotic vacuum cleaners. These negatives are intentionally included because many household objects can generate event responses (e.g., moving fan blades, robotic vacuum motion, or pet movement), which may otherwise cause false alarms. By training with such hard-negative examples, the detector can better distinguish true human motion patterns from non-human motion and background dynamics in a practical deployment setting. In total, we utilized 19.8M DVS images for training; specifically, we generated 19.8M recency frames at 10 Hz from approximately 550 hours of recordings collected across 10 rooms, three days, and 8 participants. In addition, we constructed independent validation and test sets (2M DVS images each) collected under different home conditions (e.g., different rooms and/or different time periods) that were not used for training. While the proposed approach is effective for static-camera indoor occupancy detection, its performance may degrade under conditions that violate the deployment assumptions, such as strong camera ego-motion, outdoor scenes with dense background events (e.g., wind-driven foliage, rain, strong illumination flicker), or tasks requiring fine-grained multi-class recognition across many object categories. Our method is therefore best interpreted as an edge-oriented representation and lightweight inference strategy for home surveillance. As a result, recall and the False Acceptance Rate (FAR) were measured to be >95% and <2%, respectively, indicating that the constructed dataset and training strategy are effective for reliable human detection under realistic smart-home scenarios. Under identical evaluation settings, the standard deviation of recall is typically approximately 0.5% and the standard deviation of FAR is typically approximately 0.1%; however, these statistics are condition-dependent and should not be interpreted as universal constants.
In home surveillance, the DVS event stream varies with ambient illumination and subject–sensor distance. As a result, both the mean performance and its dispersion change across environmental buckets. For example, when the illumination is ≥10 lux and the subject is within 5 m, recall is approximately 98%; under the same illumination, recall decreases to approximately 96% when the distance increases to 5–7 m. Under dimmer illumination (5–10 lux), recall is approximately 96% within 5 m, and further drops to approximately 92% at 5–7 m. Importantly, the std also differs across these buckets, reflecting different levels of event sparsity and edge contrast under varying illumination and range.
  • (Figure 7 Caption) Sample images of the DVS training data set for home surveillance. The dataset contains diverse positive human cases (various actions and occlusions, including overlapping) and hard negative indoor cases (e.g., pets and moving appliances), enabling robust detection with high recall (>95%) and low FAR (<2%). Our dataset was collected and labeled into three categories: human, animal, and others (e.g., household objects/background dynamics). The evaluation is performed on a held-out test set. In this work, we report accuracy on the test set as the probability of correctly predicting the human class. We define FAR using negative frames (frames labeled as non-human, i.e., animal or others) as the proportion of negatives incorrectly predicted as human.
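
The frame-level metric definitions in the caption above can be expressed as a short sketch; labels and predictions are assumed to take values in {"human", "animal", "others"}, and the names and toy data are illustrative rather than the authors' evaluation code.

```python
def recall_and_far(labels, preds):
    """Recall on the human class and FAR over non-human (negative) frames."""
    pos = [(l, p) for l, p in zip(labels, preds) if l == "human"]
    neg = [(l, p) for l, p in zip(labels, preds) if l != "human"]
    recall = sum(p == "human" for _, p in pos) / max(len(pos), 1)
    far = sum(p == "human" for _, p in neg) / max(len(neg), 1)
    return recall, far

# Toy usage: six frames with ground-truth labels and model predictions.
labels = ["human", "human", "animal", "others", "human", "others"]
preds  = ["human", "human", "human",  "others", "others", "others"]
r, f = recall_and_far(labels, preds)
print(f"recall = {r:.2f}, FAR = {f:.2f}")  # recall = 0.67, FAR = 0.33 on this toy data
```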

Comments 15: I am currently reading Section 5, Human Pose Estimation, and I find myself wondering which specific dataset was used for pose estimation and under what experimental conditions. There is a bibliographic reference to the DHP19 dataset in the list of references, but this is not equivalent to a clearly described experimental setup in the text. Please specify which dataset and evaluation protocol were used for pose estimation (train/test split, metrics, settings).

Response 15: Thank you for pointing out that the pose-estimation dataset and evaluation protocol were not described with sufficient clarity. We have revised Section 5 (Human Pose Estimation) to explicitly specify the dataset, splits, and metrics. Concretely, we use the MS COCO 2017 Key-point Detection benchmark in a top-down single-person setting, where each annotated person instance is cropped and treated as one single-person training sample (i.e., instance-based single-person formulation). We added the commonly used COCO-2017 key-point split statistics (train/val key-point-labeled images and the corresponding number of person instances) and clarified that the RGB crops are converted into synthetic event streams using an event simulator.

  • (Section 5) Here, we used the MS COCO 2017 key-point detection benchmark. The COCO key-point annotations define 17 anatomical key-points for each person. In our experiments, we adopted a top-down single-person setting, where each training sample corresponded to a single-person instance crop extracted from the COCO images (i.e., the multi-person images were converted into single-person training instances by cropping per annotated person). In terms of official COCO-2017 splits, the key-point task is based on train2017/val2017 [36]; for the key-point-labeled subset, commonly used splits include 56,599 training images and 2,346 validation images with key-point annotations. In the top-down instance-based formulation, this corresponds to approximately 149,813 person instances for training and 6,352 person instances for validation, where each person instance is treated as one single-person sample. Because our sensing modality is event-based, we converted the COCO single-person crops into synthetic event streams using an event simulator [17]. We evaluated pose estimation using the COCO key-point evaluation protocol based on Object Key-point Similarity (OKS), which plays a role analogous to IoU in detection. A predicted pose is matched to a ground-truth pose and considered correct if OKS ≥ 0.5. To design the key-point estimation network for pose recognition, we utilized HRNet because its performance is superior to that of other networks [36]. HRNet is effective in maintaining high-resolution representations and fusing multi-scale features, which generally improves the localization accuracy of body joints. However, because the original HRNet performs many redundant computations when processing DVS images, we pruned the backbone network to reduce the network size, as described in Section 4.

- reference 36: Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation, In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019.
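
To illustrate the OKS ≥ 0.5 criterion mentioned in the revised Section 5 text above, the sketch below follows the standard COCO definition, OKS = mean over labeled joints of exp(−d_i²/(2·s²·k_i²)); the per-keypoint constants `kappa` are placeholders and should be replaced by the official COCO values in practice.

```python
import numpy as np

def oks(pred_xy, gt_xy, visibility, area, kappa):
    """pred_xy, gt_xy: (17, 2) joint coordinates; visibility: (17,), v > 0 if labeled."""
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)              # squared joint distances
    e = d2 / (2.0 * area * kappa ** 2 + np.finfo(float).eps)  # area plays the role of s^2
    mask = visibility > 0
    return float(np.exp(-e)[mask].mean()) if mask.any() else 0.0

def pose_is_correct(pred_xy, gt_xy, visibility, area, kappa, thr=0.5):
    """A predicted pose counts as correct when OKS >= 0.5 (matching the text)."""
    return oks(pred_xy, gt_xy, visibility, area, kappa) >= thr
```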

Comments 16: Again regarding Section 5: you explain, in general terms, what you reduced, but you do not provide an exact specification/configuration of the resulting model (e.g., the specific HRNet variant, the pruning scheme for the modules, a layer-by-layer configuration table, etc.). Could you provide this?

Response 16: Thank you for the comment. We agree that the previous manuscript does not provide a full layer-by-layer specification of the final compressed pose-estimation model. Our baseline backbone is HRNet-W48, and we derived a lightweight variant from HRNet-W48 through the model reduction/compression steps described in Section 5. However, due to personal constraints at this time, we are not able to provide the exact final configuration (e.g., a full pruning table or module-by-module specification). We have therefore kept the description at the methodological level and report the resulting performance and efficiency metrics to maintain transparency regarding the achieved trade-offs.

Comments 17: Nevertheless, congratulations for not merely “lightweighting” the network, but for explicitly analyzing which structural parameters have the greatest impact on accuracy. You perform selective pruning rather than blind reduction.

Response 17: Thank you very much for the positive feedback.

Comments 18: Do you have validation on real low-end edge hardware for pose estimation (and not only on a GPU), and/or an analysis of robustness with respect to occlusions, varying motion speeds, and noise in the event stream?

Response 18: Thank you for the comment. At this stage, our work has not yet progressed to a full commercialization-ready deployment with pose-estimation validation on real low-end edge hardware. We verified the lightweighting/compression effects primarily at the GPU level (model size and runtime trends) as an intermediate development step. In addition, we did not conduct a dedicated robustness study covering diverse test conditions such as occlusions, varying motion speeds, or event-stream noise beyond the qualitative observations discussed in the paper.

Comments 19: In Section 6, you report an extremely low False Acceptance Rate (FAR = 0.0926%), which I also consider to be a very important indicator of system reliability for HMI applications. However, the text lacks clarity regarding how FAR is defined and measured. Please describe whether the evaluation was performed at the frame level or at the sequence level, which confidence threshold was used for classification, and under what experimental conditions the reported value was obtained.

Response 19: Thank you for the comment. In our HMI experiment, the classification task is intentionally simple, with three gesture categories (Rock–Paper–Scissors), which contributes to a low FAR under the tested conditions. We measure FAR at the sequence level (not per frame). Each test sample corresponds to a gesture sequence, and the classifier outputs per-frame confidence scores that are aggregated into a single sequence-level confidence (majority voting). A sequence is accepted only when the maximum class confidence exceeds a predefined threshold. We define false acceptance as the case where a non-target (negative) sequence is incorrectly accepted as one of the gesture commands. Accordingly, FAR is computed as FAR = N_FP / N_neg × 100 (%), where N_FP is the number of non-matching (ground-truth ≠ predicted) sequences that are accepted and N_neg is the total number of negative sequences in the test set. Specifically, each decision is made from a 10-frame sequence, and we set a voting threshold over these 10 frames to minimize FAR (i.e., a gesture is accepted only if sufficient per-frame votes/support are accumulated within the 10-frame window).
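
A minimal sketch of this sequence-level decision rule and the resulting FAR computation is given below; the class names, vote threshold, and toy data are illustrative assumptions, not the deployed configuration.

```python
from collections import Counter

GESTURES = {"rock", "paper", "scissors"}
VOTE_THRESHOLD = 7  # hypothetical: accept only if >= 7 of 10 frames agree

def decide(frame_preds):
    """Return the accepted gesture for a 10-frame window, or None (rejected)."""
    winner, votes = Counter(frame_preds).most_common(1)[0]
    return winner if winner in GESTURES and votes >= VOTE_THRESHOLD else None

def far_percent(negative_sequences):
    """FAR = accepted negative sequences / total negative sequences x 100 (%)."""
    accepted = sum(decide(seq) is not None for seq in negative_sequences)
    return 100.0 * accepted / max(len(negative_sequences), 1)

# Toy usage: two non-target windows; only the first gathers enough spurious votes.
negatives = [["rock"] * 8 + ["none"] * 2, ["none"] * 6 + ["paper"] * 4]
print(f"FAR = {far_percent(negatives):.1f}%")  # 50.0% on this toy data
```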

Comments 20: In the same section (Section 6), you state that the proposed hand pose classification works reliably under intermittent motion, varying motion speeds, and the presence of partial edge structures due to changes in viewpoint. However, for the readers these may remain merely claims. They are presented at a qualitative level and are not supported by a dedicated experimental analysis or quantitative results for the corresponding scenarios. Could you provide an analysis that substantiates these statements?

Response 20: Thank you for the comment. In our system, reliability is primarily achieved by minimizing false acceptances via sequence-level decision making rather than frame-level classification. This sequence-level voting mitigates spurious per-frame fluctuations caused by intermittent motion, varying motion speeds, and partial edge structures, thereby improving stability in practical use. To further support these statements qualitatively, we have also attached a separate demo video illustrating representative test cases under intermittent motion and viewpoint variations.

Comments 21: Again in Section 6, you mention your previous work based on optical flow and gesture estimation as motivation for developing the compact posture classifier. However, you do not provide a quantitative comparison between the proposed approach and optical-flow-based methods, nor with other existing baseline solutions for hand posture recognition. Please include an additional paragraph addressing this aspect.

Response 21: Thank you for the helpful suggestion. We agree that, for HMI, it is important to discuss how the proposed posture-based classifier relates to motion/optical-flow-based gesture pipelines. In the revised manuscript, we added an additional paragraph in Section 6 clarifying this point and included appropriate references. Specifically, we emphasize that HMI reliability is not only determined by accuracy but also by False Acceptance Rate (FAR), because false triggers directly degrade user experience and safety. We explain that optical-flow-based event-driven gesture inference can be intrinsically sensitive to the aperture problem and to spurious/noise events (and also to mechanical vibration), which can increase false triggers unless additional regularization and stabilization are applied. In contrast, our method is posture/edge-shape–centric and uses sequence-level majority voting to explicitly minimize FAR in an always-on setting. We also note that many event-camera gesture-recognition works primarily report classification accuracy and do not provide an explicit FAR under negative sequences; therefore, we report FAR as a key system-level metric for HMI deployment.

  • (Section 6) In our previous work [39], we proposed a low-latency optical flow and gesture estimation algorithm. While optical-flow-based approaches are effective for dynamic gestures, real edge deployments often benefit from a compact posture classifier that can operate reliably even when (i) motion is intermittent, (ii) the hand moves at different speeds, and (iii) only partial edges are available due to viewpoint changes or occlusions. Specifically, optical-flow-based methods inherently face the aperture problem, where local edge motion constrains only the normal component, requiring spatial/temporal regularization that can be sensitive to spurious events, viewpoint changes, and mechanical vibration—factors that may increase false triggers in always-on HMI settings. In contrast, the present work intentionally adopts a compact posture/edge-shape classifier (rather than a velocity-field estimator) and evaluates reliability using sequence-level FAR with majority voting over a short temporal window. This posture-centric formulation reduces the dependence on stable flow estimation and thus is better aligned with HMI requirements where low false acceptance is as critical as accuracy. While many event-camera gesture-recognition studies primarily report classification accuracy (and often omit an explicit FAR definition under negative sequences [40]), we explicitly report FAR for system-level reliability in practical deployment.

- reference 40: Lee, J.-H.; Delbruck, T.; Pfeiffer, M.; Park, P.K.J.; Shin, C.-W.; Ryu, H.; Kang, B.C. Real-time gesture interface based on event-driven processing from stereo silicon retinas. IEEE Transactions on Neural Networks and Learning Systems 2014, 25, 2250–2263.

Comments 22: In several sections (e.g., Sections 2 and 6), the phrase “We had recently proposed …” is used. This tense (past perfect) is unnatural in a scientific context; “We recently proposed …” or “In our previous work, we proposed …” would be more appropriate.

Response 22: Thank you for the careful reading. We agree that the phrasing “We had recently proposed …” is unnatural in a scientific writing style. In the revised manuscript, we replaced this expression throughout the paper with standard academic wording such as “We recently proposed …” or “In our previous work, we proposed …” (with appropriate citations), depending on the specific context.

Comments 23: Perhaps I am misunderstanding something, but it seems to me that reference [14] is cited in connection with HRNet, whereas it actually discusses a different architecture (a vision transformer ?), which should be checked and corrected. HRNet was originally introduced by Sun et al., CVPR 2019 (which corresponds to [13]). Reference [14] is not HRNet, but a different, transformer-based model.

Response 23: Thank you for pointing this out. We agree that reference [14] describes a transformer-based architecture and is not the original HRNet work. In the revised manuscript, we have checked all occurrences where HRNet is discussed and corrected the citations by replacing the incorrect reference with the original HRNet paper by Sun et al., CVPR 2019.

 

I would like to express my sincere gratitude to the authors for this inspiring research. This article sparked excitement and interest in me as I read it. All the questions and recommendations posed are made with the intention of contributing to the improvement of the article and highlighting its strengths.

Response: Thank you very much for your kind and encouraging comments. We sincerely appreciate your careful reading and constructive recommendations, which we believe will significantly improve the clarity and technical strength of the manuscript. We have revised the paper accordingly to address your questions and to better highlight the key contributions and strengths of our work.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Although the research topic is worth investigating, the current version of the paper is far from publishable because many parts are not detailed. I would like to reconsider the paper after a major revision.
Comment 1. In the highlights, the points were not concise and did not clearly emphasize the contributions and benefits.
Comment 2. Abstract:
(a) The research contributions and methodologies are not clearly summarized.
(b) Results are missing.
Comment 3. Keywords: More terms should be included to better reflect the scope of the paper. The maximum number of terms is 10.
Comment 4. Section 1 Introduction:
(a) Strengthen the importance of the research topic.
(b) Clearly state the research contributions of the paper.
Comment 5. The organization was poor. Please consider reorganising the sections to provide a literature review in Section 2 (but a literature review was missing in the current version of the paper), a methodology in Section 3, and performance evaluation and comparison in Section 4.
Comment 6. Enhance the resolution of all figures. Zoom in on your file to confirm that no content is blurred.
Comment 7. Limited details were presented for the methodology, including equations, pseudo-code, etc.
Comment 8. Results for fine-tuning the model were missing.
Comment 9. Compare your method with existing methods.

Author Response

Summary

Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections.

Reviewer’s comment. Although the research topic is worth investigating, the current version of the paper is far from publishable because many parts are not detailed. I would like to reconsider the paper after a major revision.

  • Thank you for the overall assessment. We agree that the previous version did not provide sufficient technical detail and therefore required a major revision. In response, we have substantially revised the manuscript to improve completeness and reproducibility: (i) we expanded Section 2 with a dedicated literature review to properly position our work, (ii) we clarified Section 3 as the methodology section by adding explicit formulations and implementation details (including key parameters and algorithmic description), and (iii) we strengthened Section 4 with more comprehensive performance evaluation and comparisons against existing approaches. We also streamlined the Highlights/Abstract, expanded the keywords, and addressed figure quality by verifying legibility under zoom, while clarifying unavoidable sensor-resolution limits and motion blur in the CIS example. We believe these changes constitute a major revision that addresses the reviewer’s concerns and significantly improves the manuscript’s clarity, rigor, and publishability.

Comments 1: In the highlights, those points were not concise and emphasized the contributions and benefits.

Response 1: Thank you for the comment. We agree that the original highlights statements were not sufficiently concise and did not clearly emphasize the key contributions and benefits. In the revised manuscript, we rewrote these items.

  • (Highlights) What are the main findings?
    • We present an event-image representation that preserves moving-edge structure while reducing data volume for downstream processing.
    • The proposed event-based edge-AI computing achieves an 11× speed-up for human detection and pose estimation.
  • (Highlights) What are the implications of the main findings?
    • The approach enables privacy-friendly, always-on home occupancy sensing under tight edge constraints.
    • Combining event encoding with compact models is an effective deployment recipe for motion-centric edge AI tasks.

Comments 2: Abstract:

(a) The research contributions and methodologies are not clearly summarized.

(b) Results are missing.

Response 2: Thank you for the comment. We agree that the original Abstract did not clearly summarize the key contributions and lacked sufficient quantitative results. In the revised Abstract, we explicitly state the main methodological contributions (timestamp-based, polarity-agnostic recency encoding and edge-oriented network optimizations) and add representative quantitative outcomes, including the data-volume reduction results and task-level performance/efficiency metrics (e.g., accuracy improvements, speed-up/latency, FAR, model size, and FLOPs), to substantiate the claims.

  • (Abstract) Event-based sensors provide sparse, motion-centric measurements that can reduce data bandwidth and enable always-on perception on resource-constrained edge devices. This paper presents an event-based machine-vision framework for smart-home AIoT that couples a Dynamic Vision Sensor (DVS) with compute-efficient algorithms for (i) human/object detection, (ii) 2D human pose estimation, and (iii) hand posture recognition for human-machine interfaces. The main methodological contributions are a timestamp-based, polarity-agnostic recency encoding that preserves moving-edge structure while suppressing static background, and task-specific network optimizations (architectural reduction and mixed-bit quantization) tailored to sparse event images. With a fixed downstream network, the recency encoding improves action-recognition accuracy over temporal accumulation (0.908 vs. 0.896). In a 24-hour indoor monitoring experiment (640 × 480), the raw DVS stream is about 30× smaller than conventional CMOS video and remains about 5× smaller after standard compression. For human detection, the optimized event processing reduces computation from 5.8 GFLOPs to 81 MFLOPs and runtime from 172 ms to 15 ms (more than an 11× speed-up). For pose estimation, a pruned HRNet reduces model size from 127 MB to 19 MB and inference time from 70 ms to 6 ms on an NVIDIA Titan X while maintaining comparable accuracy (mAP from 0.95 to 0.94) on MS COCO 2017 using synthetic event streams generated by an event simulator. For hand posture recognition, a compact CNN achieves 99.19% recall and 0.0926% FAR with 14.31 ms latency on a single i5-4590 CPU core using 10-frame sequence voting. These results indicate that event-based sensing combined with lightweight inference is a practical approach to privacy-friendly, real-time perception under strict edge constraints.

Comments 3: Keywords: More terms should be included to better reflect the scope of the paper. The maximum number of terms is 10.

Response 3: Thank you for the comment. We agree that the keyword list was too limited. In the revised manuscript, we expanded the Keywords to the maximum of 10 terms to better reflect the scope of the paper, including the proposed timestamp-based encoding and the three target tasks (human detection, pose estimation, and hand posture recognition).

  • (Keywords) Dynamic Vision Sensor; Event-Based Vision; Edge AI; Neuromorphic; Timestamp-Based Encoding; Polarity-Agnostic Event Representation; Home Occupancy Sensing; Human Detection; Human Pose Estimation; Hand Posture Recognition

Comments 4: Section 1 Introduction:

(a) Strengthen the importance of the research topic.

(b) Clearly state the research contributions of the paper.

Response 4: Thank you for the comment. In the revised manuscript, we strengthened the Introduction to more clearly emphasize the importance of always-on smart-home/AIoT perception under stringent edge constraints. In addition, to explicitly clarify what is new in this work, we added a dedicated contributions item at the end of the Introduction that succinctly enumerates the main contributions of the paper.

  • (Introduction) The contributions of this paper are summarized as follows:
  • Polarity-agnostic recency encoding for edge-centric perception. We introduce a polarity-agnostic timestamp/recency image representation that emphasizes motion-induced edge shape while avoiding polarity-dependent failure cases, making it suitable for occupancy-oriented home surveillance tasks.
  • Controlled evaluation with fixed downstream networks. We validate the proposed representation under fixed downstream network and training protocols, isolating the effect of event-to-image encoding from architectural changes, and demonstrate that competitive recognition performance can be achieved with lightweight models.
  • Edge deployment optimization and system-level evaluation (Sections 4–6). We present a practical edge computing flow and report end-to-end latency/compute/accuracy trade-offs. The presented pruning/layer reduction, mixed-bit quantization, and stride choices are standard engineering optimizations documented for reproducibility, showing that the proposed method can operate in real time under strict edge constraints.

Comments 5: The organization was poor. Please consider reorganising the sections to provide a literature review in Section 2 (but a literature review was missing in the current version of the paper), a methodology in Section 3, and performance evaluation and comparison in Section 4.

Response 5: Thank you for the constructive suggestion regarding the manuscript organization. In the revised manuscript, we added a dedicated literature review in Section 2 (Related Work) to summarize key event-based vision surveys and position our work relative to prior studies. We also clarified and consolidated the methodology in Section 3, explicitly describing the proposed encoding and system design choices. Finally, we revised Section 4 to include expanded performance evaluation and comparisons, with clearer descriptions of experimental settings, metrics, and baseline comparisons.

  • (Section 2 – Literature Review) Event-based vision and event cameras have been comprehensively reviewed in several surveys. Gallego et al. provide a foundational overview of event-camera sensing principles and core advantages (asynchronous measurements, high temporal resolution, and high dynamic range), and summarize representative processing techniques from low-level vision (e.g., feature tracking and optical flow) to high-level tasks (e.g., recognition and detection), including common event representations and learning-based approaches [2]. More recently, Cimarelli et al. present a broad review that integrates hardware evolution, algorithmic progress, and real-world applications of neuromorphic/event cameras in a unified structure, while also discussing practical challenges and adoption barriers relevant to deployment [3]. In addition, Cazzato and Bono provide an application-driven survey that organizes event-based computer vision methods by domain and highlights key achievements and open issues across application areas [4]. Finally, Chakravarthi et al. survey recent event-camera innovations, including sensor-model developments and commonly used datasets/simulators that support benchmarking and system validation [5]. In contrast to these survey papers, which broadly summarize event-camera principles and general-purpose algorithms across diverse domains, this work focuses on a deployment-oriented smart-home setting and provides a concrete end-to-end framework (timestamp-based encoding plus edge-optimized inference) for always-on occupancy-related tasks with explicit system-level evidence (data volume, latency, FLOPs/model size, and accuracy).
  • (Section 3 - Methodology) Thus, in this work, we introduce a polarity-agnostic global recency encoding, defined by a timestamp-based intensity mapping, that is explicitly tailored to edge-intensity tasks (object detection, human pose estimation, and hand posture recognition). Instead of maintaining separate recency maps for ON and OFF events, we update a single per-pixel timestamp memory with any event regardless of polarity. This avoids contour fragmentation when ON/OFF events are imbalanced (e.g., edges that predominantly generate only one polarity under certain motion/lighting conditions) and reduces memory/compute by eliminating multi-channel polarity handling. The key idea is to store the latest timestamp at each pixel and convert the time difference into an intensity value. In other words, each pixel intensity represents the recency of activity rather than the number of events. This approach can preserve important temporal cues while still providing a frame-like representation that can be directly used by conventional Convolutional Neural Network (CNN)-based techniques. (…) To validate the proposed technique, we performed the action recognition task using a human activity dataset and a DVS event simulator [17]. The public NTU RGB+D 120 human activity dataset includes a large-scale benchmark containing 120 action classes and 114,480 samples captured from 106 subjects [18]. The dataset provides synchronized multi-modal streams including RGB, depth, infrared (IR), and 3D skeletons (25 body joints) recorded with three Microsoft Kinect v2 cameras. On NTU RGB+D 120, Table 1 reports a representation-level ablation in which the downstream network and training protocol are fixed and only the event-to-image encoding is changed. The proposed polarity-agnostic recency image improves top-1 accuracy from 89.6% (temporal accumulation) to 90.8% (+1.2 percentage points), indicating that recency-aware edge encoding provides more stable cues than pure event counts without altering the downstream architecture.
  • (Section 4 – Performance Evaluation) As a result, the recall accuracy and False Acceptance Rate (FAR) were measured to be >95% and <2%, respectively, indicating that the constructed dataset and training strategy are effective for reliable human detection under realistic smart-home scenarios. Under identical evaluation settings, we observe that the standard deviation of recall is typically ~0.5%, and that of the false alarm rate (FAR) is typically ~0.1%. However, these statistical values are condition-dependent and should not be interpreted as universal constants. In home surveillance, the DVS event stream varies with ambient illumination and subject–sensor distance. As a result, both the mean performance and its dispersion change across environmental buckets. For example, when the illumination is ≥10 lux and the subject is within 5 m, recall is approximately 98%; under the same illumination, recall decreases to approximately 96% when the distance increases to 5–7 m. Under dimmer illumination (5–10 lux), recall is approximately 96% within 5 m, and further drops to approximately 92% at 5–7 m. Importantly, the standard deviations also differ across these buckets, reflecting different levels of event sparsity and edge contrast under varying illumination and range.
  • (Section 4 - Comparison) To benchmark the occupancy-detection capability claimed in this paper, we use the performance of conventional Passive-Infra-Red-based (PIR-based) motion/occupancy sensors as a practical baseline for comparison. Importantly, PIR reliability is highly dependent on installation and environmental conditions; for example, reported presence-detection accuracy can be as low as ~60% under typical ceiling placement and can improve to ~84% under more favorable placement, highlighting substantial variability in real deployments [33]. Moreover, long-term field testing in a single-family home reports an overall accuracy of 83.8% with a 12.8% false-positive rate (FPR) for commercial occupancy-presence sensing, and explicitly identifies failure modes during prolonged static periods (e.g., sleep) [34]. Against this background, the high-recall and low-false-alarm operating points reported in this work (while satisfying strict edge compute/latency constraints) indicate that the proposed approach is a meaningful step beyond PIR-based baselines for occupancy-aware smart-home services.

- reference 2: Gallego, G.; Delbrück, T.; Orchard, G.; Bartolozzi, C.; Taba, B.; Censi, A.; Leutenegger, S.; Davison, A.J.; Conradt, J.; Daniilidis, K.; Scaramuzza, D. Event-based vision: a survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 2022, 44, 154–180.

- reference 3: Cimarelli, C.; Millan-Romera, J.A.; Voos, H.; Sanchez-Lopez, J.L. Hardware, algorithms, and applications of the neuromorphic vision sensor: a review. Sensors 2025, 25(19), 6208.

- reference 4: Cazzato, D.; Bono, F. An application-driven survey on event-based neuromorphic computer vision. Information 2024, 15(8), 472.

- reference 5: Chakravarthi, B.; Verma, A.A.; Daniilidis, K.; Fermüller, C.; Yang, Y. Recent event camera innovations: a survey. arXiv 2024, arXiv:2408.13627.

- reference 17: Radomski, A.; Georgiou, A.; Debrunner, T.; Li, C.; Longinotti, L.; Seo, M.; Kwak, M.; Shin, C.; Park, P.; Ryu, H.; et al. Enhanced frame and event-based simulator and event-based video interpolation network, arXiv 2021, arXiv:2112.09379.

- reference 18: Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; Kot, A. C. NTU RGB+D 120: a large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence 2020, 42, 2684–2701.

- reference 33: Azizi, S.; Rabiee, R.; Nair, G.; Olofsson, T. Effects of positioning of multi-sensor devices on occupancy and indoor environ-mental monitoring in single-occupant offices. Energies 2021, 14(19), 6296.

- reference 34: Pang, Z.; Guo, M.; O’Neill, Z.; Smith-Cortez, B.; Yang, Z.; Liu, M.; Dong, B. Long-term field testing of the accuracy and HVAC energy savings potential of occupancy presence sensors in a single-family home. Energy and Buildings 2025, 328, 115161.

Comments 6: Enhance the resolution of all figures. Zoom in on your file to confirm that no content is blurred.

Response 6: Thank you for the comment. In the revised manuscript, we carefully inspected all figures by zooming in on the document and confirmed that the contents are clearly legible and not blurred. We also clarify that the sensor images shown in Figures 1, 5, and 7 are fundamentally limited by the native sensor resolution. In our experiments, the event-based sensor outputs VGA resolution (640 × 480), and therefore these images cannot be upsampled beyond the sensor’s intrinsic spatial resolution. In addition, the Figure 1(a) CIS example contains unavoidable motion blur because it was captured under human motion; this blur is an inherent limitation of conventional frame-based imaging in such dynamic scenes rather than a rendering or formatting issue in the manuscript.

Comments 7: Limited details were presented for the methodology, including equations, pseudo-code, etc.

Response 7: Thank you for the comment. We agree that the previous version did not provide sufficient methodological detail. In the revised manuscript, we expanded the methodology by adding explicit mathematical definitions for the proposed timestamp-based, polarity-agnostic recency encoding (including all parameters and boundary conditions). We also added clearer descriptions of the datasets, network simplifications, and evaluation metrics in the experimental sections.

 

  • (Section 3) Let an event be ei = (xi, yi, ti, pi), where (xi, yi) is the pixel location, ti is the timestamp, and pi ∈ {+1, −1} is the polarity. At a sampling time tk, we define the polarity-agnostic recency surface

        S(x, y) = max{ ti : xi = x, yi = y, ti ≤ tk },                (1)

i.e., the most recent event time at each pixel regardless of polarity. The per-pixel recency is Δt(x, y) = tk − S(x, y). To convert recency into an 8-bit intensity image, we use an exponential mapping with a temporal sensitivity parameter Ts and intensity amplitude Imax:

        I(x, y) = Imax · exp( −Δt(x, y) / Ts ),                (2)

where pixels with more recent events become brighter. In our indoor setup, we set Ts = 100 ms and Imax = 255 for 8-bit grayscale scaling. Empirically, Ts acts as a motion-dependent temporal sensitivity knob. We found that Ts ≈ 20 ms is suitable for near-field, fast hand-gesture motion (~1 m), whereas Ts ≈ 100 ms provides the best trade-off for typical indoor human motion at longer range (~5 m), preserving moving-edge structure while attenuating spurious/noisy events. Accordingly, we use Ts = 20 ms for hand posture recognition and Ts = 100 ms for human detection and pose estimation in this work (a minimal code sketch of this encoding is given after these excerpts).

  • (Section 4) Privacy and user acceptance are primary constraints in in-home sensing. Prior studies on video-based in-home/assisted-living monitoring report that the acceptance of conventional RGB cameras can be limited, with privacy concerns and perceived intrusiveness being major barriers, particularly for intimate situations that may occur in private spaces [25]. Accordingly, we employ a DVS not only for computational efficiency but also as a privacy-aware sensing modality. Event representations have been discussed as a viable direction for privacy-preserving surveillance because they mainly encode moving boundaries of the subject while discarding much of the redundant visual content [26]. Object detection is required for home surveillance. Here, we employed DenseNet and Darknet architectures as feature extractors for human detection. The Faster R-CNN (FRCNN) structure gives the probabilistic location (region proposal) of humans. Figure 5 shows the region proposal results based on CIS and DVS images. The CIS image requires a huge amount of computation because it yields many proposals from background content, while the DVS image includes only moving foreground objects, which in turn reduces the computational cost dramatically [27]. In FRCNN [28], the Region Proposal Network (RPN) generates candidate regions, and the downstream RoI classification/regression cost scales with the number of retained proposals. Following the standard Faster R-CNN setting reported in the original paper (i.e., using N = 300 proposals per image while maintaining strong detection accuracy), we can visualize the top-300 RPN proposals for the CIS-based baseline in Figure 5a. Figure 5b uses the DVS-specific detector described in the latter part of Section 4 and shows that the average retained proposal count is ~9, i.e., reduced by a factor of several tens. Since the primary motivation for using DVS in this work is edge deployment and model lightweighting, the proposal-count reduction provides a clear quantitative explanation of why DVS is beneficial in our system.
  • (Section 4) In total, we utilized 19.8 M DVS images for the training. Specifically, we generated 19.8 M recency frames at 10 Hz from approximately 550 hours of recordings collected across 10 rooms, three days, and 8 participants. In addition, we constructed independent validation and test sets (2 M DVS images each), collected under different home conditions (e.g., different rooms and/or different time periods), that were not used for training. While the proposed approach is effective for static-camera indoor occupancy detection, its performance may degrade under conditions that violate the deployment assumptions, such as strong camera ego-motion, outdoor scenes with dense background events (e.g., wind-driven foliage, rain, strong illumination flicker), or tasks requiring fine-grained multi-class recognition across many object categories. Our method is therefore best interpreted as an edge-oriented representation and lightweight inference strategy for home surveillance. As a result, the recall accuracy and False Acceptance Rate (FAR) were measured to be >95% and <2%, respectively, indicating that the constructed dataset and training strategy are effective for reliable human detection under realistic smart-home scenarios. Under identical evaluation settings, we observe that the standard deviation of recall is typically ~0.5%, and that of the false alarm rate (FAR) is typically ~0.1%. However, these statistical values are condition-dependent and should not be interpreted as universal constants. In home surveillance, the DVS event stream varies with ambient illumination and subject–sensor distance. As a result, both the mean performance and its dispersion change across environmental buckets. For example, when the illumination is ≥10 lux and the subject is within 5 m, recall is approximately 98%; under the same illumination, recall decreases to approximately 96% when the distance increases to 5–7 m. Under dimmer illumination (5–10 lux), recall is approximately 96% within 5 m, and further drops to approximately 92% at 5–7 m. Importantly, the standard deviations also differ across these buckets, reflecting different levels of event sparsity and edge contrast under varying illumination and range.
  • (Section 5) A pose estimation task can be used to control the home environment (e.g., illumination and temperature) and to interact with a video game console. It has been recently reported that DVS images can be used for human pose estimation [35]. In practical edge AI scenarios, pose estimation is a useful semantic sensing function because it provides compact and actionable information (key-points) rather than full image content, which is well matched to the sparsity and motion-centric characteristics of DVS outputs. Here, we used the MS COCO 2017 key-point detection benchmark. The COCO key-point annotations define 17 anatomical key-points for each person. In our experiments, we adopted a top-down single-person setting, where each training sample corresponded to a single-person instance crop extracted from the COCO images (i.e., the multi-person images were converted into single-person training instances by cropping per annotated person). In terms of official COCO-2017 splits, the key-point task is based on train2017/val2017 [36]; for the key-point-labeled subset, commonly used splits include 56,599 training images and 2,346 validation images with key-point annotations. In the top-down instance-based formulation, this corresponds to approximately 149,813 person instances for training and 6,352 person instances for validation, where each person instance is treated as one single-person sample. Because our sensing modality was event-based, we converted the COCO single-person crops into synthetic event streams using an event simulator [17]. We evaluated pose estimation using the COCO key-point evaluation protocol based on Object Key-point Similarity (OKS), which plays a role analogous to IoU in detection. A predicted pose is matched to a ground-truth pose and considered correct if OKS ≥ 0.5. To design the key-point estimation network for pose recognition, we utilized HRNet because its performance is superior to that of other networks [36]. HRNet is effective in maintaining high-resolution representations and fusing multi-scale features, which generally improves localization accuracy of body joints. However, because the original HRNet performs substantial redundant computation when processing DVS images, we pruned the backbone network to reduce its size, as described in Section 4. There is a trade-off between accuracy and computation. During the pruning procedure, we found that the number of stages and channels was strongly related to accuracy. For example, when the number of stages was reduced below three, accuracy dropped by 4%. Similarly, when the number of channels was halved, accuracy dropped by 14.7%. These observations indicate that, although DVS inputs are sparse, pose estimation still requires sufficient representational capacity to preserve fine spatial cues for joint localization. Therefore, rather than uniformly shrinking the model, we selectively reduced the redundant blocks such as high-resolution modules, ResNet blocks, branches, and connections in the fusion layer, while maintaining the essential stages and channel capacity that directly affect accuracy. Table 3 shows the comparisons between Vanilla HRNet [36] and the proposed lightweight HRNet in terms of model size, accuracy, and processing time. As shown in Table 3, the proposed HRNet reduces the model size from 127 MB to 19 MB and improves processing time from 70 ms to 6 ms on Nvidia Titan X, while maintaining comparable pose estimation accuracy (0.95 to 0.94).
We confirmed that we could achieve more than an 11× speed-up by using event-based processing while maintaining the accuracy. Recent frame-based 2D human pose estimation commonly adopts HRNet-W48 as a strong baseline, and performance improvements are often achieved by either increasing model capacity or using multiple networks during training/inference. In particular, DE-HRNet [37] enhances HRNet-style high-resolution features by introducing detail-enhancement components and reports a modest gain on COCO test-dev (384×288): compared with HRNet-W48 (AP 0.755 with 63.6M parameters), DE-HRNet-W48 reports AP 0.757 while increasing the backbone size and compute to 74.8M parameters (which corresponds to an FP16 checkpoint size increase from roughly 127 MB to 150 MB). In a different direction, Boosting Semi-Supervised 2D HPE [38] improves accuracy primarily via stronger semi-supervised training and, for higher performance, adopts a dual-network setting (i.e., two identical yet independent networks). In their COCO test-dev comparison, the dual-network configuration effectively doubles model capacity relative to a single HRNet-W48 backbone and achieves improved accuracy (e.g., AP 0.772 in a dual setting). In contrast, our work targets event-based edge deployments (e.g., static-camera indoor monitoring) and focuses on improving the event-to-image representation (polarity-agnostic recency encoding) so that competitive accuracy and real-time operation can be achieved without scaling up the downstream network. In other words, while these benchmarks exemplify the common strategy of improving COCO pose accuracy by increasing backbone capacity or the number of networks, our approach emphasizes representation-level efficiency tailored to DVS streams and edge constraints.
  • (Section 6) The measured recall and False Acceptance Rate (FAR) were 99.19% and 0.0926%, respectively. We measured FAR at the sequence level, not per frame. Each test sample corresponds to a 10-frame gesture sequence, and the classifier outputs per-frame confidence scores that are aggregated into a single sequence-level decision using majority voting. A sequence is accepted only when the maximum class confidence exceeds a predefined threshold. We define a false acceptance as the case where a non-target (negative) sequence is incorrectly accepted as one of the gesture commands. Accordingly, FAR is computed as FAR = NFP / Nneg × 100 (%), where NFP is the number of non-matching (ground-truth ≠ predicted) sequences that are accepted and Nneg is the total number of negative sequences in the test set. Specifically, each decision is made from a 10-frame sequence, and we set a voting threshold over these 10 frames to minimize FAR (i.e., a gesture is accepted only if sufficient per-frame votes are accumulated within the 10-frame window). These results indicate that the proposed approach can provide both high detection sensitivity (high recall) and strong robustness against false triggers (low FAR), which are important requirements for HMI applications where user experience and safety depend on stable recognition outputs. In addition, we confirmed that the overall latency was measured to be 14.31 ms (@i5-4590 CPU, single core), which is sufficient for real-time interaction even on a low-end processor. This low latency suggests that the proposed event-based posture recognition can be integrated into always-on edge devices without requiring GPUs or high-end NPUs, and it can be combined with other event-driven perception modules (e.g., detection and pose estimation) to build a complete low-power interactive vision system.
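To make Equations (1) and (2) in the Section 3 excerpt above concrete, the following minimal Python sketch shows how a polarity-agnostic recency image could be computed. It is an illustrative sketch under stated assumptions (timestamps in seconds, a VGA sensor, and the Ts values quoted in the excerpt), not the authors' production code.

```python
import numpy as np

def recency_image(events, t_k, shape=(480, 640), tau_s=0.1, i_max=255):
    """Polarity-agnostic timestamp (recency) encoding, cf. Eqs. (1)-(2).

    events: iterable of (x, y, t, p) tuples with t <= t_k (seconds);
            the polarity p is deliberately ignored.
    shape:  (height, width) of the sensor, e.g. VGA (480, 640).
    tau_s:  temporal sensitivity Ts (0.1 s for room-scale motion,
            0.02 s for near-field hand gestures, per the excerpt).
    Returns an 8-bit image in which recently active pixels are brighter.
    """
    # Eq. (1): per-pixel memory of the most recent event timestamp.
    last_ts = np.full(shape, -np.inf)
    for x, y, t, _p in events:
        if t <= t_k and t > last_ts[y, x]:
            last_ts[y, x] = t

    # Eq. (2): map recency dt = t_k - S(x, y) to intensity.
    dt = t_k - last_ts
    img = i_max * np.exp(-dt / tau_s)   # pixels with no events -> exp(-inf) = 0
    return np.clip(img, 0, i_max).astype(np.uint8)
```

In this form, tau_s directly controls how quickly older edges fade: smaller values suit fast near-field hand motion, larger values suit slower room-scale motion, as noted in the excerpt.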

Comments 8: Results for fine-tuning the model were missing.

Response 8: Thank you for the comment. In the revised manuscript, we clarified that after each structural modification for lightweighting (e.g., reducing network stages or halving channels), we fine-tuned the updated model to recover accuracy, and we report the final post–fine-tuning performance for each configuration.

  • (Section 5) During the pruning procedure, we found that the number of stages and channels was strongly related to accuracy. For example, when the number of stages was reduced below three, accuracy dropped by 4%. Similarly, when the number of channels was halved, accuracy dropped by 14.7%. In all such compression steps, we fine-tuned the modified network after each stage/channel reduction, and we report the final post–fine-tuning performance (after accuracy recovery) for each configuration. These observations indicate that, although DVS inputs are sparse, pose estimation still requires sufficient representational capacity to preserve fine spatial cues for joint localization. Therefore, rather than uniformly shrinking the model, we selectively reduced the redundant blocks such as high-resolution modules, ResNet blocks, branches, and connections in the fusion layer, while maintaining the essential stages and channel capacity that directly affect accuracy.
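As an illustration of the reduce-then-fine-tune procedure described above, the sketch below shows one generic compression step in PyTorch-style Python. It is a hedged example, not the authors' exact pipeline: `reduce_fn` and `evaluate` are hypothetical caller-supplied helpers, and the heat-map MSE loss and hyperparameters are assumptions chosen for illustration.

```python
import copy
import torch

def prune_then_finetune(model, reduce_fn, train_loader, evaluate, epochs=5, lr=1e-4):
    """One generic reduce-then-fine-tune step (illustrative sketch).

    reduce_fn: returns a structurally reduced copy of the model
               (e.g., fewer stages or halved channels); caller-supplied.
    evaluate:  returns validation accuracy of a model; caller-supplied.
    The reduced model is fine-tuned so that the reported accuracy reflects
    the post-fine-tuning (recovered) performance of that configuration.
    """
    reduced = reduce_fn(copy.deepcopy(model))
    optimizer = torch.optim.Adam(reduced.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()  # e.g., key-point heat-map regression loss (assumption)
    reduced.train()
    for _ in range(epochs):
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = criterion(reduced(images), targets)
            loss.backward()
            optimizer.step()
    reduced.eval()
    return reduced, evaluate(reduced)
```

A full compression schedule would simply chain such steps (reduce stages, fine-tune, then reduce channels, fine-tune, and so on) and keep the configuration with the best accuracy-latency trade-off.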

Comments 9: Compare your method with existing methods.

Response 9: Thank you for the suggestion. In the revised manuscript, we strengthened the comparison with existing methods in two ways. First, we described representation baselines in Section 3, while keeping the downstream network fixed to isolate the effect of the proposed polarity-agnostic timestamp encoding. Second, in Sections 4–5 we expanded the quantitative comparison against recent event-based methods by reporting and citing published results for representative modern baselines (e.g., RVT and SAST for event-based detection, and recent event-based pose-estimation backbones), and we added a clearer comparison table summarizing accuracy and efficiency (e.g., FLOPs/latency/model size). Since many state-of-the-art benchmarks target different datasets and multi-class settings, we explicitly state the task/dataset differences and provide contextual comparisons, while emphasizing that our work focuses on deployment-oriented smart-home occupancy sensing under strict edge constraints.

  • (Section 3) To validate the proposed technique, we performed the action recognition task using a human activity dataset and a DVS event simulator [17]. The public NTU RGB+D 120 human activity dataset includes a large-scale benchmark containing 120 action classes and 114,480 samples captured from 106 subjects [18]. The dataset provides synchronized multi-modal streams including RGB, depth, infrared (IR), and 3D skeletons (25 body joints) recorded with three Microsoft Kinect v2 cameras. On NTU RGB+D 120, Table 1 reports a representation-level ablation in which the downstream network and training protocol are fixed and only the event-to-image encoding is changed. The proposed polarity-agnostic recency image improves top-1 accuracy from 89.6% (temporal accumulation) to 90.8% (+1.2 percentage points), indicating that recency-aware edge encoding provides more stable cues than pure event counts without altering the downstream architecture. The improvement indicates that the timestamp-based encoding provides a more informative and robust representation than temporal accumulation, especially when the motion level is low or intermittent. This is meaningful for real edge environments because the input stream is often sparse and non-stationary, and a stable representation can directly improve downstream recognition reliability without increasing the data volume. For context, recent NTU120 action-recognition literature (2023–2025) typically reports accuracies around 90–92% for strong skeleton-based methods, while multimodal approaches may reach into the low-to-mid 90% range with additional modalities and higher compute [19–24]. Because our Table 1 is designed to isolate the impact of encoding under a fixed downstream network (and uses event-simulated inputs), these numbers serve as a contextual benchmark level rather than a direct SOTA comparison.
  • (Section 4) To contextualize compute and accuracy on modern event-based detection benchmarks, we reference RVT [31] and its recent sparse-attention variant SAST [32], which are representative state-of-the-art event-based detectors on Prophesee automotive datasets (e.g., Gen1 and 1Mp). In these benchmarks, the event sensor is mounted on a moving vehicle, so the input stream contains strong ego-motion and dense background events (roads, buildings, trees, signs), making foreground separation substantially more challenging than static-camera indoor monitoring. SAST further reports backbone compute in terms of FLOPs and “A-FLOPs” (attention-related FLOPs) averaged over test samples, showing that RVT and SAST typically operate in the 0.8–2.2G A-FLOPs regime (and higher when counting full backbone FLOPs), reflecting the need for large transformer-style backbones to achieve high mAP under automotive ego-motion. The much smaller compute of our method (81 MFLOPs) reflects a different problem setting and design objective. First, our target scenario assumes a static DVS in indoor monitoring, where background is largely suppressed and the stream is dominated by edges induced by moving subjects; therefore, the detector primarily needs to recognize edge shape presence (e.g., person vs. non-person or presence counting) rather than solve full-scene, ego-motion-compensated multi-object detection. Second, we intentionally design an ultra-lightweight backbone (e.g., aggressive channel reduction, removing layers, and using larger early strides) to minimize edge compute, which directly reduces FLOPs by an order of magnitude compared to SOTA transformer backbones. Third, in our static-camera setting the DVS naturally suppresses background, which reduces the number of candidate regions/proposals and contributes to runtime reduction beyond what FLOPs alone predicts (we observe large end-to-end latency reductions together with compute reductions). In contrast, RVT/SAST are designed for automotive ego-motion benchmarks and aim at high mAP under dense background events, which requires substantially larger capacity and attention computation even after sparsity optimization.

  • (Section 5) Table 3 shows the comparisons between Vanilla HRNet [36] and the proposed lightweight HRNet in terms of model size, accuracy, and processing time. As shown in Table 3, the proposed HRNet reduces the model size from 127 MB to 19 MB and improves processing time from 70 ms to 6 ms on Nvidia Titan X, while maintaining comparable pose estimation accuracy (0.95 to 0.94). We confirmed that we could achieve more than an 11× speed-up by using event-based processing while maintaining the accuracy. Recent frame-based 2D human pose estimation commonly adopts HRNet-W48 as a strong baseline, and performance improvements are often achieved by either increasing model capacity or using multiple networks during training/inference. In particular, DE-HRNet [37] enhances HRNet-style high-resolution features by introducing detail-enhancement components and reports a modest gain on COCO test-dev (384×288): compared with HRNet-W48 (AP 0.755 with 63.6M parameters), DE-HRNet-W48 reports AP 0.757 while increasing the backbone size and compute to 74.8M parameters (which corresponds to an FP16 checkpoint size increase from roughly 127 MB to 150 MB). In a different direction, Boosting Semi-Supervised 2D HPE [38] improves accuracy primarily via stronger semi-supervised training and, for higher performance, adopts a dual-network setting (i.e., two identical yet independent networks). In their COCO test-dev comparison, the dual-network configuration effectively doubles model capacity relative to a single HRNet-W48 backbone and achieves improved accuracy (e.g., AP 0.772 in a dual setting). In contrast, our work targets event-based edge deployments (e.g., static-camera indoor monitoring) and focuses on improving the event-to-image representation (polarity-agnostic recency encoding) so that competitive accuracy and real-time operation can be achieved without scaling up the downstream network. In other words, while these benchmarks exemplify the common strategy of improving COCO pose accuracy by increasing backbone capacity or the number of networks, our approach emphasizes representation-level efficiency tailored to DVS streams and edge constraints.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revised manuscript demonstrates clear improvements in the clarity of presentation, methodological framing, and experimental description. The contribution is now more precisely positioned as a deployment-oriented edge AI study, with a clearer explanation of the polarity-agnostic timestamp-based encoding and its relation to existing time-surface representations. The experimental sections provide better-defined datasets, evaluation protocols, and condition-dependent analysis, and the discussion of scope and limitations is more explicit and balanced.

Author Response

Reviewer’s comment. The revised manuscript demonstrates clear improvements in the clarity of presentation, methodological framing, and experimental description. The contribution is now more precisely positioned as a deployment-oriented edge AI study, with a clearer explanation of the polarity-agnostic timestamp-based encoding and its relation to existing time-surface representations. The experimental sections provide better-defined datasets, evaluation protocols, and condition-dependent analysis, and the discussion of scope and limitations is more explicit and balanced.

Response: Thank you for the positive second-round feedback. We believe these revisions strengthen the paper as a deployment-oriented edge AI study, and we thank the reviewer for the constructive guidance that helped improve the manuscript.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Dear Authors,

I am particularly impressed by your efforts to address all comments with attention to detail and depth, which clearly demonstrates your dedication to the scientific value of this work. Your work is an inspiration to readers.

I have no further questions or comments regarding the article.

Best regards!

Author Response

Reviewer’s comment.

Dear Authors,

I am particularly impressed by your efforts to address all comments with attention to detail and depth, which clearly demonstrates your dedication to the scientific value of this work. Your work is an inspiration to readers.

I have no further questions or comments regarding the article.

Best regards!

Response: Thank you very much for your kind and encouraging comments. We sincerely appreciate your careful review and the recognition of our efforts to address the feedback in depth. Your supportive remarks are greatly motivating, and we are grateful for your conclusion that no further revisions are required.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

I have some follow-up comments on the revised article.
Follow-up comment 1. Please enhance the resolution of all figures. For acceptable figures, the content must be clear (no content is blurred) after zooming in the figures to 200%.
Follow-up comment 2. A literature review was missing after Section 1.
Follow-up comment 3. Figure 4: How do the “object motion” and “image” relate to the “DVS event stream”?
Follow-up comment 4. Very limited results were shown in Table 1. Please present them in the main paragraph.
Follow-up comment 5. Tables 2 and 3: More methods should be added to the comparison.
Follow-up comment 6. Figure 8. What is “(s)”?
Follow-up comment 7. Ablation studies were missing.
Follow-up comment 8. Discuss the limitations of the proposed work.

Author Response

Summary

Thank you for your careful review and constructive comments. We sincerely appreciate your suggestions, which helped us substantially improve the clarity, completeness, and presentation quality of the manuscript. We hope these revisions adequately address your comments and improve the manuscript’s quality and impact. Thank you again for your time and valuable feedback.

Comments 1: Please enhance the resolution of all figures. For acceptable figures, the content must be clear (no content is blurred) after zooming in the figures to 200%.

Response 1: Thank you for the comment. In the revised manuscript, we verified figure readability by zooming the final exported document to 200%. We confirm that Figures 2, 3, 4, 6, and 8 remain clear and fully legible at 200% zoom with no blurred content. For Figures 1, 5, and 7, the displayed sensor examples are intrinsically limited by the native VGA sensor resolution (640 × 480), and therefore they cannot provide spatial detail beyond the sensor’s inherent resolution without introducing artificial interpolation artifacts. In addition, the blur observed in Figure 1(a) is not due to insufficient figure resolution, but rather motion blur in the CIS frame caused by human motion during exposure.

Comments 2: A literature review was missing after Section 1.

Response 2: Thank you for the comment. In the revised manuscript, we added a dedicated literature review immediately after the Introduction as Section 2.1 (Literature Review). This section summarizes key event-based vision surveys (e.g., Gallego et al., IEEE TPAMI, 2022; Cimarelli et al., Sensors, 2025) and positions our work with respect to established event representations and deployment considerations, thereby addressing the previously missing literature review.

Comments 3: Figure 4: How do the “object motion” and “image” relate to the “DVS event stream”?

Response 3: Thank you for the question. We clarified the relationship between object motion, the DVS event stream, and the generated image in the main text. Specifically, object motion produces brightness changes along moving edges, which the DVS reports as an asynchronous event stream. Our method converts this stream at each sampling time into a polarity-agnostic recency image by storing the most recent timestamp per pixel and mapping Δt to intensity via Equation (2).

  • (Section 3) Figure 4 illustrates the conceptual flow of the proposed method. When an object moves, brightness changes occur mainly at its boundaries, and the DVS emits events at the corresponding pixels. The resulting event stream is accumulated only in the sense of updating a per-pixel last-timestamp memory (polarity ignored). At a chosen sampling time, the stored timestamps are mapped to an intensity image via Equation (2), so that recently active edges appear bright while inactive/background pixels fade, yielding an image-form input for downstream inference.

Comments 4: Very limited results were shown in Table 1. Please present them in the main paragraph.

Response 4: Thank you for the comment. In the revised manuscript, we explicitly described the Table 1 results in the main text and clarified the underlying rationale. Specifically, we explained that for NTU RGB+D 120 actions with slow, subtle, or intermittent motion, the event stream becomes sparse and temporally uneven, so simple temporal accumulation over a window can produce ambiguous event images (e.g., weak/fragmented edges or mixed traces). We then explained that the proposed timestamp-based recency encoding reduces this ambiguity by emphasizing how recently each pixel was activated, preserving the most recent moving-edge structure while suppressing stale residual activity, thereby yielding a more discriminative input under low-motion regimes and contributing to the observed accuracy improvement (0.896 → 0.908).

  • (Section 3) On the NTU RGB+D 120 dataset, we evaluated the proposed timestamp-based (recency) encoding while keeping the downstream recognition network and training protocol fixed, in order to isolate the effect of the event-image representation. As summarized in Table 1, the proposed encoding improves the overall accuracy from 0.896 (temporal accumulation) to 0.908. This improvement is consistent with the underlying signal characteristics of event streams in NTU120: for actions with slow, subtle, or intermittent motion, the event rate becomes sparse and temporally uneven, and simple accumulation over a window can yield ambiguous images (e.g., weak edges, fragmented contours, or mixed traces from temporally separated micro-motions). In contrast, the timestamp-based recency mapping emphasizes how recently each pixel was activated (rather than how many events were accumulated), which helps preserve the most recent moving-edge structure while suppressing stale residual activity within the window. Consequently, the proposed recency representation provides a more discriminative input for motion-centric recognition, particularly under low-motion or intermittent-event regimes that are common in fine-grained NTU120 actions.

Comments 5: Tables 2 and 3: More methods should be added to the comparison.

Response 5: Thank you for the suggestion. We expanded Tables 2 and 3 by adding representative recent baselines. Specifically, Table 2 now includes additional event-based object detection frameworks (e.g., RVT) with their reported computational costs (FLOPs) and runtime where available. Table 3 was also extended with widely used COCO keypoint estimation baselines to provide a broader reference point in terms of model size/compute and AR@0.5 (OKS). These additions strengthen the comparative context beyond the original two-row tables.

  • (Table 2) As a result, we achieved more than an 11× speed-up by using event-based processing, as shown in Table 2. To contextualize compute and accuracy on modern event-based detection benchmarks, we reference RVT and its recent sparse-attention variant SAST [31], which are representative state-of-the-art event-based detectors on Prophesee automotive datasets (e.g., Gen1 and 1Mp). In these benchmarks, the event sensor is mounted on a moving vehicle, so the input stream contains strong ego-motion and dense background events (roads, buildings, trees, signs), making foreground separation substantially more challenging than static-camera indoor monitoring. SAST further reports backbone compute in terms of FLOPs and “A-FLOPs” (attention-related FLOPs) averaged over test samples, showing that RVT and SAST typically operate in the 0.8–2.2G A-FLOPs regime (and higher when counting full backbone FLOPs; for example, the computational complexity of the detection head based on YOLOX [32] is 281.9 GFLOPs, as shown in Table 2), reflecting the need for large transformer-style backbones to achieve high mAP under automotive ego-motion. The much smaller compute of our method (81 MFLOPs) reflects a different problem setting and design objective. First, our target scenario assumes a static DVS in indoor monitoring, where background is largely suppressed and the stream is dominated by edges induced by moving subjects; therefore, the detector primarily needs to recognize edge shape presence (e.g., person vs. non-person or presence counting) rather than solve full-scene, ego-motion-compensated multi-object detection. Second, we intentionally design an ultra-lightweight backbone (e.g., aggressive channel reduction, removing layers, and using larger early strides) to minimize edge compute, which directly reduces FLOPs by an order of magnitude compared to SOTA transformer backbones. Third, in our static-camera setting the DVS naturally suppresses background, which reduces the number of candidate regions/proposals and contributes to runtime reduction beyond what FLOPs alone predicts (we observe large end-to-end latency reductions together with compute reductions). In contrast, RVT/SAST are designed for automotive ego-motion benchmarks and aim at high mAP under dense background events, which requires substantially larger capacity and attention computation even after sparsity optimization.

Table 2. Computation comparison between conventional and event-based vision processing.

FRCNN+FPN | Number of Layers ¹ | FLOPs | Accuracy | Processing Time (ms)
Conventional | 91 | 5.8G ² | 0.96 ³ | 172 ²
Event-based | 24 | 81M ² | 0.95 ³ | 15 ²
RVT [31] | YOLOX ⁴ [32] | 281.9G ⁵ | 0.512 ⁶ | 17.3 ⁵

¹ Backbone network, ² Computed on Titan X, ³ Recall accuracy, ⁴ Detection head, ⁵ Computed on Tesla V100, ⁶ Measured AP on COCO dataset
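As a quick sanity check of the headline ratios quoted above, the snippet below recomputes them directly from the Titan X rows of Table 2; it is only arithmetic on the reported values, not a new measurement.

```python
# Quick check of the ratios implied by Table 2 (Titan X measurements).
conv_flops, event_flops = 5.8e9, 81e6   # conventional vs. event-based FLOPs
conv_ms, event_ms = 172.0, 15.0         # end-to-end processing time in ms

print(f"FLOPs reduction : {conv_flops / event_flops:.1f}x")  # ~71.6x
print(f"Latency speed-up: {conv_ms / event_ms:.1f}x")        # ~11.5x, i.e. "more than 11x"
```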

- reference 32: Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.

  • (Table 3) Table 3 shows the comparison between the vanilla HRNet [36] and the proposed lightweight HRNet in terms of model size, accuracy, and processing time. As shown in Table 3, the proposed HRNet reduces the model size from 127 MB to 19 MB and improves the processing time from 70 ms to 6 ms on an Nvidia Titan X, while maintaining comparable pose estimation accuracy (0.95 to 0.94). We confirmed that we could achieve more than an 11× speed-up by using event-based processing while maintaining accuracy. Recent frame-based 2D human pose estimation commonly adopts HRNet-W48 as a strong baseline, and performance improvements are often achieved by either increasing model capacity or using multiple networks during training/inference. In particular, DE-HRNet [37] enhances HRNet-style high-resolution features by introducing detail-enhancement components and reports a modest gain on COCO test-dev (384×288): compared with HRNet-W48 (AP 0.755 with 63.6M parameters), DE-HRNet-W48 reports AP 0.757 while increasing the backbone size and compute to 74.8M parameters (which corresponds to an FP16 checkpoint size increase from roughly 127 MB to 150 MB; a rough parameter-to-checkpoint-size estimate is sketched after Table 3). In a different direction, Boosting Semi-Supervised 2D HPE [38] improves accuracy primarily via stronger semi-supervised training and, for higher performance, adopts a dual-network setting (i.e., two identical yet independent networks). In their COCO test-dev comparison, the dual-network configuration effectively doubles model capacity relative to a single HRNet-W48 backbone and achieves improved accuracy (e.g., 0.952 AR at OKS ≥ 0.5 in the dual setting). In contrast, our work targets event-based edge deployments (e.g., static-camera indoor monitoring) and focuses on improving the event-to-image representation (polarity-agnostic recency encoding) so that competitive accuracy and real-time operation can be achieved without scaling up the downstream network. In other words, while these benchmarks exemplify the common strategy of improving COCO pose accuracy by increasing backbone capacity or the number of networks, our approach emphasizes representation-level efficiency tailored to DVS streams and edge constraints.

Table 3. Comparisons between the vanilla HRNet and the proposed lightweight HRNet for DVS-based pose estimation. The table reports model size, accuracy, and inference latency measured on an Nvidia Titan X; the dual-network Boosting Semi-Supervised 2D HPE [38] is included as an additional frame-based reference. The proposed HRNet substantially reduces the model size (127 MB → 19 MB) and processing time (70 ms → 6 ms) while maintaining comparable accuracy with only a minor decrease (0.95 → 0.94), demonstrating that pruning redundant modules and connections is effective for real-time edge deployment.

Model | Size | Accuracy | Processing Time @Nvidia Titan X
Vanilla HRNet | 127 MB | 0.95 | 70 ms
Proposed HRNet | 19 MB | 0.94 | 6 ms
Boosting Semi-Supervised 2D HPE [38] | 254 MB | 0.952 | ~140 ms ¹

¹ Estimated from the model size [38]
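The FP16 checkpoint sizes quoted in the discussion of Table 3 can be reproduced with a back-of-the-envelope estimate (2 bytes per parameter, ignoring buffers, optimizer states, and file-format overhead); the helper below is only that rough estimate, not a measurement.

```python
def fp16_checkpoint_mb(num_params):
    """Approximate FP16 checkpoint size in MB: 2 bytes per parameter."""
    return num_params * 2 / 1e6

print(f"HRNet-W48    : ~{fp16_checkpoint_mb(63.6e6):.0f} MB")  # ~127 MB
print(f"DE-HRNet-W48 : ~{fp16_checkpoint_mb(74.8e6):.0f} MB")  # ~150 MB
```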

Comments 6: Figure 8. What is “(s)”?

Response 6: Thank you for pointing this out. We clarified Figure 8 by explicitly defining ‘(s)’ as the stride and updated the figure caption to use standard ‘kernel size, stride’ notation (e.g., 3×3, stride 1).

  • (Figure 8 Caption) Network structure for low-latency posture recognition based on DVS. The proposed model uses five convolutional layers to extract posture-related features from sparse event-derived images and two fully-connected layers for final classification, enabling reliable HMI operation with high recall (99.19%), low FAR (0.0926%), and real-time latency (14.31 ms on i5-4590 single core). We use the notation ‘k×k+(s)’ to indicate kernel size and stride for each convolution/pooling layer.
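As a structural illustration of the Figure 8 classifier (five convolutional layers followed by two fully connected layers), the PyTorch sketch below mirrors that layer count only; all channel widths, kernel sizes, strides, the pooling choice, and the number of posture classes are placeholder assumptions, not the configuration reported in the paper.

```python
import torch
import torch.nn as nn

class PostureNetSketch(nn.Module):
    """Illustrative 5-conv / 2-FC classifier in the spirit of Figure 8.

    All hyperparameters below are placeholders, NOT the paper's values.
    """
    def __init__(self, num_classes=5, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d((4, 4)),   # keeps the FC input size fixed for any resolution
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(inplace=True),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):   # x: (N, 1, H, W) sparse, largely binary event image
        return self.classifier(self.features(x))

# Example: a (1, 1, 240, 320) input yields a (1, num_classes) logit tensor.
logits = PostureNetSketch(num_classes=5)(torch.zeros(1, 1, 240, 320))
```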

Comments 7: Ablation studies were missing.

Response 7: Thank you for the comment regarding the lack of ablation studies. In the revised manuscript, we added an ablation analysis in Section 6 motivated by our prior motion-recognition work (“Computationally efficient, real-time motion recognition based on bio-inspired visual and cognitive processing”), which uses five convolutional layers and three fully connected layers. Building on this baseline design, we systematically varied kernel size and stride in the early stages of the ConvNet (Figure 8), fine-tuned the model after each change, and selected the final configuration to minimize FLOPs while keeping recall accuracy and FAR unchanged. This clarifies which structural parameters most strongly affect efficiency and supports the final compact architecture reported in Section 6.

  • (Section 6) Leveraging the sparsity and binary nature of DVS images, we designed a low-latency CNN classifier as shown in Figure 8. The convolutional stack extracts hierarchical edge/motion features from the sparse inputs, and the fully connected layers perform compact classification. Following the ConvNet-based motion-recognition architecture in our prior work, which employs five convolutional layers and three fully connected layers [12], we started from an analogous lightweight classifier for the HMI task and performed an ablation study to minimize computation while preserving reliability. In particular, we systematically varied the kernel size and stride of the early convolution/pooling stages, because these parameters dominate the spatial downsampling rate, the feature-map size, and thus the overall FLOPs. After each architectural change, the network was fine-tuned and evaluated on the held-out test set; only configurations that maintained recall accuracy and FAR at the same level as the reference configuration were retained. This ablation-driven procedure resulted in the final compact design in Figure 8, achieving a substantially reduced computational cost. We implemented the proposed algorithm on a low-end PC to validate feasibility under realistic edge constraints. The measured recall and False Acceptance Rate (FAR) were 99.19% and 0.0926%, respectively.
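To make the FLOPs argument behind the kernel-size/stride ablation concrete, a simple per-layer estimate (multiply–accumulate count of a single convolution with "same" padding) is sketched below; the layer dimensions in the example are illustrative, not the actual configuration of Figure 8.

```python
def conv2d_flops(h, w, c_in, c_out, kernel, stride):
    """Approximate MAC count of one conv layer with 'same' padding:
    the output is roughly (h/stride) x (w/stride), and each output value
    costs kernel*kernel*c_in multiply-accumulates per output channel."""
    h_out, w_out = h // stride, w // stride
    return h_out * w_out * c_out * kernel * kernel * c_in

# Illustrative early-layer comparison (not the paper's exact configuration):
print(conv2d_flops(240, 320, 1, 16, kernel=5, stride=1))  # ~30.7 MFLOPs
print(conv2d_flops(240, 320, 1, 16, kernel=3, stride=2))  # ~2.8 MFLOPs
```

Because a larger early stride also shrinks every downstream feature map, its effect compounds through the rest of the network, which is why the early convolution/pooling stages dominate the overall compute.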

Comments 8: Discuss the limitations of the proposed work.

Response 8: Thank you for the comment. We explicitly discuss the limitations of the proposed approach at the end of Section 7 (Discussions and Conclusions). In particular, we clarify the inherent motion–static trade-off of DVS sensing (i.e., perfectly stationary targets may generate few/no new events, which can weaken purely event-driven detection during long static periods). We also clarify the scope of this work as home-surveillance occupancy detection (person present/absent) and describe a practical mitigation strategy by combining detection with a lightweight bounding-box tracking module to maintain target state across static intervals. Finally, we state our future work direction to build and evaluate the integrated detection–tracking system for more robust operation under long static periods and slow-motion scenarios.

  • (Section 7) A known limitation of event-based sensing is that perfectly stationary objects may generate few or no new events, which can reduce instantaneous evidence for purely event-driven detection during long static periods. This motion–static trade-off is inherent to DVS sensing and is particularly relevant to surveillance scenarios where an intruder might remain motionless. However, our target application is home-surveillance occupancy detection (person present/absent) rather than fine-grained static recognition. In such a system, robustness to long static periods can be strengthened by combining object detection with a lightweight tracking module [46] operating on the detected bounding boxes. Specifically, once a person is detected, a tracker can maintain and update the target state over time and distinguish among three practically important cases: (i) the target leaves the field of view (track termination near image boundaries or consistent outward motion), (ii) the target enters the field of view (track initialization with inward motion), and (iii) the target remains in the scene with little or no motion (a persistent track with minimal displacement and low event rate). These considerations highlight that, while DVS sensing is intrinsically motion-driven, a practical occupancy-detection system can explicitly handle static intervals through tracking-based state maintenance without sacrificing the low-latency, edge-efficient nature of the proposed technique. As future work, we plan to build and evaluate a complete end-to-end home-surveillance system that integrates the proposed DVS-based detection with a lightweight bounding-box tracking module, enabling more robust state reasoning (enter/leave/static) under long static periods and slow-motion scenarios.
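As a minimal illustration of the tracking-based state reasoning described above (enter / leave / static-but-present), the sketch below uses toy displacement and border thresholds; the track representation and all threshold values are assumptions for illustration, not the design of an implemented system.

```python
from dataclasses import dataclass

@dataclass
class Track:
    cx: float                     # bounding-box center x (pixels)
    cy: float                     # bounding-box center y (pixels)
    frames_since_event: int = 0   # frames with (almost) no new events on the target

def track_state(curr: Track, prev: Track, img_w: int, img_h: int,
                border: int = 20, min_disp: float = 2.0,
                max_quiet_frames: int = 30) -> str:
    """Toy enter/leave/static reasoning for occupancy detection."""
    disp = ((curr.cx - prev.cx) ** 2 + (curr.cy - prev.cy) ** 2) ** 0.5
    near_border = (curr.cx < border or curr.cy < border or
                   curr.cx > img_w - border or curr.cy > img_h - border)
    if near_border and disp > min_disp:
        return "entering-or-leaving"   # consistent motion at the image boundary
    if disp < min_disp and curr.frames_since_event > max_quiet_frames:
        return "static-but-present"    # persistent track, few new events
    return "moving-inside"
```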

Author Response File: Author Response.pdf
