A Neuro-Symbolic Approach to Fall Detection via Monocular Depth Estimation

Xu, Yinghai; Kim, Bongjun; Wang, In-Nea; Jeong, Junho

doi:10.3390/app16041895

¹ Department of Computer Science and Engineering, Dongguk University, Seoul 04620, Republic of Korea

² IoT Convergence and Open Sharing System, Dongguk University, Seoul 04623, Republic of Korea

* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 1895; https://doi.org/10.3390/app16041895

Submission received: 17 November 2025 / Revised: 21 December 2025 / Accepted: 16 January 2026 / Published: 13 February 2026

Abstract

Falls remain a critical safety concern in surveillance settings, yet monocular RGB methods often degrade in multi-person scenes with occlusion and loss of three-dimensional cues. This study proposes a neuro-symbolic framework that restores physically interpretable depth proxies from monocular video and fuses them with skeleton-based spatio-temporal inference for robust fall detection. The pipeline estimates per-frame depth and 2D skeletons, recovers world coordinates for key joints, and derives absolute neck height and vertical descent rate for rule-based adjudication, while a neural method operates on joint trajectories; final decisions combine both streams with a logical policy and short-horizon temporal consistency. Experiments in a realistic indoor testbed with multi-person activity compare three configurations—neural, symbolic, and fused. The fused neuro-symbolic method achieved an accuracy of 0.88 and an F1 score of 0.76 on the real surveillance test set, outperforming the neural method alone (accuracy 0.81, F1 0.64) and the symbolic method alone (accuracy 0.77, F1 0.35). Gains arise from complementary error profiles: depth-derived, rule-based cues suppress spurious positives on non-fall frames, while the neural stream recovers true falls near rule boundaries. These findings indicate that integrating monocular depth proxies with interpretable rules improves reliability without additional sensors, supporting deployment in complex, multi-person surveillance environments.

Keywords: fall detection; neuro-symbolic learning; monocular depth estimation; skeleton-based action recognition; spatio-temporal graph convolutional networks; video surveillance

1. Introduction

Falls constitute a major safety concern across all age groups and can lead to severe injury, particularly among older adults and in crowded or constrained environments, thereby necessitating rapid detection and response [1,2]. In parallel with advances in computer vision, a growing body of work has explored human activity recognition and fall detection using closed-circuit television (CCTV) and related video-surveillance infrastructures, with promising applicability not only in hospitals and long-term care facilities but also in broader safety-critical surveillance settings [3]. Detention facilities are a representative example: limited staff must supervise multiple detainees simultaneously [4], which hinders real-time situation awareness and timely intervention when abnormal events such as falls occur. Consequently, deploying anomaly detection systems tailored to this context has become essential for ensuring detainee safety and improving operational efficiency.

Monocular camera-based fall detection models [5] often achieve high accuracy under controlled conditions but degrade in real surveillance scenarios due to camera viewpoint, illumination variability, and background complexity [6]. In multi-person scenes, overlaps and occlusions increase joint localization errors, causing confusion in activity recognition and leading to false fall alarms. Moreover, the projection of a 3D scene onto a 2D image discards critical cues such as depth, absolute height, and viewpoint invariance [7], so rapid non-fall motions (e.g., abrupt sitting or standing) may be misclassified as falls. To mitigate these limitations, skeleton-based approaches explicitly model the human joint–bone topology, reducing sensitivity to appearance and background changes and providing improved robustness in complex scenes [8].

Nevertheless, skeleton-based approaches exhibit structural constraints in real deployments. In multi-person scenes, frequent occlusions and overlaps cause keypoint dropouts and cross-identity skeleton swaps, leading to tracking interruptions and increased false positives and false negatives. Ethical and safety considerations further limit the availability of real fall footage, making it difficult to secure the diversity and balance required for training [9]. Synthetic data generated in simulation engines have been introduced to mitigate this issue by systematically producing fall and non-fall scenarios across varied viewpoints, illumination, and backgrounds, thereby improving generalization [10]; however, the synthetic-to-real distribution gap and occlusion-induced skeleton missingness persist. Consequently, fall detection suitable for operational surveillance should augment skeleton-based neural inference with complementary cues and interpretable rules, notably by integrating monocular depth-derived three-dimensional indicators such as absolute body height and vertical descent rate.

This study proposes a neuro-symbolic framework for reliable fall detection in surveillance settings that employ only a fixed monocular RGB camera, even under multi-person, occlusion-prone conditions. Monocular depth estimation is used to produce per-frame depth maps, which are fused with pose-estimated skeletons to recover three-dimensional proxy variables at the neck landmark, specifically absolute height above the floor and vertical descent rate. These variables are evaluated by a rule set with physically interpretable thresholds for floor proximity and descent speed to yield the symbolic decision. In parallel, a skeleton-based spatio-temporal model trained on game-engine synthetic data classifies behavior and outputs a fall probability. The final frame-level label is obtained by logically fusing the two outputs and enforcing short-term temporal consistency, with a policy that flags a frame as a fall if at least one person is detected as falling within that frame. By combining the sensitivity of the learned model with the precision and transparency of rule-based reasoning, the approach simultaneously reduces false positives and false negatives while providing threshold-based interpretability. Limiting inputs to RGB lowers deployment cost, whereas integrating monocular depth cues enhances robustness in complex environments such as detention facilities; the framework is also amenable to extension toward multi-person anomaly detection beyond falls.

The remainder of this paper is organized as follows. Section 2 surveys prior work on fall detection and summarizes relevant advances. Section 3 details the proposed neuro-symbolic architecture and its core components. Section 4 describes the experimental environment, dataset construction, and evaluation metrics. Section 5 presents comparative experiments and both quantitative and qualitative analyses to validate the proposed approach. Section 6 discusses limitations and avenues for improvement. Section 7 concludes the paper.

2. Related Work

2.1. RGB-Only Appearance and Motion Approaches

RGB-based human activity recognition and fall detection offer high cost-effectiveness and accessibility because they can operate on existing CCTV infrastructure without additional sensors. Early studies relied on handcrafted features—such as background subtraction, optical flow, and motion history images—to summarize either region-of-interest dynamics or global motion for classification [11]. With advances in convolutional and recurrent neural networks, two-stream architectures were introduced in which one pathway learns static appearance and pose while the other captures optical-flow-based motion cues, thereby improving recognition accuracy [12]. Building on body-part detection, part-centric analyses have further quantified local movements; for example, Amir et al. reported accurate separation of falls from non-fall activities with 89.09% on UCF and 88.26% on IM event datasets using spatio-temporal multi-dimensional features [13,14,15].

Despite these developments, RGB-only approaches—often designed and evaluated under controlled conditions—tend to be vulnerable in operational surveillance due to illumination changes, background clutter, occlusion and overlap, and viewpoint diversity. Moreover, because RGB provides only a two-dimensional projection, it is difficult to recover three-dimensional geometric cues such as subject–camera distance, absolute body height, and posture inclination in a reliable manner [7]. Consequently, rapid non-fall motions (e.g., abrupt sitting or standing) and floor-contact assessment can yield elevated false positives and false negatives. To address these limitations, the next subsection reviews 3D-cue-based methods, including RGB-D sensors and monocular depth estimation.

2.2. Depth-Augmented Approaches

RGB-D-based fall detection has attracted sustained interest because depth cues, when combined with RGB, enable quantitative characterization of spatial layout and postural transitions [16]. Following the proliferation of Kinect-class sensors, numerous methods leveraged physically meaningful indicators—distance to floor, absolute body-part height, and abrupt posture changes—for decision making. Xu et al. applied fixed-threshold rules to skeletons extracted from Kinect V2 and demonstrated real-time feasibility [17], while Kong et al. fused RGB and depth to distinguish normal, abnormal, and fall states, empirically confirming the utility of depth information [18]. Nida et al. further integrated silhouettes, 3D coordinates, geodesic distances, skeleton-based motion-capture features, and waypoint trajectories, and employed a neuro-fuzzy classifier, reporting high accuracy on NTU RGB + D, UoL 3D Social Activity, and CAD datasets [19,20,21,22]. In parallel, depth-only pipelines have been explored for privacy-sensitive settings, including real-time systems that combine automatic sensor calibration with probabilistic state estimation [23] and lightweight rule-based designs that classify behaviors using distance and positional features derived solely from depth maps [24].

Despite these advances, depth-sensor approaches face practical limitations in operational surveillance: reliance on dedicated hardware, blind spots arising from field-of-view and range constraints, and quality degradation under reflections, occlusions, and illumination changes. Large or cluttered spaces and multi-person scenes also increase deployment cost, operational complexity, and the difficulty of instance separation under occlusion. Motivated by these constraints, the present work adopts monocular depth estimation to recover 3D proxy variables from RGB alone and integrates them with a skeleton-based neural module and a rule-based symbolic module. This design preserves deployment simplicity while introducing physically interpretable indicators, thereby aiming to improve robustness and interpretability in real-world environments.

2.3. Skeleton-Based Spatio-Temporal Approaches

Skeleton-based methods classify actions by modeling joint–bone coordinate sequences in space and time, which makes them comparatively insensitive to appearance and background variation and thus attractive for surveillance. Early work fed skeletons extracted by pose estimators such as OpenPose into CNN-LSTM pipelines to learn motion patterns [25,26,27,28]. Subsequently, Spatio-Temporal Graph Convolutional Networks (ST-GCN) were introduced, representing joints as graph nodes and connecting them along kinematic links and temporal edges to capture structured human dynamics more effectively [29]. Additional progress on model compactness and regularization has yielded further gains in efficiency and generalization on public benchmarks such as NTU RGB + D [20,30]. Recent advances in real-time multi-person pose estimation have also improved skeleton quality in complex scenes; notably, one-stage architectures with coordinate classification have achieved a favorable balance between speed and accuracy [31].

Despite these advances, operational surveillance remains challenging. Multi-person scenes frequently induce overlaps and occlusions, leading to keypoint dropouts, cross-identity skeleton swaps, and tracking interruptions, which in turn elevate false positives and false negatives. Moreover, purely two-dimensional skeletons do not directly encode critical three-dimensional cues such as absolute height above the floor or vertical descent rate. To address these limitations, the present work augments the spatio-temporal skeleton representation with three-dimensional proxy variables recovered via monocular depth estimation and fuses the resulting neural inference with a rule-based symbolic module, thereby enhancing robustness and interpretability in occlusion-prone multi-person environments.

2.4. Knowledge- and Logic-Augmented Approaches

Although RGB-, RGB-D-, skeleton-, and depth-cue-based methods each offer distinct advantages, recurrent limitations in interpretability and out-of-distribution generalization have been documented under complex surveillance conditions [32]. Neural approaches centered on deep learning achieve high predictive performance by learning intricate patterns from large-scale data; however, the internal decision processes are difficult to inspect and verify, constraining trustworthy deployment in safety-critical settings [33]. In contrast, symbolic approaches derive decisions from explicitly defined rules and logic, yielding clear rationales and auditability, yet they exhibit limited expressive power when directly handling continuous, noise-prone video streams; accordingly, recent surveys argue for tighter integration with perceptual models [34]. Motivated by this complementarity, neuro-symbolic research seeks to combine the representational and predictive strengths of neural models with the transparency and controllability of symbolic reasoning [34].

Within this line of work, knowledge graph methods encode high-level context by modeling objects (e.g., person, chair, floor), attributes (e.g., position, state), and relations (e.g., leaning, collision) as nodes and edges, thereby improving both performance and explainability in composite behaviors that are difficult to discriminate from frame-level features alone. Ma et al. integrated a visual knowledge graph into a learned model and reported gains under limited data regimes [35]. Scene-graph-based methods, in turn, construct an object relation graph per frame and link these over time to form a video graph, enabling logical inference with predefined behavioral rules and providing narrative explanations of who did what, where, and when [36]. Zhuo et al. demonstrated strong explanatory power for complex action sequences using this paradigm [37]. Nonetheless, challenges remain, including dataset biases toward single-subject scenarios, the cost of graph construction and maintenance, and limited scalability to real surveillance environments characterized by multiple subjects and frequent occlusions.

To address the foregoing limitations, this work recovers three-dimensional proxy variables—specifically, absolute neck height above the floor and vertical descent rate—from depth estimated with a monocular camera and links these variables to physically grounded thresholds within a rule-based symbolic module. In parallel, a skeleton-based neural module provides a spatio-temporal representation of behavior. The two outputs are combined through a logical fusion procedure that includes a short-horizon temporal consistency check, thereby improving deployment practicality without dedicated RGB-D hardware and increasing the transparency of the decision rationale. The design aims to jointly reduce false positives and false negatives in multi-person, occlusion-prone surveillance settings, while supporting decision making with interpretable, physically meaningful criteria.

3. Proposed Fall Detection Method

This study proposes a neuro-symbolic fall detection framework that reconstructs three-dimensional cues from depth estimated with a monocular RGB camera and integrates these cues with a learned neural module and a rule-based reasoning module. We first outline the end-to-end pipeline and describe the principal components of each stage. We then detail input preprocessing, coordinate calibration, and the transformation from image coordinates to real-world coordinates using monocularly estimated depth. Finally, we present the decision algorithm that logically fuses spatio-temporal inference from skeleton sequences with symbolic rules derived from depth-induced proxy variables to determine the final fall label. The design seeks to combine the predictive strength of neural learning with the interpretability of symbolic reasoning, thereby addressing the complementary limitations of prior approaches and improving deployability in operational surveillance environments.

3.1. Overview of the Proposed Fall Detection Method

We propose a framework for precise fall discrimination in monocular RGB surveillance that combines three-dimensional coordinate recovery from estimated depth with a neuro-symbolic decision process. The pipeline comprises two stages. The first stage, termed data transformation, reconstructs three-dimensional spatial information from skeleton keypoints and per-pixel depth. The second stage, termed fall determination, applies a neural classifier and a rule-based classifier in parallel and then performs logical fusion to produce a final frame-level fall label. The overall workflow is illustrated in Figure 1.

The data transformation stage processes each RGB frame to obtain a two-dimensional skeleton $(x, y)$ from the pose estimator and a per-pixel depth value $d$ from the depth estimator. For a set of pre-acquired reference correspondences, the fused image–depth triplets $(x, y, d)$ are paired with measured world coordinates $(X, Y, Z)$. A linear image-to-world transform $W$ is estimated by least squares from these correspondences, and applying $W$ to each joint triplet $(x, y, d)$ yields the reconstructed three-dimensional position $(X, Y, Z)$.

The fall determination stage runs two modules in parallel. The neural module feeds the two-dimensional skeleton sequence to a behavior-recognition model trained on synthetic data and returns a fall probability, denoted $\hat{N}$. The symbolic module computes physically meaningful indicators from the reconstructed three-dimensional skeleton, including absolute height and vertical descent rate at key joints, and assigns a fall label according to a predefined rule set, producing $\hat{S}$. The final decision applies a Boolean fusion policy with a short-term temporal consistency constraint, thereby leveraging the sensitivity of data-driven inference together with the precision of rule-based reasoning for reliable and interpretable detection in single-camera, multi-person, occlusion-prone environments.

3.2. Image–World Coordinate Recovery

Each RGB video is processed at 30 fps. For each frame, RTMO yields two-dimensional skeleton keypoints $(x, y)$ [29] with confidences, and Depth-Anything-V2 provides a per-pixel depth map $d$ [34]. As illustrated in Figure 2, the skeleton comprises fourteen landmarks (indexed 0–13): thirteen canonical joints plus a neck landmark defined as the midpoint between the left and right shoulders. The connectivity of these keypoints is denoted by distinct colors, and the depth at each landmark is sampled from the predicted depth map. Keypoints with confidence below 0.3 are treated as missing and linearly interpolated when the gap is at most five consecutive frames; longer gaps remain missing. Per-joint trajectories are smoothed with a five-frame moving average, and coordinates are normalized by image width and height to lie in [0, 1]. For the neural module, fixed-length clips of $T = 80$ frames (approximately 2.7 s at 30 fps) are assembled with a sliding window. Shorter segments are right-padded with the last valid observation, while longer ones are truncated. From the normalized coordinates $P_{t}^{(j)} = (x_{t}^{(j)}, y_{t}^{(j)})$, we construct a coordinate stream and a motion stream using first-order differencing $\Delta P_{t}^{(j)} = P_{t}^{(j)} - P_{t-1}^{(j)}$ (cf. Equation (3)); these two streams are then provided to the ST-GCN as described in Equations (2)–(5).
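As a rough illustration of the trajectory clean-up described above, the following sketch masks low-confidence keypoints, interpolates gaps of at most five frames, smooths with a five-frame window, and pads or truncates to a fixed clip length. The function names are hypothetical, and both the centered alignment of the moving average and the treatment of gaps at the clip boundaries are assumptions, since the text does not specify them.

```python
from typing import List, Optional, Sequence

CONF_MIN = 0.3    # keypoints below this confidence are treated as missing
MAX_GAP = 5       # longest run of missing frames that is interpolated
SMOOTH_WIN = 5    # moving-average window, in frames
CLIP_LEN = 80     # fixed clip length T for the neural module

Track = List[Optional[float]]  # one coordinate of one joint over time


def mask_low_confidence(values: Sequence[float], confs: Sequence[float]) -> Track:
    """Replace coordinates whose confidence is below CONF_MIN with None."""
    return [v if c >= CONF_MIN else None for v, c in zip(values, confs)]


def fill_short_gaps(track: Track, max_gap: int = MAX_GAP) -> Track:
    """Linearly interpolate interior gaps of at most max_gap frames."""
    out = list(track)
    n, i = len(out), 0
    while i < n:
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < n and out[i] is None:
            i += 1
        gap = i - start
        # Assumption: gaps touching either end of the track are left missing.
        if start > 0 and i < n and gap <= max_gap:
            left, right = out[start - 1], out[i]
            for k in range(start, i):
                out[k] = left + (right - left) * (k - start + 1) / (gap + 1)
    return out


def moving_average(track: Track, win: int = SMOOTH_WIN) -> Track:
    """Centered moving average that skips missing samples (assumed alignment)."""
    half = win // 2
    out: Track = []
    for t, v in enumerate(track):
        if v is None:
            out.append(None)
            continue
        window = [w for w in track[max(0, t - half): t + half + 1] if w is not None]
        out.append(sum(window) / len(window))
    return out


def to_fixed_length(frames: list, length: int = CLIP_LEN) -> list:
    """Truncate long segments; right-pad short ones with the last frame."""
    if not frames:
        raise ValueError("cannot pad an empty segment")
    if len(frames) >= length:
        return frames[:length]
    return frames + [frames[-1]] * (length - len(frames))
```

Each function operates on a single coordinate track, so in practice it would be applied per joint and per axis before normalization and windowing.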

Let $q = [u, v, d, 1]^{T}$ denote an image–depth triplet for a landmark at pixel $(u, v)$ with monocular depth $d$, and let $Q = [X, Y, Z, 1]^{T}$ be the corresponding world coordinate. We model the image–world mapping in homogeneous coordinates as $Q \approx W q$, with $W \in \mathbb{R}^{4 \times 4}$. To estimate $W$, we collect $N$ reference correspondences $\{(q_{i}, Q_{i})\}_{i=1}^{N}$ using K = 11 fixed control points with known world coordinates measured by the 3D scanner and their associated $(u, v, d)$ samples from the RGB–depth pair (Figure 2). We then solve the following least-squares problem via SVD, with simple residual-based outlier rejection (and optional Tikhonov regularization for numerical stability):

$$\hat{W} = \arg\min_{W} \sum_{i=1}^{N} \| Q_{i} - W q_{i} \|_{2}^{2}$$

(1)

Because monocular depth is relative, we align absolute scale using the measured floor plane at $Y_{\mathrm{floor}} = -2.70$ m and compute neck–floor clearance as $H_{t} = Y_{t} - Y_{\mathrm{floor}}$ for the symbolic rules in Section 3.3.2. Applying $\hat{W}$ to each landmark triplet yields $Q_{t}^{(j)}$, from which we derive physically interpretable indicators such as absolute neck height and vertical descent rate.
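The least-squares fit in Equation (1) and the clearance computation can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions: the function names, the synthetic transform `W_true`, and the randomly generated control points are all made up for the example, and the residual-based outlier rejection and Tikhonov regularization mentioned in the text are omitted.

```python
import numpy as np

Y_FLOOR = -2.70  # measured floor plane in the room coordinate system (metres)


def fit_image_to_world(q: np.ndarray, Q: np.ndarray) -> np.ndarray:
    """Least-squares estimate of W in Q ~ W q.

    q, Q: arrays of shape (N, 4) whose rows are the homogeneous points
    [u, v, d, 1] and [X, Y, Z, 1].
    """
    # In row form, Q_i = W q_i becomes Q = q W^T, so lstsq solves for W^T.
    W_t, *_ = np.linalg.lstsq(q, Q, rcond=None)
    return W_t.T


def neck_clearance(W: np.ndarray, u: float, v: float, d: float) -> float:
    """Map one neck sample to world coordinates and return H = Y - Y_floor."""
    X, Y, Z, _ = W @ np.array([u, v, d, 1.0])
    return float(Y - Y_FLOOR)


# Synthetic check with a made-up transform and 11 made-up control points.
rng = np.random.default_rng(0)
W_true = np.array([
    [3.3, 0.0, 0.0, -1.65],
    [0.0, -2.7, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
q_ref = np.column_stack([
    rng.uniform(0.0, 1.0, 11),   # u (normalized image x)
    rng.uniform(0.0, 1.0, 11),   # v (normalized image y)
    rng.uniform(0.5, 5.0, 11),   # d (relative depth)
    np.ones(11),
])
Q_ref = q_ref @ W_true.T         # noise-free world coordinates
W_hat = fit_image_to_world(q_ref, Q_ref)
```

With noise-free correspondences the fit recovers `W_true` up to floating-point error; with real measurements the residuals of this fit are what an outlier-rejection step would inspect.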

3.3. Neuro-Symbolic Fall Determination

3.3.1. Neural Fall Detection Module

The neural module infers fall likelihood from two-dimensional skeleton sequences extracted by RTMO. Each sample consists of T consecutive frames with J = 14 joints per frame.

$$X = \{ (x_{t,j}, y_{t,j})^{T} \in \mathbb{R}^{2} \mid t = 1, \dots, T, \; j = 1, \dots, J \}$$

(2)

First, the two-dimensional coordinates for each joint are $(x_{t,j}, y_{t,j})$, and the frame-to-frame displacement is $(\Delta x_{t,j}, \Delta y_{t,j})$, computed by first-order temporal differencing as in Equation (3).

$$(\Delta x_{t,j}, \Delta y_{t,j})^{T} = \begin{cases} (x_{t,j} - x_{t-1,j}, \; y_{t,j} - y_{t-1,j})^{T}, & t \geq 2 \\ (0, 0)^{T}, & t = 1 \end{cases}$$

(3)

From these quantities, two input tensors are formed: the coordinate stream $X = \{(x_{t,j}, y_{t,j})\}$ and the motion stream $\Delta X = \{(\Delta x_{t,j}, \Delta y_{t,j})\}$. Both streams are embedded on the same spatio-temporal graph.
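The motion stream of Equation (3) amounts to a per-joint first-order difference with a zero first frame. A minimal sketch, with a hypothetical function name and a plain nested-list layout chosen only for illustration:

```python
from typing import List, Tuple

Point = Tuple[float, float]
Clip = List[List[Point]]  # clip[t][j] = (x, y) of joint j at frame t


def motion_stream(coords: Clip) -> Clip:
    """First-order temporal differences; the first frame is set to (0, 0)."""
    if not coords:
        return []
    motion: Clip = [[(0.0, 0.0) for _ in coords[0]]]
    for prev, curr in zip(coords, coords[1:]):
        motion.append([(x - px, y - py) for (x, y), (px, py) in zip(curr, prev)])
    return motion
```

The output has the same shape as the input, so the coordinate and motion streams can share one graph layout.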

The graph $G$ consists of nodes $V$ and edges $E$, where nodes index joint–frame pairs and edges encode anatomical connectivity within each frame and temporal continuity across frames (Equation (4)).

$$G = (V, E_{\mathrm{sp}}, E_{\mathrm{tm}}), \quad V = \{(t, j) \mid t = 1, \dots, T, \; j \in J\}, \quad E_{\mathrm{sp}} = \{((t, j), (t, k)) \mid t = 1, \dots, T, \; (j, k) \in E_{\mathrm{body}}\}, \quad E_{\mathrm{tm}} = \{((t-1, j), (t, j)) \mid t = 2, \dots, T, \; j \in J\}$$

(4)
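To make the node and edge sets of Equation (4) concrete, the sketch below enumerates them for a short clip. The bone list `BODY_EDGES` is a hypothetical connectivity for a 14-landmark skeleton, since the joint ordering and links are not listed in the text; only the counting logic follows the equation.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (frame index t, joint index j)
Edge = Tuple[Node, Node]

# Hypothetical bone list for a 14-landmark skeleton (index 13 = neck);
# the actual joint ordering and connectivity are not given in the text.
BODY_EDGES: List[Tuple[int, int]] = [
    (0, 13), (13, 1), (13, 2), (1, 3), (3, 5), (2, 4), (4, 6),
    (13, 7), (13, 8), (7, 9), (9, 11), (8, 10), (10, 12),
]


def build_graph(num_frames: int, num_joints: int = 14):
    """Return (V, E_sp, E_tm) following Equation (4), with 1-based frames."""
    frames = range(1, num_frames + 1)
    nodes: List[Node] = [(t, j) for t in frames for j in range(num_joints)]
    spatial: List[Edge] = [((t, j), (t, k)) for t in frames for j, k in BODY_EDGES]
    temporal: List[Edge] = [((t - 1, j), (t, j))
                            for t in range(2, num_frames + 1)
                            for j in range(num_joints)]
    return nodes, spatial, temporal
```

For T frames and J joints this gives T·J nodes, T·|E_body| spatial edges, and (T − 1)·J temporal edges.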

The ST-GCN model applies spatio-temporal graph convolutions to each stream to extract posture and motion cues, and then fuses these into a unified representation (Equation (5)).

$$h_{\mathrm{sp}} = g_{\mathrm{sp}}(\{x_{t,j}, y_{t,j}\}; G), \quad h_{\mathrm{tm}} = g_{\mathrm{tm}}(\{\Delta x_{t,j}, \Delta y_{t,j}\}; G), \quad h = \phi(h_{\mathrm{sp}} \,\|\, h_{\mathrm{tm}}),$$

(5)

where $\|$ denotes feature concatenation and $\phi$ is the fusion projection.

Finally, the classifier outputs the fall probability $p_{N}$; the neural decision $\hat{N}_{t}$ is obtained by applying a probability threshold $\tau_{N}$ and enforcing short-window temporal consistency controlled by $\tau_{T}$ (Equation (6)).

$$p_{N} = \sigma(w^{T} h + b), \quad N_{t} = \mathbb{1}[p_{N} \geq \tau_{N}], \quad \hat{N}_{t} = \mathbb{1}\left[\frac{1}{L_{N}} \sum_{k=0}^{L_{N}-1} N_{t-k} \geq \tau_{T}\right],$$

(6)

where $\sigma$ denotes the sigmoid function, $\tau_{N}$ is the classification threshold, $L_{N}$ is the temporal window length, $\tau_{T}$ is the temporal consistency threshold, and $\mathbb{1}[\cdot]$ is the indicator function.

The ST-GCN offers high parameter efficiency and models spatio-temporal transitions explicitly, making it well-suited to real-time surveillance scenarios [29,31].
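The thresholding and windowed consistency check of Equation (6) can be sketched as below. The function names are hypothetical, the logits stand in for the classifier pre-activation, and treating frames before the start of the sequence as negatives is an assumption, because the text does not say how the first frames of a sequence are handled.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def temporal_consistency(flags: Sequence[int], window: int, tau: float) -> List[int]:
    """Flag frame t when the mean of the last `window` flags reaches tau.

    Assumption: frames before the start of the sequence count as 0.
    """
    out = []
    for t in range(len(flags)):
        recent = flags[max(0, t - window + 1): t + 1]
        out.append(1 if sum(recent) / window >= tau else 0)
    return out


def neural_decisions(logits: Sequence[float], tau_n: float,
                     window: int, tau_t: float) -> List[int]:
    """Per-frame threshold on the fall probability, then windowed consistency."""
    raw = [1 if sigmoid(z) >= tau_n else 0 for z in logits]
    return temporal_consistency(raw, window, tau_t)
```

For example, with illustrative values `tau_n=0.5`, `window=3`, and `tau_t=0.6`, a single positive frame surrounded by negatives is not enough to raise the smoothed decision.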

3.3.2. Symbolic Fall Detection Module

The symbolic module performs rule-based adjudication using the neck height $Y_{t}$ in the world coordinate system recovered during the image–world coordinate recovery stage. The neck landmark is defined as the midpoint of the left and right shoulders in two-dimensional image space $(x, y)$; its associated monocular depth $d$ is then fused, and the triplet $(x, y, d)$ is mapped to real-world coordinates $(X, Y, Z)$ via the transformation matrix $W$ described earlier (Equation (7)). The vertical component $Y_{t}$ extracted from $(X, Y, Z)$, expressed as the neck–floor clearance $H_{t}$ defined in Section 3.2, serves as the primary physical indicator for the symbolic decision rule.

$$\begin{bmatrix} X_{t} \\ Y_{t} \\ Z_{t} \\ 1 \end{bmatrix} = W \begin{bmatrix} x_{t} \\ y_{t} \\ d_{t} \\ 1 \end{bmatrix}, \quad H_{t} = Y_{t} - Y_{\mathrm{floor}}$$

(7)

The symbolic decision rule comprises two conditions: (i) a floor proximity constraint based on absolute neck height, and (ii) a vertical descent rate constraint capturing rapid downward motion. The default policy requires both conditions to be satisfied simultaneously, as formalized in Equations (8) and (9).

$$C_{t}^{\mathrm{height}} = \mathbb{1}[H_{t} \leq T_{Y}]$$

(8)

$$v_{t} = \frac{H_{t} - H_{t - \Delta t}}{\Delta t}, \quad C_{t}^{\mathrm{vel}} = \mathbb{1}[v_{t} \leq -\theta]$$

(9)

where $v_{t}$ denotes the vertical velocity, $T_{Y}$ is the height threshold, $\theta$ is the velocity threshold, and $\Delta t$ is the inter-frame interval.

To mitigate frame-level fluctuations, a short-window temporal consistency check is applied (Equation (10)), yielding the symbolic module’s final decision $\hat{S}_{t}$.

$$S_{t} = \mathbb{1}[C_{t}^{\mathrm{height}} \land C_{t}^{\mathrm{vel}}], \quad \hat{S}_{t} = \mathbb{1}\left[\frac{1}{L_{S}} \sum_{k=0}^{L_{S}-1} S_{t-k} \geq \tau_{S}\right]$$

(10)

Here, $L_{S}$ is the temporal window length, $\tau_{S}$ is the within-window consistency threshold, and $\mathbb{1}[\cdot]$ denotes the indicator function.

In this process, $T_{Y}$ was set to a value that robustly separates floor proximity in the height distribution observed within fall segments, and $\theta$ was chosen from the lower tail of the vertical velocity distribution. In addition, $\Delta t$, $L_{S}$, and $\tau_{S}$ were selected on a validation set, taking into account the frame rate and the trade-off between false positives and false negatives.
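The rule set in Equations (8)–(10) can be sketched as follows for a single person's neck-clearance sequence. The function names are hypothetical; the velocity threshold, lag, window length, and consistency threshold are passed in as arguments because their values are not stated here, and treating the first frames (where no velocity estimate exists) as non-triggers is an assumption.

```python
from typing import List, Sequence

T_Y = 0.19  # height threshold in metres (validation value reported in Section 4.2)


def symbolic_raw(heights: Sequence[float], theta: float, lag: int = 1,
                 fps: float = 30.0, t_y: float = T_Y) -> List[int]:
    """Per-frame S_t: neck near the floor AND descending faster than theta (m/s)."""
    dt = lag / fps  # time between the two height samples, in seconds
    flags = []
    for t, h in enumerate(heights):
        if t < lag:
            flags.append(0)  # Assumption: no velocity estimate yet, so no trigger.
            continue
        v = (h - heights[t - lag]) / dt
        near_floor = h <= t_y
        fast_descent = v <= -theta
        flags.append(1 if near_floor and fast_descent else 0)
    return flags


def symbolic_decisions(heights: Sequence[float], theta: float, window: int,
                       tau_s: float, lag: int = 1, fps: float = 30.0) -> List[int]:
    """Windowed consistency over S_t, giving the final symbolic decision."""
    raw = symbolic_raw(heights, theta, lag, fps)
    out = []
    for t in range(len(raw)):
        recent = raw[max(0, t - window + 1): t + 1]
        # Assumption: frames before the start of the sequence count as 0.
        out.append(1 if sum(recent) / window >= tau_s else 0)
    return out
```

Because both conditions must hold in the same frame, a person already lying still on the floor does not trigger the raw rule; only the frames in which the neck is both low and still descending do.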

3.3.3. Neuro-Symbolic Fall Determination

The third stage fuses the neural output $\hat{N}_{t}$ and the symbolic output $\hat{S}_{t}$ to produce the final per-frame decision $C_{t}$. We adopt a two-step hybrid scheme: (i) a high-recall logical OR to capture either cue, followed by (ii) a short-term consistency check that suppresses isolated neural positives lacking any symbolic support within a recent window.

First, we apply a high-recall logical OR so that a frame is provisionally labeled as a fall if either the neural detector or the symbolic rules indicate a fall. This choice prioritizes sensitivity by ensuring that cues captured by one stream are not discarded when the other is uncertain.

$$C_{t}^{(0)} = \hat{N}_{t} \lor \hat{S}_{t}$$

(11)

Second, we enforce short-term temporal consistency to reduce transient false positives from the neural stream. For each frame that the neural detector marks as positive, we inspect the most recent window of $T$ frames in the symbolic stream and require the presence of at least one symbolic trigger, namely floor proximity or vertical descent.

$$\mathrm{roll}(t) = \sum_{i=t-T+1}^{t} \hat{S}_{i}, \quad t \geq T$$

(12)

If no symbolic support is observed within the window, the provisional label is overridden to negative; otherwise, the OR result is retained.

$$C_{t} = \begin{cases} 0, & \text{if } \hat{N}_{t} = 1 \text{ and } \mathrm{roll}(t) = 0, \\ C_{t}^{(0)}, & \text{otherwise.} \end{cases}$$

(13)

This design preserves the recall of the initial OR fusion while improving precision by suppressing isolated spikes that lack physically interpretable corroboration.
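The two-step fusion in Equations (11)–(13) can be sketched for one sequence of binary decisions as follows. The function name is hypothetical, and summing over the frames seen so far when fewer than T frames are available is an assumption, since Equation (12) is defined only for t ≥ T.

```python
from typing import List, Sequence


def fuse(neural: Sequence[int], symbolic: Sequence[int], window: int) -> List[int]:
    """Logical OR followed by the symbolic-support override."""
    fused = []
    for t, (n, s) in enumerate(zip(neural, symbolic)):
        provisional = 1 if (n or s) else 0                    # Equation (11)
        # Assumption: for t < window the sum runs over the frames seen so far.
        roll = sum(symbolic[max(0, t - window + 1): t + 1])   # Equation (12)
        if n == 1 and roll == 0:                              # Equation (13)
            fused.append(0)
        else:
            fused.append(provisional)
    return fused
```

Read this way, a frame ends up positive either when the symbolic stream fires at that frame, or when the neural stream fires and at least one symbolic positive lies within the window.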

4. Experimental Setup and Evaluation

This section evaluates the proposed neuro-symbolic fall detection method that combines monocular depth estimation with skeleton cues. In a testbed designed to emulate real surveillance conditions, we compare three configurations: a neural detector, a symbolic detector, and their neuro-symbolic fusion. The neural detector is instantiated as an ST-GCN that consumes two-stream skeleton sequences extracted by RTMO [29,31]. Performance is assessed using standard quantitative metrics—accuracy, precision, recall, and F1 score—and complemented by qualitative analysis of false positives and false negatives to characterize error modes and examine the complementarity of the two components.

4.1. Experimental Environment

Evaluation was conducted in an indoor testbed measuring 3.3 m by 3.3 m with a ceiling height of 2.7 m. A single RGB camera was mounted at the ceiling corner and pitched downward by approximately 30 degrees so that most of the room fell within a single field of view. The mounting angle and field of view were configured to fully cover areas where falls are likely to occur. All surveillance videos were de-identified prior to analysis by masking facial regions to prevent recognition. No attempt at re-identification was made, and only de-identified frames were retained for modeling and evaluation. To enable accurate recovery from monocular depth to real-world coordinates, eleven fixed control points uniformly distributed on the image plane were selected. Ground-truth room dimensions were obtained using a 3D scanner, and these measurements were used to align image and world coordinates. Figure 3 illustrates (a) the camera viewpoint in the surveillance setting, (b) the spatial layout of control points for coordinate recovery, and (c) the 3D scan of the actual surveillance space.

4.2. Threshold Determination and Temporal Windowing

Thresholds were selected to balance precision–recall trade-offs and to enable replication. For floor proximity, we analyze the neck–floor clearance $H_{t}$, defined as the vertical distance between the neck and the floor plane in world coordinates. The floor plane was measured at $Y_{\mathrm{floor}} = -2.70$ m in our room coordinate system; to avoid sign ambiguity, we report distributions in terms of $H_{t} \geq 0$ rather than raw $Y$. During confirmed fall intervals, $H_{t}$ exhibited a mean of approximately 0.13 m and a standard deviation of approximately 0.09 m. The threshold $T_{Y}$ was determined by validation, selecting the value that maximized performance over a plausible range (Figure 4). The operating point $T_{Y} = 0.19$ m yielded the best overall performance on the validation set. By design, increasing $T_{Y}$ relaxes the proximity criterion and tends to increase recall at the expense of precision, whereas decreasing $T_{Y}$ makes the rule more conservative and tends to increase precision while reducing recall.
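A validation sweep of this kind can be sketched as below. This is only an illustration under stated assumptions: the selection criterion is taken to be the F1 score (the text says only "performance"), the rule is reduced to the height condition alone, and the function names and toy data are hypothetical.

```python
from typing import Sequence


def f1_score(pred: Sequence[int], truth: Sequence[int]) -> float:
    tp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(pred, truth) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(pred, truth) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_height_threshold(heights: Sequence[float], labels: Sequence[int],
                          candidates: Sequence[float]) -> float:
    """Pick the candidate T_Y that maximizes F1 for the height-only rule."""
    def score(t_y: float) -> float:
        return f1_score([1 if h <= t_y else 0 for h in heights], labels)
    return max(candidates, key=score)
```

A low candidate misses falls (lower recall), while a high one starts flagging seated or crouching frames (lower precision), which is the trade-off described above.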

For vertical descent, the threshold $\theta$ was set from the empirical distribution of frame-to-frame height change over interval $\Delta t$ (fixed by the frame rate), targeting reliable detection of rapid downward motion observed in real falls. The temporal consistency window $T$ specifies the duration over which symbolic cues must persist; considering that falls typically unfold within one to two seconds and our videos were recorded at 30 frames per second, we set $T = 80$ frames (about 2.7 s), which effectively suppresses isolated false positives while preserving true fall episodes.

4.3. Datasets and Labeling Protocol

The study employs two datasets: a simulation-based dataset for training and a real-world dataset for evaluation. For training, a game-engine simulation was constructed to replicate the geometry of a detention room. A single avatar executed 41 scenarios comprising 15 fall cases and 26 non-fall activities. This yielded 300 fall videos and 200 non-fall videos, for a total of 500 clips and 18,476 frames. A detailed composition of the training set is provided in Table 1.

The test dataset comprises multi-person videos recorded in a real surveillance environment. Each scene includes five individuals appearing simultaneously. For every fall scenario, exactly one of the five subjects performs a fall while the others execute non-fall activities. All recordings were de-identified prior to analysis by masking facial regions and removing potentially identifying artifacts, and only de-identified frames were retained for annotation and reporting. Frame-level labels were assigned using an any-positive rule: if at least one subject is falling in a frame, the frame is labeled as fall; otherwise, it is labeled as non-fall. In total, thirteen scenarios were defined, consisting of ten fall scenarios and three non-fall scenarios, as summarized in Table 2. Final evaluation was conducted on fifty videos, comprising thirty-six fall videos and fourteen non-fall videos, with details provided in Table 3.
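The any-positive labeling rule reduces to a logical OR over subjects in each frame. The sketch below assumes per-subject boolean annotations are available; the function name and the toy frames are hypothetical.

```python
def any_positive_label(subject_flags):
    """Frame label under the any-positive rule: fall if any subject is falling."""
    return any(subject_flags)


# One row per frame, one boolean per subject (five subjects per scene).
frames = [
    [False, False, False, False, False],
    [False, True, False, False, False],
    [False, True, False, False, False],
    [False, False, False, False, False],
]
labels = ["fall" if any_positive_label(f) else "non-fall" for f in frames]
```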

4.4. Evaluation Protocol and Metrics

The ST-GCN model was trained for 50 epochs with a batch size of 256 using the Adadelta optimizer. The simulator-based dataset was split into 80% for training and 20% for validation while preserving the scenario distribution. On the validation set, we fixed the classification threshold and the temporal consistency window length for the neural module, and the height and descent-rate thresholds for the symbolic module; these parameters were then applied identically in all subsequent tests.
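One way to obtain an 80/20 split that preserves the scenario distribution is to split within each scenario. The sketch below is an assumption about the procedure, not a description of the authors' tooling; the clip identifiers, the fixed seed, and the equal number of clips per scenario are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(clips, train_fraction=0.8, seed=0):
    """Split (clip_id, scenario) pairs so each scenario keeps the same ratio."""
    by_scenario = defaultdict(list)
    for clip_id, scenario in clips:
        by_scenario[scenario].append(clip_id)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, val = [], []
    for scenario in sorted(by_scenario):
        ids = by_scenario[scenario]
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        val.extend(ids[cut:])
    return train, val


# Hypothetical clip list: 3 scenarios with 10 clips each.
clips = [(f"s{s}_c{c}", f"scenario_{s}") for s in range(3) for c in range(10)]
train_ids, val_ids = stratified_split(clips)
```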

Evaluation on the real surveillance test set was conducted at the frame level using the any-positive rule. The confusion matrix comprises true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), where TP denotes correctly identified fall frames, TN correctly identified non-fall frames, FP non-fall frames predicted as fall, and FN fall frames predicted as non-fall. We report Accuracy, Precision, Recall, F1 score, and Matthews correlation coefficient (MCC), computed as:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{14}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{15}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{16}$$

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{17}$$

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{18}$$
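Equations (14) to (18) translate directly into code. The sketch below is illustrative: the function name is hypothetical, returning 0.0 for an undefined ratio is an assumed convention, and the example counts are made up rather than taken from the paper's confusion matrices.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1 and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Made-up counts for illustration; not taken from the paper's results.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Unlike accuracy, MCC uses all four cells symmetrically, so it stays informative when fall and non-fall frames are imbalanced.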

5. Results

5.1. Performance Characteristics of Fall Detection Methods

Figure 5 summarizes frame-level confusion matrices for the three detectors under the any-positive rule, enabling a direct comparison of error profiles (TP, TN, FP, FN) on the real surveillance test set. Figure 5a presents the confusion matrix for the neural fall detector. The method correctly identified 917 fall frames, yet exhibited relatively high FP counts (non-fall frames predicted as fall) and FN counts (fall frames missed). This pattern indicates that the ST-GCN-based approach, although effective at learning discriminative spatio-temporal patterns for fall recognition, can confuse visually similar non-fall actions with falls, leading to elevated misclassification in complex scenes. Figure 5b presents the confusion matrix for the symbolic fall detector. The method produced only one FP across the test set, that is, it almost never labeled a non-fall frame as a fall. This indicates a low propensity for false alarms in real surveillance conditions and suggests that the symbolic criteria provide conservative decisions for non-fall behavior. Figure 5c presents the confusion matrix for the proposed neuro-symbolic detector. Compared with the neural-only model, FP decreased and TN increased, while FN also decreased and TP increased. Relative to the symbolic detector, FP rose modestly, but FN dropped markedly and TP increased substantially. Overall, the hybrid yields a more balanced error profile, reducing missed falls while maintaining high specificity, and thus provides more reliable fall detection in real surveillance conditions.

5.2. Comparative Performance Across Detection Methods

To assess the effectiveness of the proposed neuro-symbolic framework, we conducted a comparative evaluation of three detectors: a neural model, a symbolic model, and their neuro-symbolic integration. Quantitative performance was measured using accuracy, precision, recall, F1 score, and MCC, followed by an analysis of the limitations and complementary strengths of each method. The comparative results are summarized in Table 4.

Using ST-GCN, the neural fall detector learned discriminative spatio-temporal patterns of posture and motion, yielding an accuracy of 0.81 and an F1 score of 0.64. However, non-fall actions that mimic rapid downward motion, such as abrupt bending, triggered frequent false alarms, which aligns with the comparatively modest precision of 0.68 and reflects reliance on learned spatio-temporal cues without explicit physical constraints. The symbolic detector, driven by floor proximity and vertical descent cues, achieved high precision (0.99) by clearly rejecting non-fall frames but missed most true falls (recall 0.21), indicating limited standalone suitability for safety-critical deployment. The proposed neuro-symbolic fusion substantially improved overall performance by combining physically interpretable rules with a sensitive neural detector. The fused system attained the highest accuracy, F1 score, and MCC among the three variants (accuracy 0.88; F1 score 0.76; MCC 0.70), with gains attributable to complementary error profiles: symbolic rules suppress false positives produced by the neural stream on non-fall frames, whereas the neural stream recovers true falls that rule-based logic alone would miss.

A radar plot of the main evaluation metrics (Figure 6) illustrates this effect: the neuro-symbolic model exhibits the most balanced profile across accuracy, precision, recall, and F1-score, indicating an improvement that is not confined to a single metric but reflected consistently across measures. Overall, in real-time surveillance—where balancing precision and recall is critical—the proposed neuro-symbolic fusion provides the most favorable trade-off for fall detection. These results indicate that the approach is well-suited for deployment and can maintain high reliability and accuracy in practical settings.

5.3. Method-Wise Error Analysis

Each fall detection method exhibits characteristic failure modes in specific situations—manifesting as false positives and false negatives—and we qualitatively analyzed these cases. Figure 7a illustrates a typical FP from the neural fall detection method. The detector often misclassifies seated-to-stand transitions, static sitting, or quiet standing as falls. This tendency appears to arise from reliance on learned sequence patterns in ST-GCN; previously unseen motion configurations or modest kinematic changes can be interpreted as fall-like because the decision is driven primarily by spatio-temporal dynamics. Figure 7b shows a representative FN from the symbolic method. Falls were sometimes labeled as non-fall when the recovered neck height along the vertical axis did not drop below the floor proximity threshold, or when the estimated vertical descent rate failed to exceed the velocity threshold during the evaluation interval. These cases highlight the conservative nature of strict rule-based adjudication. Figure 7c presents the remaining errors under the neuro-symbolic fusion. Although overall performance improved, two error types persisted. First, a fall flagged by the neural stream can be revised to non-fall when symbolic evidence is absent within the short temporal window required by the AND policy. Second, non-fall frames can be labeled as fall when either stream triggers a positive under the OR policy.
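The two residual error types can be made concrete with a small sketch of the AND and OR policies. This is a hypothetical illustration: the function names, the trailing-window form of the AND check, the window length, and the toy decisions are assumptions, and the sketch does not reproduce how the framework selects or combines the policies.

```python
def and_fusion(neural, symbolic, window):
    """Keep a neural positive only if a symbolic positive occurred within
    the trailing `window` frames (including the current frame)."""
    fused = []
    for t, n in enumerate(neural):
        start = max(0, t - window + 1)
        fused.append(n and any(symbolic[start:t + 1]))
    return fused


def or_fusion(neural, symbolic):
    """Flag a frame when either stream is positive."""
    return [n or s for n, s in zip(neural, symbolic)]


# Toy per-frame decisions (illustrative only).
neural = [False, True, True, False, True, False]
symbolic = [False, False, True, False, False, False]

fused_and = and_fusion(neural, symbolic, window=2)
fused_or = or_fusion(neural, symbolic)
```

In this toy sequence, the neural positives at frames 1 and 4 are dropped under AND because no symbolic evidence falls inside the window; if those frames were true falls, they would become missed detections. Under OR, every positive from either stream passes through, so a spurious positive in one stream becomes a false alarm.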

These observations clarify complementary strengths and limitations of the two components and explain how the fusion mechanism operates in practice. Future work will reduce residual errors by quantitatively profiling error modes, tuning thresholds with principled optimization, and exploring context-adaptive weighting between neural and symbolic evidence.

6. Discussion and Future Work

This study proposed a neuro-symbolic fall detection framework that operates on monocular RGB surveillance video by reconstructing three-dimensional cues from depth estimation and combining them with skeleton-based neural inference. The approach targets practical constraints in real deployments, where multiple people appear simultaneously, and occlusion and overlap degrade pose quality. By recovering absolute height and vertical descent rate for a key anatomical point and enforcing explicit, physically interpretable rules, the symbolic component complements a skeleton-driven spatio-temporal neural detector. The resulting fusion reduces both false alarms and missed detections, while providing transparent decision grounds suitable for safety-critical monitoring.

Empirically, the method separated individuals in crowded scenes, stabilized pose estimation through depth-assisted reasoning, and derived three-dimensional joint coordinates for rule construction. Floor proximity and descent-rate criteria captured rapid, physically plausible fall dynamics, whereas the neural stream captured a broader range of posture and motion patterns. Logical fusion capitalized on complementary error profiles: symbolic rules constrained over-sensitive neural responses to non-fall actions, and the neural stream recovered true falls that were marginal with respect to the physical thresholds. These properties indicate suitability for real surveillance scenarios that require both accuracy and explainability.

There are several limitations. First, the evaluation used a real-world dataset collected in a controlled detention room setting rather than a public benchmark, which limits direct quantitative comparison with prior work; at the same time, the dataset reflects deployment conditions such as multi-person presence and frequent occlusion. Second, the system used a single fixed monocular camera and produced frame-level labels under an any-positive rule without explicit subject-level event attribution. Third, depth estimation provided a relative scale that was aligned to world coordinates through a finite set of control points; residual calibration error and scene-specific geometry may bias height and velocity estimates. Fourth, threshold selection for floor proximity and descent rate was based on empirical distributions from the collected data and may require re-tuning across sites with different frame rates, camera poses, or activity profiles. Fifth, the present pipeline does not include a dedicated verification stage for a person lying on the floor, nor an explicit linkage of a prone episode to a preceding fall, both of which are important in CCTV operations.

Future work will address these points along four directions. First, benchmarking and generalization: evaluate on public multi-person datasets and conduct cross-site validation with leave-one-location-out protocols, including domain adaptation for changes in camera pose, illumination, and attire. Second, instance-level reasoning: move from frame-level adjudication to subject-specific fall events with robust tracking, identity maintenance under occlusion, and explicit temporal event boundaries. As part of this, we will add a prone-person module that combines skeleton geometry and depth-derived torso–floor angle with floor-contact evidence and short-term immobility, and an association step that links each prone episode to the most recent fall candidate within a backward window; event-level endpoints will include time-to-verify and precision–recall for fall→prone sequences. Third, model adaptivity and uncertainty: incorporate calibration-free scale recovery where feasible, adopt adaptive or Bayesian thresholding for symbolic rules, and propagate uncertainty from depth and pose streams to modulate fusion. Fourth, system integration: explore multi-camera or stereo configurations for larger spaces, on-device acceleration for edge deployment, and privacy-preserving pipelines that retain depth and skeleton abstractions while minimizing storage of raw imagery.

In summary, the proposed framework closes a gap between high-performing but opaque neural detectors and transparent but brittle rule systems by unifying them around physically meaningful three-dimensional proxies derived from monocular video. This hybrid design advances fall detection toward trustworthy deployment in complex surveillance environments and provides a foundation for broader multi-person anomaly detection.

7. Conclusions

This work proposed a neuro-symbolic fall detection framework for monocular RGB surveillance that reconstructs three-dimensional proxy cues from estimated depth, combines them with spatio-temporal skeleton inference, and fuses neural and rule-based evidence through a logical policy with short-horizon temporal consistency. In a realistic testbed, the fusion method achieved an accuracy of 0.88 and an F1 score of 0.76, improving over either a neural or symbolic module alone by simultaneously lowering false positives and missed detections. The gain derives from complementary error profiles: physically interpretable thresholds for floor proximity and vertical descent suppress spurious positives, while the neural stream recovers true falls near symbolic decision boundaries. By relying only on RGB input yet introducing depth-derived cues, the approach preserves deployment simplicity and enhances interpretability, pointing toward trustworthy fall detection in multi-person, occluded environments and providing a foundation for broader anomaly monitoring.

8. Patents

This work has led to two intellectual property filings. A data-generation methodology has been filed and granted in Korea (Application No. 10-2023-0196838; Registration No. 10-2759464). In addition, a monocular depth-estimation technique relevant to the proposed framework has been filed and is currently pending (Application No. 10-2024-0171181).

Author Contributions

Conceptualization, Y.X., B.K., and J.J.; methodology, Y.X. and B.K.; software, Y.X.; validation, Y.X., B.K., I.-N.W., and J.J.; formal analysis, Y.X.; investigation, Y.X.; resources, J.J.; data curation, Y.X. and B.K.; writing—original draft preparation, Y.X. and B.K.; writing—review and editing, I.-N.W. and J.J.; visualization, Y.X.; supervision, I.-N.W. and J.J.; project administration, J.J.; funding acquisition, J.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Commercialization Promotion Agency for R&D Outcomes (COMPA), funded by the Ministry of Science and ICT (MSIT), Korea, under grant RS-2025-02311988. It was also supported by the Information Technology Research Center (ITRC) program, supervised by the Institute for Information & Communications Technology Planning & Evaluation (IITP), and funded by MSIT, under grant IITP-2025-RS-2020-II201789. In addition, this work was supported by the Regional Innovation System & Education (RISE) program through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government, under grant 2025-RISE-01-007-04.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki. Ethical review and approval were waived for this study by the Institutional Review Board of Dongguk University (Project identification code: DUIRB2025-12-03(F), date of exemption: 9 January 2026) because the research was commissioned by a national agency for the public interest and utilized fully anonymized and de-identified data, ensuring that no personal identifiable information was accessible to the researchers.

Informed Consent Statement

Written informed consent was obtained from all participants for study participation and for the publication of study images. All images and video frames were de-identified by masking facial regions prior to analysis and publication.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (GPT-5 Thinking, OpenAI) for English-language editing and stylistic polishing. The tool was not used for study design, data collection, analysis, or interpretation. The authors have reviewed and edited all AI-assisted text and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ST-GCN	Spatio-Temporal Graph Convolutional Network
CCTV	Closed-Circuit Television
TN	True Negatives
TP	True Positives
FN	False Negatives
FP	False Positives
MCC	Matthews Correlation Coefficient


Figure 1. Overview of the proposed neuro-symbolic fall detection framework.

Figure 2. Skeleton landmark definition used in this study.

Figure 3. (a) Example camera viewpoint in the surveillance environment; (b) spatial arrangement of control points for image–world coordinate recovery; (c) 3D scan of the deployment space.

Figure 4. Effect of the floor proximity threshold ($T_{Y}$) on fall detection performance.

Figure 5. Confusion matrices for (a) the neural fall detector, (b) the symbolic fall detector, and (c) the neuro-symbolic fall detector.

Figure 6. Radar chart comparing performance metrics (accuracy, precision, recall, and F1 score) across the three fall detection approaches (neural, symbolic, and neuro-symbolic).

Figure 7. Representative error cases by approach: (a) neural detector, (b) symbolic detector, and (c) neuro-symbolic fusion. Each panel illustrates a typical scene configuration that triggers misclassification.

Table 1. Composition of the simulator-based training dataset (units: count).

Category	Videos	Total Frames
Fall	300	10,835
Non-fall	200	7641
Total	500	18,476

Table 2. Behavior scenarios in the real surveillance environment.

Category	Scenario
Fall	Forward collapse
	Backward collapse
	Lateral collapse
	Fall after sudden chest pain while walking
	Trip-induced forward fall
	Fight: one subject falls forward
	Fight: one subject falls backward
	Fall after acute head pain while clutching the head
	Shoulder bump while walking leading to a fall
	Slip-induced fall on the floor
Non-fall	Sitting
	Standing
	Walking

Table 3. Composition of the real surveillance test dataset (units: count).

Category	Videos	Total Frames
Fall	36	2590
Non-fall	14	2804
Total	50	5394

Table 4. Comparative performance of fall detection methods on the real surveillance test set.

Method	Accuracy	Precision	Recall	F1-Score	MCC
Neural	0.81	0.68	0.60	0.64	0.51
Symbolic	0.77	0.99	0.21	0.35	0.40
Neuro-symbolic	0.88	0.88	0.65	0.76	0.70

Boldface indicates the highest value in each column.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
