1. Introduction
Recent large-scale crowd disasters, such as the Itaewon tragedy in Korea, highlight the urgent necessity of proactive crowd behavior analysis and advanced surveillance solutions in modern cities [1,2]. Comparable accidents have been reported globally—for instance, the 2015 Shanghai Bund stampede in China, which caused 36 deaths [3]; the 2013 Allahabad Kumbh Mela incident in India, leaving more than 36 fatalities and over 100 injuries [4]; and multiple Hajj-related stampedes in the Middle East, with hundreds of victims due to uncontrolled surges [5]. These examples illustrate the worldwide demand for intelligent monitoring systems that can provide real-time detection, timely alerts, and early intervention to mitigate casualties [6].
Detecting anomalies in crowd dynamics requires the capability to capture complex temporal dependencies and to identify irregular behavioral patterns across heterogeneous inputs such as CCTV footage and mobility sensor data [7,8,9]. Recurrent neural networks (RNNs) [10,11], particularly Long Short-Term Memory (LSTM) architectures, have shown effectiveness in time-series analysis tasks, including activity recognition [12], traffic flow prediction [13], and video-based anomaly detection [14]. Nevertheless, traditional LSTM models face challenges in retaining information over long horizons, especially in highly dynamic and disordered urban environments. To overcome these drawbacks, researchers have introduced improved architectures. ConvLSTM [15], for example, incorporates convolutional operations to learn spatial-temporal correlations in sequential data, enhancing its applicability to video analysis. More recently, Transformer models [16] with self-attention mechanisms have achieved superior performance in sequence learning; however, they demand substantial computational resources and large datasets, and their interpretability in safety-critical applications remains debatable.
At the same time, advances in neuroscience-inspired computing have introduced models such as Episodic Memory Networks [17], Memory-Augmented Neural Networks [18], and Neural Turing Machines [7,19], which are inspired by hippocampal functions in the human brain [20,21]. These models demonstrate the utility of episodic memory for capturing and recalling contextual sequences [22]. Despite these developments, the practical adoption of hippocampal-inspired memory modules for real-world surveillance of crowds remains scarce [23].
Existing approaches still suffer from several limitations:
Scarcity of labeled anomaly datasets, which hampers model generalization.
High infrastructure costs, particularly when deploying multi-sensor or multi-camera systems.
Limited capability for real-time deployment due to the computational overhead of state-of-the-art methods.
Lack of interpretability, which weakens user trust in mission-critical scenarios.
To bridge these gaps, we introduce Hippocampal-Inspired Memory-Enhanced LSTM (HiMeLSTM). HiMeLSTM strengthens the temporal modeling power of LSTMs with a hippocampus-inspired Episodic Memory Unit (EMU) that can store and retrieve contextual episodes over extended durations. By operating on zone-level descriptors extracted from raw CCTV footage, HiMeLSTM is able to model both short-term motion patterns and long-range contextual dependencies that are critical for distinguishing between benign crowd formations and genuinely dangerous behaviors.
The main contributions of this work are summarized as follows:
Memory-Augmented Recurrent Backbone: We propose HiMeLSTM, a hippocampal-inspired memory-enhanced LSTM that integrates an Episodic Memory Unit (EMU) with an LSTM encoder. The EMU stores compact episode representations with spatial and temporal context, enabling long-term recall that goes beyond standard LSTM and ConvLSTM architectures.
Zone-Level Spatial Context Encoding: We introduce a zone-level spatial mapping module that partitions each scene into semantically meaningful regions and aggregates CNN features into zone descriptors. This design provides structured spatial context for crowd behavior and facilitates interpretable analysis at the zone level.
Performance and Interpretability: Experiments on benchmark datasets show that the proposed framework, built on the HiMeLSTM backbone, consistently outperforms conventional LSTM, ConvLSTM, and Transformer baselines in terms of accuracy, anomaly detection rate, and F1-score. At the same time, the EMU and its attention weights offer a transparent mechanism for identifying which past episodes contributed to each decision.
Real-Time Feasibility for CCTV-Based Surveillance: The overall architecture is lightweight and computationally efficient, relying solely on standard visual data with zone-level mapping. This eliminates the need for expensive multi-sensor infrastructure and makes the framework suitable for real-time deployment in existing CCTV-based urban surveillance systems.
3. Proposed Method
Throughout this section, HiMeLSTM denotes the end-to-end crowd anomaly detection pipeline, whereas the HiMeLSTM backbone refers to the core Hippocampal-inspired Memory-enhanced LSTM module that couples an LSTM encoder with the Episodic Memory Unit (EMU). At a high level, HiMeLSTM maps an input CCTV video sequence to a sequence of anomaly scores as follows: (i) a CNN encoder extracts appearance- and motion-sensitive feature maps from each frame; (ii) a zone-level spatial mapping module aggregates these feature maps into descriptors for manually defined zones; (iii) the sequence of zone descriptors is processed by the HiMeLSTM backbone, where the LSTM captures short-term temporal dynamics and the EMU stores and retrieves long-term episodic memories; and (iv) an anomaly detection head combines the current hidden state with the retrieved memory context to produce a scalar anomaly score for each time step. An alert is raised whenever the anomaly score exceeds a learned threshold.
Figure 1 illustrates the overall architecture and data flow. Given an input CCTV video stream, each frame is first processed by a CNN-based encoder to extract appearance- and motion-sensitive feature maps. These feature maps are then partitioned into a set of manually defined spatial zones and pooled into compact zone-level descriptors. The resulting sequence of zone descriptors is fed into the HiMeLSTM backbone, where an LSTM encoder models short-term temporal dependencies and the EMU maintains a compact set of long-term episodic memories summarizing past crowd behaviors. At each time step, the current hidden state interacts with the EMU through an attention-based retrieval mechanism that retrieves the most relevant past episodes and produces a contextualized representation. This representation is finally mapped to an anomaly score, and an alarm is raised when the score exceeds a learned threshold, triggering the event classification and alerting module. Because each memory slot stores both the temporal index and zone-level context, the EMU also provides a natural handle for interpretability: operators can inspect which past episodes, and from when, were most influential in the decision.
3.1. Input Data Modalities
Visual Data: Raw CCTV video (up to 4K resolution) is processed with CNN-based detectors to extract human-centric features such as bounding boxes, posture descriptors, and motion trajectories. Behavioral indicators such as falling, loitering, sudden running, or crowd surges are embedded into the feature vectors.
Spatial Context: Each scene is subdivided into manually pre-defined zones (entrances, exits, bottlenecks). Events are thus contextualized with respect to location, enabling localized anomaly interpretation.
3.2. CNN Encoder and Zone Mapper
The CNN encoder is responsible for extracting low- and mid-level visual features from each input frame. In our implementation, we adopt a lightweight ResNet-style backbone truncated before the global pooling and classification layers. Concretely, each input RGB frame is first resized to a fixed resolution and passed through a stack of convolutional blocks, where each block consists of a 3 × 3 convolution, batch normalization, and ReLU activation, with occasional stride-2 convolutions for spatial downsampling. The final convolutional stage produces a feature map with C channels and a reduced spatial resolution.
$$F_t = \mathrm{CNN}(I_t) \quad (1)$$
where $I_t$ is the video frame at time $t$, and $F_t$ represents extracted features such as posture orientation and movement direction.
The Zone Mapper then projects these feature maps into a set of predefined spatial zones, partitioning the scene into semantic or grid-based regions:
$$Z_t = \mathrm{ZoneMap}(F_t, \mathcal{P}) \quad (2)$$
where $\mathcal{P}$ denotes the spatial partitioning of the frame. This zone-wise representation enables crowd dynamics analysis and interpretability at the level of individual regions without requiring external spatial modeling platforms.
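For concreteness, the following minimal PyTorch sketch shows one way the CNN encoder and Zone Mapper could be realized; the ResNet-18 stand-in, the grid-shaped `zone_masks`, and all dimensions are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ZoneMapper(nn.Module):
    """Pools a CNN feature map into K zone-level descriptors.

    `zone_masks` is a (K, H', W') binary tensor marking which feature-map
    cells belong to each manually defined zone (an assumed encoding).
    """
    def __init__(self, zone_masks: torch.Tensor):
        super().__init__()
        self.register_buffer("masks", zone_masks.float())   # (K, H', W')

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H', W') -> zone descriptors (B, K, C)
        B, C, _, _ = feat.shape
        K = self.masks.shape[0]
        masks = self.masks.view(K, -1)                       # (K, H'*W')
        feat = feat.view(B, C, -1)                           # (B, C, H'*W')
        pooled = torch.einsum("bcn,kn->bkc", feat, masks)    # sum over each zone
        return pooled / masks.sum(dim=1).clamp(min=1.0).view(1, K, 1)

# Lightweight ResNet-18 truncated before global pooling and classification.
backbone = nn.Sequential(*list(models.resnet18(weights=None).children())[:-2])
frame = torch.randn(1, 3, 224, 224)        # resized RGB frame I_t
feat = backbone(frame)                     # F_t: (1, 512, 7, 7)
mapper = ZoneMapper(torch.ones(4, 7, 7))   # K = 4 dummy grid zones
zones = mapper(feat)                       # Z_t: (1, 4, 512)
```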
3.3. LSTM Temporal Encoder
The zone-level feature vectors are passed to the LSTM encoder to capture temporal dynamics:
$$(h_t, c_t) = \mathrm{LSTM}(z_t, h_{t-1}, c_{t-1}) \quad (3)$$
where $h_t$ is the hidden state and $c_t$ is the cell state at time $t$. Each hidden state $h_t$ becomes a temporal key that links current crowd behavior with the EMU for contextual storage and retrieval.
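A minimal sketch of this temporal encoder, assuming the per-frame zone descriptors are flattened into a single input vector; the hidden size and all dimensions are placeholders:

```python
import torch
import torch.nn as nn

K, D, HIDDEN = 4, 512, 256       # zones, descriptor dim, hidden size (assumed)
lstm = nn.LSTM(input_size=K * D, hidden_size=HIDDEN, batch_first=True)

# A batch of zone-descriptor sequences, flattened per time step: (B, T, K*D)
z_seq = torch.randn(8, 32, K * D)
h_seq, (h_T, c_T) = lstm(z_seq)  # h_seq: (8, 32, 256), the hidden states h_t
```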
3.4. Episodic Memory Unit (EMU)
The EMU simulates the hippocampal role in episodic memory formation and recall.
Memory Slots: The EMU maintains a finite set of $M$ memory slots $\{m_i\}_{i=1}^{M}$, where each slot stores a compressed tuple $(h_i, z_i, t_i)$ representing the hidden state, zone descriptor, and time index of a past episode. Unless otherwise stated, we set M = 256 in all experiments, which we found to provide a good balance between contextual capacity and computational cost.
Memory Encoding: A new episode is added if the temporal variation exceeds a threshold $\tau$:
$$\lVert h_t - h_{t-1} \rVert > \tau \quad (4)$$
Here, $h_t$ and $h_{t-1}$ denote the current and previous LSTM hidden states, respectively, and $\tau$ is a scalar threshold that controls how much temporal change must be observed before a new episode is written into memory.
Memory Retrieval: Given the current hidden state $h_t$, the EMU retrieves relevant past episodes through attention-based similarity scoring. We first compute a similarity score $s_i = \mathrm{sim}(h_t, m_i)$ between $h_t$ and each memory slot $m_i$, and then normalize these scores via a softmax to obtain attention weights:
$$\alpha_i = \frac{\exp(s_i)}{\sum_{j=1}^{M} \exp(s_j)} \quad (5)$$
where $\mathrm{sim}(\cdot, \cdot)$ can be cosine or Euclidean similarity.
Contextual Recall: The contextual embedding is computed as a weighted sum over memory slots:
$$r_t = \sum_{i=1}^{M} \alpha_i m_i \quad (6)$$
In this expression, $\alpha_i$ is the attention weight assigned to memory slot $i$, and each $m_i$ is a stored episodic memory vector. Their weighted sum $r_t$ represents the contextual recall that aggregates information from the most relevant past episodes. The recalled vector $r_t$ is then combined with the current hidden state $h_t$ and passed to the anomaly detection head. Figure 1 explicitly depicts these EMU operations (memory slot updates, similarity computation, attention weighting, and contextual recall) to make the internal mechanism more transparent.
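The sketch below summarizes these EMU operations (write gating by temporal change, similarity scoring, softmax attention, and contextual recall); the FIFO slot-replacement policy and the value of $\tau$ are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

class EpisodicMemoryUnit:
    """Minimal EMU sketch: threshold-gated writes and attention-based recall."""
    def __init__(self, num_slots: int = 256, dim: int = 256, tau: float = 0.5):
        self.memory = torch.zeros(num_slots, dim)  # memory slots m_i
        self.tau = tau                             # write threshold (assumed value)
        self.ptr = 0                               # next slot to overwrite (FIFO)
        self.filled = 0                            # number of populated slots

    def maybe_write(self, h_t: torch.Tensor, h_prev: torch.Tensor) -> None:
        # Write a new episode only if temporal variation exceeds tau (Eq. 4).
        if torch.norm(h_t - h_prev) > self.tau:
            self.memory[self.ptr] = h_t.detach()
            self.ptr = (self.ptr + 1) % self.memory.shape[0]
            self.filled = min(self.filled + 1, self.memory.shape[0])

    def recall(self, h_t: torch.Tensor) -> torch.Tensor:
        # Similarity s_i, softmax attention alpha_i (Eq. 5), recall r_t (Eq. 6).
        if self.filled == 0:
            return torch.zeros_like(h_t)
        mem = self.memory[: self.filled]
        scores = F.cosine_similarity(h_t.unsqueeze(0), mem, dim=1)  # (filled,)
        alpha = torch.softmax(scores, dim=0)
        return (alpha.unsqueeze(1) * mem).sum(dim=0)                # r_t
```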
3.5. Anomaly Detection Module
The anomaly detection module transforms the internal representation of HiMeLSTM into a scalar anomaly score for each time step. Intuitively, this score should be high when the current crowd behavior deviates significantly from previously observed normal episodes stored in the EMU.
We compute an anomaly score by comparing the current hidden state $h_t$ with the recalled context $r_t$ from memory. In our implementation, the concatenated vector $[h_t; r_t]$ is passed through a small fully connected layer to produce a scalar score $a_t$. If $a_t > \theta$, where $\theta$ is a learned threshold, the corresponding frame is flagged as anomalous.
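A minimal sketch of this detection head, assuming a sigmoid output and treating the threshold $\theta$ as a free parameter trained with the rest of the network:

```python
import torch
import torch.nn as nn

class AnomalyHead(nn.Module):
    """Maps the concatenation [h_t; r_t] to a scalar anomaly score a_t."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.fc = nn.Linear(2 * dim, 1)
        self.theta = nn.Parameter(torch.tensor(0.5))  # learned threshold (assumed init)

    def forward(self, h_t: torch.Tensor, r_t: torch.Tensor):
        a_t = torch.sigmoid(self.fc(torch.cat([h_t, r_t], dim=-1)))
        return a_t, a_t > self.theta  # score and per-frame anomaly flag
```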
Typical anomalies include sudden increases in crowd density (stampede-like patterns), irregular movement flows such as counter-directional motion, and unusual events like falling or loitering in restricted zones.
3.6. Architectural Details and Output Dimensions
To facilitate reproducibility and clarify the information flow in HiMeLSTM, we summarize here the main tensor shapes produced by each block in Figure 1 using symbolic notation.
Let each RGB input frame be denoted by $I_t \in \mathbb{R}^{3 \times H \times W}$, where $H$ and $W$ are the image height and width, respectively. The CNN encoder transforms $I_t$ into a feature map $F_t \in \mathbb{R}^{C \times H' \times W'}$, where $C$ denotes the number of channels and $H' \times W'$ is the spatial resolution of the feature map.
The scene is partitioned into $K$ non-overlapping spatial zones (e.g., a regular grid or semantically defined regions). For each zone, we perform spatial pooling over the corresponding region in $F_t$, yielding a set of zone descriptors $Z_t = \{z_t^{(1)}, \ldots, z_t^{(K)}\}$, where each $z_t^{(k)} \in \mathbb{R}^{D}$ is a $D$-dimensional feature vector summarizing the local crowd dynamics in zone $k$ at time $t$. The sequence of zone descriptors is then flattened and passed to the LSTM encoder, producing $h_t, c_t \in \mathbb{R}^{d}$, where $d$ denotes the dimensionality of the hidden and cell states. The Episodic Memory Unit (EMU) maintains a set of $M$ memory slots $\{m_i\}_{i=1}^{M}$, where each slot stores a tuple $(h_i, z_i, t_i)$ in compressed form. During retrieval, the similarity between the current hidden state $h_t$ and each memory slot is used to compute attention weights $\alpha_i$, which in turn yield a contextual recall $r_t$. As in Section 3.4, $\alpha_i$ denotes the attention weight for the $i$-th memory slot and $m_i$ is the corresponding memory vector, so that the recalled representation $r_t$ is a weighted combination of past episodic memories, consistent with Equation (6).
The anomaly detection head receives the concatenated representation $[h_t; r_t]$ and outputs a scalar anomaly score $a_t = \sigma(W[h_t; r_t] + b)$, where $W$ and $b$ are learnable parameters and $\sigma$ denotes a suitable activation function. These symbolic shapes are reflected in Figure 1 and ensure that the output dimensions of each block are explicitly defined. Each input frame (1920 × 1080) is processed by a YOLO backbone, yielding features of size 2048 × 7 × 7. The Zone Mapper projects these into $K$ spatial zones, outputting $Z_t \in \mathbb{R}^{K \times D}$.
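To make these symbolic shapes concrete, the following snippet traces the tensor dimensions through the pipeline; the values of $K$, $D$, and $d$ are placeholders, since only the frame resolution and the backbone feature size are fixed in the text.

```python
import torch

B = 1                            # batch size
H, W = 1080, 1920                # input frame resolution (per the text)
C, Hp, Wp = 2048, 7, 7           # backbone feature map size (per the text)
K, D, d, M = 16, 512, 256, 256   # zones / descriptor / hidden dims (K, D, d assumed)

frame = torch.randn(B, 3, H, W)           # I_t
feat = torch.randn(B, C, Hp, Wp)          # F_t = CNN(I_t)
zones = torch.randn(B, K, D)              # Z_t after zone pooling and projection
h_t = torch.randn(B, d)                   # LSTM hidden state
memory = torch.randn(M, d)                # EMU slots m_i
alpha = torch.softmax(h_t @ memory.T, 1)  # attention weights alpha_i: (B, M)
r_t = alpha @ memory                      # contextual recall r_t: (B, d)
head = torch.nn.Linear(2 * d, 1)
a_t = torch.sigmoid(head(torch.cat([h_t, r_t], dim=-1)))  # score a_t: (B, 1)
```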
4. Results and Discussion
4.1. Datasets
We evaluate HiMeLSTM on three benchmark datasets for crowd anomaly detection: UCF-Crime [32], ShanghaiTech Campus [33], and our in-house CrowdSurge-1K [34] dataset.
UCF-Crime (sample frames are shown in the left panels of Figure 2) is a large-scale collection of real-world surveillance videos that has become a standard benchmark for video anomaly detection. It contains 1900 long, untrimmed surveillance videos (approximately 128 h in total) captured by fixed CCTV cameras in diverse outdoor and indoor environments such as streets, parking lots, and shops. The dataset covers 13 types of anomalous events, including fighting, road accidents, burglary, robbery, shooting, shoplifting, and other criminal or dangerous activities, as well as normal scenes. We follow the standard semi-supervised protocol introduced by Sultani et al.: only normal videos are used for training, while the test set contains both normal and anomalous videos with frame-level annotations. In this setting, the model must learn normal patterns and detect anomalies in previously unseen test videos.
The ShanghaiTech Campus dataset (sample frames are shown in the right panels of Figure 2) focuses on anomaly detection in a university campus environment. It consists of 330 normal training videos and 107 test videos captured by multiple static cameras covering campus walkways, building entrances, and open squares at a resolution of 480 × 856. The test set includes both normal activities and a variety of anomalous events such as bicycles or motorbikes entering pedestrian areas, people fighting or chasing each other, jumping, and other irregular behaviors. Each anomalous segment is annotated at the frame level. We strictly follow the official train/test split and evaluation protocol, using only normal training videos for model fitting and computing all metrics on the test set.
CrowdSurge-1K (sample frames are shown in Figure 3) is an in-house dataset designed specifically to capture high-risk crowd situations such as sudden density surges near exits, bottlenecks, and confined spaces. It contains 1000 clips (typically 5–20 s each) recorded from fixed surveillance viewpoints in urban public areas, including transportation hubs, stadium surroundings, and pedestrian streets. The dataset includes both normal crowd flows and a diverse set of surge-like anomalies, such as rapid inflow toward a narrow passage, stagnation followed by sudden movement, and disordered motion patterns that may precede crush or stampede events. We split CrowdSurge-1K into training, validation, and test sets with a ratio of 6:1:3, ensuring that scenes do not overlap between splits to test cross-scene generalization.
In all experiments, input frames are resized to a fixed spatial resolution and sampled at a common frame rate so that temporal sampling is comparable across datasets. The official ground-truth annotations provided with each dataset are used for both training supervision and quantitative evaluation.
4.2. Implementation and Evaluation Metrics
Implementation Details. All models were implemented in Python 3.12 using the PyTorch 2.2.0 framework and trained on a workstation equipped with a single NVIDIA RTX-class GPU, an 8-core CPU, and 32 GB of RAM, running a Linux operating system. Unless otherwise stated, input frames were resized to a fixed spatial resolution and sampled at a constant frame rate. We trained HiMeLSTM for 50 epochs using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴, a batch size of 8 sequences, and a sequence length of 32 frames. The EMU memory size was set to M = 256 slots by default, and early stopping was applied based on validation performance. The model size and FLOPs reported in the efficiency table reflect this final trained configuration.
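A sketch of the corresponding training configuration, with a placeholder module standing in for HiMeLSTM and an assumed early-stopping patience:

```python
import torch
import torch.nn as nn

# Placeholder standing in for the full HiMeLSTM model (see Section 3 sketches).
model = nn.LSTM(input_size=512, hidden_size=256, batch_first=True)

# Settings stated in the text: Adam, lr 1e-4, batch size 8, sequence length 32,
# 50 epochs, M = 256 memory slots, early stopping on validation performance.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
EPOCHS, BATCH_SIZE, SEQ_LEN, MEM_SLOTS = 50, 8, 32, 256

best_val, patience, bad_epochs = float("inf"), 5, 0  # patience value is assumed
for epoch in range(EPOCHS):
    # ... one training pass over batches of shape (BATCH_SIZE, SEQ_LEN, 512) ...
    val_loss = 0.0  # placeholder for the real validation loss
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # early stopping
```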
Evaluation Metrics. To quantitatively assess the performance of HiMeLSTM and the baselines, we employ standard classification and detection metrics commonly used in video anomaly detection: accuracy, precision, recall, F1-score, Receiver Operating Characteristic Area Under the Curve (ROC-AUC), Precision–Recall Area Under the Curve (PR-AUC), and Anomaly Detection Rate (ADR). If $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, respectively, at the chosen decision threshold, accuracy, precision, recall, and F1-score are defined as
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
ROC-AUC and PR-AUC are obtained by varying the decision threshold over the full range of anomaly scores and computing the area under the corresponding ROC and precision–recall curves, respectively.
The Anomaly Detection Rate (ADR) focuses specifically on the correct detection of anomalous events. It is defined as
$$\mathrm{ADR} = \frac{\text{number of correctly detected anomalous samples}}{\text{total number of anomalous samples}}.$$
In our experiments, samples correspond to clips or frames, depending on the granularity of the dataset annotations. We compute ADR by comparing the predicted anomaly labels with the ground-truth labels for all anomalous samples in the test set. This metric directly reflects the model’s sensitivity to true anomalies and complements the threshold-independent ROC-AUC and PR-AUC scores.
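These metrics can be computed, for example, with scikit-learn; the labels and scores below are toy values used purely for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_true = np.array([0, 0, 1, 1, 1, 0])              # ground-truth labels (toy)
scores = np.array([0.1, 0.4, 0.8, 0.3, 0.9, 0.2])  # anomaly scores a_t (toy)
y_pred = (scores > 0.5).astype(int)                # decisions at one threshold

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, scores)            # threshold-independent
pr_auc = average_precision_score(y_true, scores)   # PR-AUC approximation

# ADR: fraction of ground-truth anomalous samples correctly detected.
adr = y_pred[y_true == 1].mean()
```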
4.3. Detection Performance
The primary objective of any anomaly detection system is to achieve high accuracy with a balanced sensitivity to positive cases. As quantitatively summarized in Table 1, HiMeLSTM demonstrates superior performance across all key metrics, establishing a new state of the art for the task of crowd anomaly detection.
As shown in Table 1, conventional LSTM-based models focus on sequential temporal modeling and are effective for capturing short-term dependencies, but they suffer from weak long-term memory retention and limited spatial awareness. ConvLSTM models extend LSTMs by incorporating convolutional operations to model spatiotemporal patterns in video data; however, this improvement comes at the cost of increased computational complexity and difficulty in capturing global contextual information. Transformer-based approaches, leveraging self-attention mechanisms, excel at long-range dependency modeling and often achieve high detection accuracy, but they typically require large-scale datasets, incur high computational overhead, and offer limited interpretability in safety-critical applications.
Neuroscience-inspired architectures, such as Neural Turing Machines (NTM), Memory-Augmented Neural Networks (MANN), and other episodic memory-based models, introduce external memory components that enable contextual retention and biologically plausible reasoning. Despite their conceptual advantages, these methods often involve complex training procedures and have seen limited adoption in real-world crowd surveillance scenarios.
In contrast, the proposed HiMeLSTM integrates an Episodic Memory Unit (EMU) with an LSTM backbone, explicitly addressing the shortcomings of existing approaches. As summarized in Table 1, HiMeLSTM combines strong long-term memory retention with interpretable episodic recall while remaining lightweight enough for real-time deployment. Although it requires careful zone definition and further validation across diverse environments, its design offers a balanced trade-off between performance, interpretability, and computational efficiency.
Quantitatively, HiMeLSTM improves the Anomaly Detection Rate (ADR) by 4.3% over Vanilla LSTM, 2.9% over ConvLSTM, and 4.3% over the Transformer baseline. This performance gain can be directly attributed to the episodic memory mechanism, which enables the model to retrieve and leverage relevant contextual patterns from distant past events. Such capability is particularly important in crowd scenarios, where anomalous behavior is often preceded by subtle and temporally dispersed cues. Furthermore, HiMeLSTM achieves the highest F1-score of 0.89, reflecting an effective balance between precision and recall and underscoring its suitability for real-world surveillance and early-warning applications.
4.4. Computational Efficiency
For real-world deployment, especially on edge devices or systems requiring real-time analysis, computational efficiency is as important as accuracy. A model must be not only accurate but also fast and lightweight.
Table 2 shows the quantitative performance comparison of different crowd anomaly detection models in terms of Accuracy, Anomaly Detection Rate (ADR), and F1-score. Conventional sequential models such as Vanilla LSTM and ConvLSTM show reasonable performance but suffer from limited sensitivity to complex or long-term anomalies. Transformer-based models achieve higher accuracy and ADR by leveraging global attention, yet the improvement is relatively modest considering their higher computational cost.
Reconstruction-based approaches, including MemAE and ST-AE, further improve detection performance by learning compact representations of normal behavior, resulting in higher ADR and F1-scores than recurrent baselines. However, these methods still fall short in capturing long-range temporal dependencies critical for early anomaly detection.
The proposed HiMeLSTM consistently outperforms all compared baselines across all evaluation metrics, achieving the highest accuracy (93.5%), ADR (89.6%), and F1-score (0.89). This performance gain demonstrates the effectiveness of the hippocampal-inspired Episodic Memory Unit in preserving long-term contextual information while maintaining robustness to diverse anomaly patterns. Overall, the results indicate that HiMeLSTM provides a superior balance between detection accuracy and practical deployability for real-time crowd surveillance systems.
Table 3 shows a comparison of the computational efficiency of different crowd anomaly detection models in terms of training cost (GPU-hours), model size, floating-point operations (FLOPs), and inference speed measured in frames per second (FPS) at 1080p resolution. The Vanilla LSTM model exhibits the lowest computational cost and highest inference speed; however, this efficiency comes at the expense of limited detection performance, as shown in the accuracy results.
The Transformer-based model incurs the highest computational overhead, with substantially larger model size, FLOPs, and training cost, resulting in the lowest real-time inference speed. Such characteristics make it less suitable for deployment in resource-constrained or real-time surveillance environments.
The proposed model achieves a favorable trade-off between efficiency and performance. While slightly more computationally demanding than the Vanilla LSTM, it significantly reduces the overhead compared to the Transformer model and maintains a high inference speed of 38.4 FPS. These results indicate that the proposed approach is well suited for real-time crowd anomaly detection on GPU-equipped edge devices, offering both strong detection capability and practical computational efficiency.
These results confirm that HiMeLSTM achieves superior accuracy with moderate computational cost, making it suitable for real-time deployment on resource-constrained devices.
4.5. ROC and Precision-Recall Analysis
To evaluate model robustness beyond fixed-threshold metrics, we analyzed the Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. These analyses assess performance across all possible classification thresholds.
As shown in Figure 4, HiMeLSTM consistently dominates the other methods across the entire operating range. Specifically, HiMeLSTM achieves the highest area under the curve in both evaluations, with a ROC-AUC of 0.95 compared to 0.91 for the Transformer, 0.89 for ConvLSTM, and 0.87 for LSTM. Similarly, in the PR analysis, HiMeLSTM attains a PR-AUC of 0.93, outperforming the Transformer (0.89), ConvLSTM (0.86), and LSTM (0.83).
The superior ROC-AUC indicates that HiMeLSTM has a strong overall ability to discriminate between normal and anomalous crowd behaviors, regardless of the decision threshold. More importantly, the consistently higher PR curve demonstrates that HiMeLSTM maintains high precision even at high recall levels, meaning that it can detect a large proportion of true anomalies while keeping false alarms low. This behavior reflects the benefit of the episodic memory mechanism, which enables robust contextual reasoning over long temporal horizons. Overall, the ROC and PR analyses confirm that HiMeLSTM provides stable and reliable anomaly detection performance under varying operational conditions, making it well suited for real-world crowd monitoring applications.
4.6. Ablation Study
To deconstruct the contribution of each architectural component in HiMeLSTM, we conducted a series of ablation experiments. The results, summarized in Figure 5, validate our design choices.
- (1)
Without Episodic Memory Unit (EMU): Removing the core EMU module led to a significant drop of 3.8% in ADR. This result directly confirms the hypothesis that contextual recall beyond immediate sequences is vital for accurate anomaly detection. The model without EMU degenerates to a more standard architecture, losing its ability to link current observations with past, similar contexts.
- (2)
Without Zone Mapping: Eliminating the zone-wise spatial mapping caused a 2.4% decrease in overall accuracy. This degradation was most pronounced in areas with high crowd density or overlapping trajectories. The result underscores that zone mapping is not merely for interpretability; it provides a structured spatial prior that helps the model disambiguate local patterns, thereby improving feature learning and detection accuracy.
- (3)
Varying Memory Slot Size: We investigated the impact of the EMU’s memory capacity. The experiment revealed a clear trade-off: an insufficient number of slots (128) led to information loss and poor performance as critical past contexts could not be stored, while an excessive number (512) resulted in increased computational overhead and potential overfitting to noise. An optimal balance was found at 256 memory slots, which provided sufficient capacity for context retention without undue computational cost.
In Figure 5, the left sub-figure compares configurations with and without the Episodic Memory Unit (EMU) and zone-level mapping. The right sub-figure shows how accuracy, ADR, and F1-score change as the number of memory slots varies; the three broken lines in that plot correspond to accuracy, ADR, and F1-score, respectively, as indicated in the legend. In the left sub-figure, the three curves represent the different ablation settings:
Top line: Full model (HiMeLSTM with EMU and zone mapping).
Middle line: HiMeLSTM without zone mapping.
Bottom line: HiMeLSTM without the Episodic Memory Unit (EMU).
The results show that removing EMU leads to the largest performance degradation across ADR, Accuracy, and F1-score, while removing zone mapping causes a moderate but consistent decline, confirming the importance of both components.
As summarized in Table 4, removing the EMU leads to a noticeable drop in anomaly detection performance, confirming the importance of long-term episodic recall. Eliminating the zone mapping mainly affects accuracy in dense or overlapping crowd regions, highlighting the role of structured spatial context. Varying the memory capacity reveals that too few slots cause information loss, while too many slots increase computational overhead without clear benefits; 256 slots offer a favorable balance between performance and efficiency.
4.7. Error Case Analysis
Despite its strong performance, HiMeLSTM is not infallible. A qualitative analysis of failure cases provides valuable insights into its limitations and outlines promising avenues for future research. The primary error types are categorized and illustrated in Figure 6.
Error Analysis:
- (1)
False Positives in Structured Crowds: The model occasionally generates false alarms in scenarios involving dense but orderly queues, such as those at ticket counters or security checkpoints. This occurs because the model relies on high crowd density as a key anomaly cue and lacks the finer-grained semantic understanding to differentiate between structured (normal) and unstructured (anomalous) crowding patterns.
- (2)
False Negatives in Subtle or Sparse Events: Conversely, certain rare and subtle anomalies are occasionally missed. These include events like fights within a small group in a sparse zone or an individual’s abrupt change in direction. Such events have a less pronounced effect on global crowd motion patterns and can be overshadowed by dominant, normal activities, making them challenging to detect.
4.8. Interpretability and Memory Visualization
One of the main motivations behind introducing the hippocampal-inspired Episodic Memory Unit (EMU) was to improve the interpretability of crowd anomaly detection. Because each memory slot stores a compact representation of a past episode, together with its spatial context and temporal index, the attention weights used during retrieval can be directly inspected to understand which past events informed the current decision.
For qualitatively selected test cases, we visualize (i) the attention distribution over memory slots, and (ii) the corresponding spatial zones associated with the top-ranked episodes. In typical crowding anomalies, such as sudden density surges near exits, HiMeLSTM assigns high attention to past episodes where local density gradually increased in the same or neighboring zones. This behavior indicates that the model is not reacting to an isolated observation but rather comparing the current pattern to a sequence of contextually similar episodes stored in memory.
In scenarios that lead to false positives—e.g., dense but orderly queues at ticket counters—the attention distribution tends to be more diffuse and often focuses on episodes that share high density but differ in motion regularity. This suggests that the current implementation still overemphasizes local density when distinguishing between structured and unstructured crowding. Conversely, false negatives, such as subtle conflicts within a small group in a sparse region, often exhibit weak or inconsistent attention to truly anomalous episodes, reflecting the fact that their impact on global motion patterns is limited.
Overall, these visualizations demonstrate that the EMU provides a meaningful and human-interpretable mechanism for tracing decisions back to specific past episodes and zones. This property is particularly valuable in safety-critical settings, where operators require not only accurate alarms but also transparent explanations of why a given event has been classified as anomalous.
5. Conclusions
This study introduces HiMeLSTM, a novel and lightweight neural architecture designed for robust and efficient crowd anomaly detection. The model addresses a critical gap in real-world surveillance by combining the temporal modeling strengths of Long Short-Term Memory (LSTM) networks with a computationally inspired episodic memory mechanism, mirroring the contextual recall functions of the hippocampus. This unique integration allows the model to not only process sequential data but also to retain and retrieve relevant long-term contextual information, enabling the detection of subtle and temporally dispersed anomalous behaviors that conventional models often miss.
A key design philosophy behind HiMeLSTM is its operational practicality. The framework relies solely on standard visual data from surveillance feeds, augmented by a straightforward manually defined zone mapping system. This approach deliberately eliminates the dependency on complex, expensive, and often privacy-invasive infrastructure such as distributed sensor networks or detailed digital twins. By doing so, HiMeLSTM significantly lowers the barrier to deployment, offering a solution that is both easier to implement and capable of delivering real-time performance on resource-constrained hardware.
The core innovation, the Episodic Memory Unit (EMU), acts as a dynamic repository of past events. It enhances the model’s contextual awareness by allowing it to cross-reference current crowd dynamics with stored patterns. This is crucial for distinguishing between benign crowd formations (like orderly queues) and genuine threats, thereby reducing false alarms triggered by “structured crowding.”
Comprehensive experimental results on benchmark datasets demonstrate that HiMeLSTM consistently outperforms a range of existing models, including Vanilla LSTM, ConvLSTM, and Transformers, across key metrics such as accuracy, Anomaly Detection Rate (ADR), and F1-score. The model achieves superior anomaly sensitivity without incurring the prohibitive computational overhead of larger architectures, striking an optimal balance between performance and efficiency.
Looking forward, this research opens several promising avenues. Future work will focus on the practical deployment and optimization of HiMeLSTM on edge computing devices, further minimizing latency and power consumption for on-site analysis. Furthermore, we plan to explore multimodal extensions of the architecture, integrating complementary data streams such as audio analysis for detecting auditory anomalies (e.g., screams, crashes) to create a more holistic and robust understanding of complex crowd scenarios, ultimately advancing the frontier of intelligent automated surveillance systems.