A Model for Detecting Abnormal Elevator Passenger Behavior Based on Video Classification

Abstract: In human behavior detection, video classification based on deep learning has become a prevalent technique. Existing models, however, capture behavior characteristics inadequately, which limits their recognition accuracy. To address this issue, this paper proposes a new model that improves upon the existing PPTSM model. Specifically, our model employs a multi-scale dilated attention mechanism, which enables it to integrate multi-scale semantic information and capture the characteristic information of abnormal human behavior more effectively. Additionally, to enhance the characteristic information of human behavior, we propose a gradient flow feature information fusion module that integrates high-level semantic features with low-level detail features, enabling the network to extract more comprehensive features. Experiments conducted on an elevator passenger dataset containing four abnormal behaviors (door picking, jumping, kicking, and door blocking) show that the top-1 accuracy (Acc) of our model is improved by 10% compared to the PPTSM model, reaching 95%. Moreover, experiments with four publicly available datasets (UCF24, UCF101, HMDB51, and Something-Something-v1) demonstrate that our method achieves results superior to PPTSM by 6.8%, 6.1%, 21.2%, and 3.96%, respectively.


Introduction
Elevators serve as essential vertical conveyance mechanisms within modern urban infrastructure, providing convenience for people in daily work and life, with safety as a top priority [1,2]. Incidents of irregular conduct within these confined spaces, ranging from door picking, jumping, kicking, door blocking, inadvertent falls, and misplaced items to deliberate acts of vandalism, can create safety risks. Such behaviors not only endanger passengers but also compromise the integrity of the elevator system [3][4][5].
In recent years, machine learning methods have been employed to detect aberrant behaviors in elevators. Zhu et al. [6] extracted people and objects from surveillance videos through background subtraction, counted the number of people in an image, and combined this with the image entropy of motion history images (MHIs) to determine whether passengers fell or acted violently. Sun et al. [7] introduced a detection approach based on the kinetic energy of corners to identify aggression among elevator occupants. Similarly, Liu et al. [8] combined multi-feature fusion with machine vision to detect passenger falls inside the elevator. While these strategies can identify unusual human behaviors, they suffer from weak feature extraction from images and videos and fail to model temporal information effectively.
The advent of deep learning has brought remarkable advances in video understanding and human activity recognition [9]. Lan et al. [10] proposed a two-stream neural network algorithm for detecting anomalous behaviors such as falls, physical altercations, and tampering with doors in elevator environments. Chen et al. [11] developed an enhanced two-stream neural network architecture based on a 3D ResNet framework, tailored to detecting falls among elevator passengers using edge computing. Shi et al. [12] used the keypoint detection algorithm OpenPose to detect human skeleton points and, from them, abnormal door-blocking and door-picking behaviors among passengers.
While the previously mentioned methods are capable of identifying abnormal behaviors of passengers in elevator cabins, they encounter challenges such as substantial computational demand, sophisticated detection algorithms, and stringent hardware requirements.
This paper uses the PPTSM network as the baseline for detecting abnormal elevator passenger behavior and proposes improvements to it. The resulting video classification model is named the Temporal Shift Module Network with Col-Depth-Point Convolution and Multi-Scale Dilated Attention (TSM-CDPMSDANet). It builds on the ResNet50 backbone network combined with the Temporal Shift Module (TSM) [13].
First, we gathered a diverse set of passenger behavior videos in elevator cabins, focusing on four behaviors: door picking, jumping, kicking, and door blocking. This produced a dataset dedicated to abnormal passenger behavior. Next, we combined the ResNet50 architecture with the TSM, which is inserted at the front of a residual block in the network. A Multi-Scale Dilated Attention (MSDA) [14] module enables the network to concentrate on the subtle features that indicate abnormal behaviors, improving detection accuracy. We further developed a gradient flow feature fusion module, which merges high-level semantic features with low-level details, reducing the network's parameter count while improving its recognition capability. Finally, we applied data augmentation throughout training to enrich dataset diversity and improve the model's robustness, together with the cross-entropy loss function to mitigate problems arising from class imbalance. Our experiments show that the model substantially improves detection accuracy for the four abnormal behaviors.

Human Behavior Recognition
Traditional human detection techniques rely on handcrafted features and template matching. Gal et al. [15] proposed a human detection method that combines random forests and the Hough transform. Although this method performs well under specific conditions, it is limited by its feature selection and extraction and cannot capture the complex context and diverse features required for human detection. The development of deep learning has brought new possibilities to human detection. Donahue et al. [16] used a single-stream method that integrates a convolutional neural network (CNN) with a long short-term memory (LSTM) network, exploiting the CNN's efficient image feature extraction and the LSTM's strength in processing time-series data to recognize human movements more accurately. Simonyan [17] and Feichtenhofer et al. [18] adopted the two-stream method, which recognizes behaviors effectively by combining prior information (such as optical flow) with image features. TS-LSTM and temporal inception [19] build on LSTM and fuse high-level spatial and temporal features to learn hidden features over time. By properly leveraging temporal information at multiple scales, better performance can be achieved even when feature vectors, rather than feature maps, are used as inputs. However, these methods use only 2D convolution and 2D pooling operations, and a network that extracts features with a 2D CNN alone captures global features insufficiently.
Three-dimensional convolutional neural networks (3D ConvNets) consider temporal and spatial information jointly, so they show great potential for processing video sequences, actions, and motion trajectories and have been applied to human action recognition [20][21][22][23]. The authors of SlowFast [24] observe that human actions can unfold very quickly. They therefore treat time as a special dimension and design a dedicated branch that fuses information at different temporal resolutions to capture high-frame-rate motion. These methods reduce the number of parameters, thereby reducing the hardware resources required to run the model. However, their recognition accuracy for human behavior is not high.
Using skeletal key points for behavior recognition significantly improves the accuracy of human behavior recognition. The ST-GCN [25] network extracts skeleton sequence features through spatio-temporal graph convolution. Skeleton-based information captures motion very well, which greatly improves the network's recognition of human behavior. To address the unreasonable adjacency matrix strategy used in ST-GCN, the two-stream adaptive graph convolutional network (2s-AGCN) [26] proposed an improved strategy in which the topology of the graph is learned end-to-end by the backpropagation (BP) algorithm, either uniformly or individually. This increases the flexibility of graph construction and adapts better to varied data samples, making the model more suitable for action recognition tasks. LDT-NET [27] is a lightweight behavior recognition model based on skeleton points and designed with a multi-stream neural network architecture. It uses a multi-stream depthwise separable convolutional neural network and achieves efficient action recognition by integrating feature information at three different scales. Although these methods can improve the accuracy of behavior recognition, they face problems such as high data quality requirements and complex computation.

Detection of Abnormal Human Behavior in Elevators
Feng et al. [28] designed an elevator monitoring system that combines infrared sensors and smoke alarms to monitor the abnormal behavior of passengers in an elevator cabin. Zhao et al. [29] detected elevator doors through background subtraction and used the YOLOv3 algorithm to count elevator passengers. However, they did not account for interference from moving objects during elevator door detection or for the double counting of persons. Wu et al. [30] implemented functions for judging the status of elevator doors, counting elevator passenger flow without duplicates, and identifying passenger attributes. Yan Qi et al. [31] designed and developed a video monitoring system for abnormal elevator behavior using the edge computing paradigm, which identifies abnormal image sequences and assesses abnormal passenger behavior. Shu et al. [32] used the Lucas-Kanade optical flow method to extract motion speed data and constructed a comprehensive feature vector from it. This feature vector fuses angular momentum information and summarizes key features of the target, which makes it effective in detecting fights and violent behaviors in elevators. GraftNet [33] is a solution for fine-grained multi-label recognition tasks, used to identify human body features and to monitor anomalies in human flow data through unsupervised learning, capturing abnormal behaviors that may lead to security risks or illegal activities. TSTR-ResNet [34] introduces a technique for recognizing abnormal elevator passenger behavior (the TSTR module) based on time offset and reinforcement. The module captures spatiotemporal information and uses wavelet decomposition to extract the low-frequency subbands of the image, reducing high-frequency noise interference and thereby improving the accuracy of abnormal behavior detection. However, the above methods still identify the abnormal behavior of passengers in an elevator with low accuracy.

Overview
The architecture of our model, named the Temporal Shift Module Network with Col-Depth-Point Convolution and Multi-Scale Dilated Attention (TSM-CDPMSDANet), is illustrated in Figure 1. It consists of two components: a backbone and a head. The PPTSM model, implemented in the PaddlePaddle framework, serves as our baseline; it uses ResNet50_vd [35] as its backbone network and adds the temporal shift module from the TSM model. As depicted in Figure 2, the temporal shift module is integrated into the two-dimensional convolutional neural network, improving the network's ability to model temporal information without requiring extra computational resources. As Figure 1 shows, the backbone extracts deep-level features from the input image, and the head then performs classification. The MSDA module is placed before the head because the backbone output already provides a deep-level feature representation of the image. The proposed CDPWC module is distributed across the feature extraction stages of the backbone, which effectively improves model performance. In TSM-CDPMSDANet, the input image first passes through an initial processing block that extracts preliminary shallow features. These features then pass through four successive feature extraction stages that extract deeper characteristics. At the head of the model, the MSDA mechanism models the interaction between local and sparsely distributed image patches. This captures multi-scale semantic information and improves the model's precision in distinguishing subtle differences between abnormal behaviors. It also improves the model's handling of complex data and its classification accuracy across diverse abnormal activities.
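The temporal shift operation mentioned above can be illustrated with a minimal sketch. It is not the PPTSM implementation: spatial dimensions are omitted, and the shifted fraction (`fold_div`, a hypothetical parameter name) is an assumption, since the text does not state the ratio used.

```python
def temporal_shift(clip, fold_div=8):
    """Shift part of the channels along the time axis.

    clip: list of T frames, each frame a list of C channel values
          (spatial dimensions are omitted to keep the sketch small).
    fold_div: hypothetical parameter; 1/fold_div of the channels are
              taken from the next frame and another 1/fold_div from
              the previous frame. The rest stay in place.
    """
    num_frames = len(clip)
    num_channels = len(clip[0])
    fold = num_channels // fold_div
    shifted = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t + 1      # first group: value comes from the next frame
            elif c < 2 * fold:
                src = t - 1      # second group: value comes from the previous frame
            else:
                src = t          # remaining channels are left unshifted
            if 0 <= src < num_frames:
                shifted[t][c] = clip[src][c]
            # positions with no source frame keep the zero padding
    return shifted
```

Because the operation only moves existing values between neighbouring frames, it adds no multiplications, which is the sense in which temporal modeling comes without extra computation.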
To enable the network to capture richer behavioral feature information, our backbone incorporates the Col-Depth-Point Wise Convolution (CDPWC) module. This module, which uses depthwise separable convolution, is integrated at the input-processing level and in each subsequent feature extraction stage. It optimizes the gradient flow information, improving the quality of the feature representation and the model's understanding of behavioral patterns. Inserting this module reduces the overall parameter count of the network while enabling the model to attain higher behavior recognition accuracy. The detailed procedures of the TSM-CDPMSDANet training pipeline are summarized in Algorithm 1.

Multi-Scale Dilated Attention Module (MSDA)
The primary challenge in detecting passengers' abnormal behaviors within elevator cabins is capturing the diverse ways in which such behaviors appear. To address this, we introduced MSDA into the head of our model, enabling it to concentrate more effectively on the key features of each behavior. As depicted in Figure 3, the MSDA module uses a multi-head structure (8 heads in our study) to divide the channels of the input feature map into distinct head subsets. Each head subset undergoes Sliding Window Dilated Attention (SWDA), which assigns each head its own dilation rate, treats each original pixel as a token, and selects 8 neighboring pixels according to the dilation rate. In our model, a kernel with a dilation rate of 1 captures local details, while a kernel with a dilation rate of 2 captures information over a wider range. Self-attention is then computed over the sparsely sampled patches within the designated area. Using different dilation rates across heads fuses feature information at multiple scales. The computation formula for SWDA is as follows.
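A form consistent with this description, assuming the standard scaled dot-product attention of the dilated-attention design in [14], is sketched here. The symbols $r$ (dilation rate of a head), $d_k$ (key dimension), $H$ and $W$ (feature map height and width), and $K_r$, $V_r$ (keys and values sampled with dilation $r$) are notational assumptions rather than definitions taken from the text.

```latex
X = \mathrm{SWDA}(Q, K, V, r), \qquad
x_{ij} = \mathrm{Softmax}\!\left(\frac{q_{ij} K_r^{\top}}{\sqrt{d_k}}\right) V_r,
\qquad 1 \le i \le W,\; 1 \le j \le H,

\text{where } K_r, V_r \text{ are taken at positions }
\{(i', j') \mid i' = i + p\,r,\; j' = j + q\,r\},
\qquad -\lfloor w/2 \rfloor \le p, q \le \lfloor w/2 \rfloor .
```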
In the aforementioned formula, the matrices Q, K, and V correspond to the query, key, and value components, respectively. Each row of these matrices is a distinct feature vector. SWDA takes the coordinates (i, j) of the original feature map as the query point and sparsely selects the associated keys and values around it. It then computes self-attention within a sliding window of size w × w centered on the query point. For edge positions of the feature map, SWDA applies zero padding so that the window remains well defined. Through this sparse sampling of keys and values around the query point, SWDA satisfies the sparsity requirement while preserving locality. This allows the model to capture longer-range dependencies and improves its ability to discern complex spatial-temporal patterns.
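The sampling and attention steps described above can be sketched for a single query position as follows. This is an illustration under simplifying assumptions, not the authors' code: the feature vectors are used directly as queries, keys, and values (identity projections), a single head is shown, and the function names are hypothetical.

```python
import math


def dilated_window_coords(i, j, r, window=3):
    """Coordinates of the window x window keys sampled around (i, j) with dilation r."""
    half = window // 2
    return [(i + p * r, j + q * r)
            for p in range(-half, half + 1)
            for q in range(-half, half + 1)]


def swda_at(feat, i, j, r, window=3):
    """Self-attention output for the query at (i, j).

    feat: H x W grid of d-dimensional vectors, used as queries, keys and
          values alike (identity projections, an assumption of this sketch).
    Sampled positions outside the map are replaced by zero vectors.
    """
    height, width, dim = len(feat), len(feat[0]), len(feat[0][0])
    query = feat[i][j]
    keys = []
    for y, x in dilated_window_coords(i, j, r, window):
        inside = 0 <= y < height and 0 <= x < width
        keys.append(feat[y][x] if inside else [0.0] * dim)
    # scaled dot-product scores between the query and each sampled key
    scores = [sum(a * b for a, b in zip(query, k)) / math.sqrt(dim) for k in keys]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    # the sampled keys double as values in this sketch
    return [sum(w * k[c] for w, k in zip(weights, keys)) / total for c in range(dim)]
```

In a multi-head setting, each head would call `swda_at` on its own channel subset with its own dilation rate `r`, and the per-head outputs would then be concatenated.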

Gradient Flow Information Aggregation Block Col-Depth-Point Convolution (CDPWC)
To minimize the network's overall parameter count while improving the precision of detecting passengers' abnormal behaviors, we developed a Gradient Flow Information Aggregation Block, denoted as CDPWC, which uses depthwise separable convolutions. As illustrated in Figure 4, the input feature map first passes through a 1 × 1 convolution. This step applies c_out 1 × 1 convolution kernels at each pixel position (i, j) of the input feature map (h × w × c_in), producing c_out values per position. We set c_out = 2 × c_in, thus doubling the number of channels of the input feature map. This reshapes the feature map, providing a larger space for feature expression and laying the foundation for more complex processing. After this channel expansion, the output is divided in two by a split operation. One part is processed by a series of Bottleneck_DPWC modules, which extract more sophisticated feature representations that capture the salient information of the input more effectively. The other part is retained and recombined with the processed features through a concatenation operation. This fusion integrates high-level semantic features with the fine details of lower-level features and gives the model access to a range of feature levels and gradients, which contributes to classifying abnormal behaviors more accurately. Note that, in Figure 4, h, w, and c_in (c_out) refer to the pixel height, width, and channel dimension of the feature map, respectively; ×0.5(n + 2)c_out means that the number of feature maps in the channel dimension is half of (n + 2)c_out, where n is the number of Bottleneck_DPWC blocks.
Departing from the conventional C2f framework, the Bottleneck_DPWC module we devised incorporates Depthwise Separable Convolution (DSC) [36] to reduce the model's parameter count while improving its predictive accuracy. As illustrated in Figure 5, the DSC process begins with a DepthWise Convolution (DWC) applied to the input feature map. This convolves each channel of the feature map individually, and the per-channel results are then stacked to form the output feature map. Because each channel is convolved with a single kernel, this step neither expands nor compresses the channel dimension, and it does not exploit the correlation between channels at the same spatial position. To let features in different channels interact, the output of the DWC is passed through two PointWise Convolution (PWC) blocks. These blocks use 1 × 1 convolutions to combine features across channels, increasing the capacity of the feature representation while reducing the model's parameter count. This channel-wise interaction strengthens the model's expressive power, making it better at modeling complex patterns and relations in the data. The calculation process of the CDPWC module is shown in the following formula, where BD represents Bottleneck_DPWC and + represents the concat operation. Assume that the input feature map has M channels and size D_F × D_F, the convolution kernel is D_K × D_K, and the output feature map has N channels and size D_F × D_F. Then, the parameter count of an ordinary convolution is D_K × D_K × M × N, whereas that of a depthwise separable convolution is D_K × D_K × M + M × N. The ratio of the two is 1/N + 1/(D_K × D_K), so DSC requires far fewer parameters and computations than traditional convolution.
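The parameter comparison and the channel bookkeeping of the split-and-concatenate design can be checked numerically. The sketch below assumes bias-free layers and, for the separable case, one depthwise convolution followed by a single pointwise convolution (the Bottleneck_DPWC described above uses two pointwise blocks, so its exact count would differ). The function names are hypothetical.

```python
def conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def dsc_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution plus one 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


def cdpwc_concat_channels(c_out, n):
    """Channels after concatenating the two split halves (0.5 * c_out each)
    with the outputs of n Bottleneck_DPWC blocks (0.5 * c_out each)."""
    return (n + 2) * c_out // 2


if __name__ == "__main__":
    standard = conv_params(3, 64, 128)
    separable = dsc_params(3, 64, 128)
    print(standard, separable, separable / standard)
    print(cdpwc_concat_channels(128, 2))
```

For a 3 × 3 kernel with 64 input and 128 output channels, the ratio equals 1/128 + 1/9, matching the expression in the text.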

Experiment

Experimental Environment and Parameter Settings
The experiments were conducted on a server running the Ubuntu 20.04 operating system and equipped with an NVIDIA GeForce RTX 3090 GPU. Model training used the PaddlePaddle deep learning framework. Momentum was selected as the optimization algorithm, with a momentum coefficient of 0.9. The learning rate was decreased following a cosine annealing schedule. All models were trained from scratch, without pre-trained weights, and with the same data augmentation protocol. The remaining training parameters were kept at the framework's default settings for consistency across the experiments. Table 1 lists the training parameters adopted for our proposed model. Our code is available at https://github.com/Bradly-s/TSM-CDPMSDANet (accessed on 17 June 2024).
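The optimizer settings described above can be written out in a few lines. This is a framework-independent sketch, not the PaddlePaddle training code; the base learning rate and step counts are left as arguments because their values are not stated in the text, and the function names are hypothetical.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` under cosine annealing from base_lr down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def momentum_step(param, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update for a scalar parameter.

    The velocity accumulates past gradients with coefficient `momentum`
    (0.9 in the experiments), and the parameter moves against it.
    """
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity
```

With this schedule the learning rate starts at `base_lr`, reaches half of it at the midpoint of training, and decays to `min_lr` at the end.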

Elevator Passenger Abnormal Behavior Dataset
To study abnormal passenger behaviors in indoor elevator scenarios, we collected video footage covering four categories: door picking, jumping, kicking, and door blocking. From each video, segments of 4 to 10 s were extracted, and individual frames were then taken from these segments. The resulting images were organized into a dataset formatted for experimental use. The number of images per category is as follows: 4334 for door picking, 2891 for jumping, 4775 for kicking, and 6311 for door blocking. Data examples are shown in Figure 6.
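The class balance implied by these counts can be computed directly. The sketch below uses only the four counts given above; the variable and function names are hypothetical.

```python
# Frame counts per category, as reported for the elevator dataset.
ELEVATOR_FRAME_COUNTS = {
    "door picking": 4334,
    "jumping": 2891,
    "kicking": 4775,
    "door blocking": 6311,
}


def class_shares(counts):
    """Fraction of the dataset that each class contributes."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


if __name__ == "__main__":
    for label, share in class_shares(ELEVATOR_FRAME_COUNTS).items():
        print(f"{label}: {share:.1%}")
```

The four counts sum to 18,311 images, with door blocking the largest class and jumping the smallest, which is the imbalance the training procedure has to cope with.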

Public Behavior Recognition Dataset
To assess the generalization capability and robustness of our model, we evaluated it on four publicly available datasets, described as follows.
(1) UCF101 dataset. UCF101 [37] is a comprehensive action recognition dataset that covers 101 categories of actions, with roughly 100 videos per category. It offers great diversity in action types, together with large variations in camera movement, object appearance and pose, object scale, viewpoint, background clutter, and lighting conditions. These factors made UCF101 one of the most challenging action recognition datasets at the time of its release.
(2) UCF24 dataset. UCF101-24 is a subset of the broader UCF101 dataset that uses an alternative labeling scheme. In this subset, each video contains at most one type of target behavior, and bounding boxes (bboxes) annotate only the individuals engaged in that behavior.
(3) HMDB51 dataset. The HMDB51 [38] dataset comprises 51 categories of behaviors, each containing at least 101 video instances, for a total of 6766 short video clips. The behaviors are grouped into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction, and body movements for human interaction. In our experiments, for each category, 70% of the videos were assigned to the training set and the remaining 30% to the test set.
(4) Something-Something-v1 dataset. The Something-Something-v1 [39] video dataset is distributed as a large TGZ archive split into parts of up to 1 GB, with a total download volume of 25.2 GB. It contains 108,499 videos, each stored as JPG images with a fixed height of 100 pixels and a variable width. The JPG images are extracted from the original footage at 12 frames per second. The dataset is divided into a training set of 86,017 videos, a validation set of 11,522 videos, and a test set of 10,960 videos.
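The per-category 70/30 split used for HMDB51 can be sketched as below. How the videos were actually assigned is not stated, so the random shuffle, the fixed seed, and the function name are assumptions made for illustration.

```python
import random


def split_per_class(videos_by_class, train_fraction=0.7, seed=0):
    """Split each class separately so that both sets keep the class proportions.

    videos_by_class: dict mapping a class label to a list of video identifiers.
    Returns (train, test), each a list of (video, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(videos_by_class):
        videos = list(videos_by_class[label])
        rng.shuffle(videos)
        cut = int(round(train_fraction * len(videos)))
        train += [(video, label) for video in videos[:cut]]
        test += [(video, label) for video in videos[cut:]]
    return train, test
```

Splitting within each category, rather than over the pooled videos, guarantees that every class appears in both the training and the test set.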

Evaluation Indicators
Our video classification model was evaluated on the dataset of four categories of abnormal elevator passenger behavior and on four publicly available datasets. To measure the model's precision, we used classification accuracy (Top-1 accuracy, abbreviated as Acc), defined as the number of correctly classified samples divided by the total number of samples, where a sample counts as correct when its highest-scoring predicted class matches the ground-truth label.
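As a minimal sketch of this metric, assuming per-class scores are available for every sample (the function name is hypothetical):

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label.

    scores: list of per-sample score lists, one score per class.
    labels: list of integer class indices, one per sample.
    """
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # index of the top score
        if predicted == label:
            correct += 1
    return correct / len(labels)
```

For example, if three of four clips receive their highest score on the labeled class, the function returns 0.75.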

Comparative Experiment
On the dataset of abnormal elevator passenger behaviors that we curated, our model was compared against leading models in the field. The results are given in Table 2, which shows that our model surpasses these models in Top-1 accuracy (Acc). Compared with the PPTSM model, our model improved by 10%, achieving a Top-1 accuracy of 95%. Even compared with the more accurate SlowFast model, our model improved by 3%. To assess the generalizability and robustness of our model, we conducted experiments on four public behavior recognition datasets: UCF24, UCF101, HMDB51, and Something-Something-v1. We evaluated both our enhanced model and the baseline model in these experiments. The comparison of Top-1 accuracy (Acc) is presented in Table 3.
Table 3 shows the performance gains of our improved model over the original model. On the UCF24 dataset, Top-1 accuracy (Acc) increased by 6.8%. On the three splits of the UCF101 dataset (split1, split2, and split3), our model improved by 3.78%, 4.83%, and 9.74%, respectively, an average of about 6.1%. On the HMDB51 dataset, our model improved by 21.24% over the original model. By contrast, on the Something-Something-v1 dataset, the improvement was more modest at 3.959%. The smaller gain on Something-Something-v1 can be attributed to its focus on interactions between objects, for which understanding actions often relies heavily on contextual information. Because such context is limited when extracted solely from video clips, this may explain the smaller improvement. The results across these public datasets support the robustness and generalization capability of our model, including in scenarios that require contextual interpretation.

Ablation Experiment
To assess the effectiveness of our proposed model, we carried out a series of ablation studies on our dataset of four abnormal passenger behaviors, namely door picking, jumping, kicking, and door blocking. These experiments were designed to evaluate the Multi-Scale Dilated Attention (MSDA) we integrated and the proposed Gradient Flow Information Aggregation Block (CDPWC). They also aimed to confirm whether the Gradient Flow Information Aggregation Block effectively reduces the model's parameter count. Using the video classification model PPTSM as our baseline, we introduced our modules incrementally. The results are detailed in Table 4.
Table 4 shows that integrating the C2f module increased accuracy by 5% while reducing the model's parameters by 2.515 M and its size by 9.61 MB. However, this was offset by a rise in computational demand of 3.006 GFLOPs. Adding the proposed CDPWC module to the baseline on its own also raised accuracy by 5%, and it additionally reduced the parameters by 7.507 M, the computational load by 7.533 GFLOPs, and the model size by 28.63 MB. The parameter count decreases because, in the CDPWC module, the original two ordinary 3 × 3 convolutions are replaced with depthwise separable convolutions in Bottleneck_DPWC. According to Equation (5), the computational complexity of the model decreases significantly as a result. Adding the MSDA module alone raised accuracy by 10%, at the cost of 16.785 M more parameters, 6.576 GFLOPs more computation, and an increase in model size from 90.02 MB to 154.03 MB. In summary, the C2f module improves accuracy and reduces both the parameter count and the model size, but at the expense of increased computational complexity. The CDPWC module, by contrast, improves accuracy while markedly reducing the parameter count, the computation, and the model size. MSDA uses varying dilation rates to expand the receptive field, allowing it to discern fine feature details associated with different behaviors. Consequently, MSDA improves accuracy but, owing to its transformer-like structure, also increases model complexity. In this study, we combined the CDPWC and MSDA modules. This combination reaches the peak accuracy of 95%, capturing subtle characteristics of distinct behaviors while strengthening the model's ability to identify a variety of abnormal behaviors.
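The reported parameter and size changes are mutually consistent if the weights are stored as 32-bit floats and "MB" denotes 2^20 bytes. Both are assumptions made for this check, not statements from the text, and the function name is hypothetical.

```python
BYTES_PER_FLOAT32 = 4  # assumed storage format of the weights


def params_to_mib(params_millions):
    """Storage needed for the given number of parameters (in millions), in MiB."""
    return params_millions * 1e6 * BYTES_PER_FLOAT32 / 2 ** 20


if __name__ == "__main__":
    # (parameter change in millions, reported size change in MB)
    reported = {
        "C2f": (2.515, 9.61),
        "CDPWC": (7.507, 28.63),
        "MSDA": (16.785, 154.03 - 90.02),
    }
    for name, (delta_params, delta_size) in reported.items():
        print(name, round(params_to_mib(delta_params), 2), round(delta_size, 2))
```

Under these assumptions, each parameter change reproduces the corresponding size change to within a few hundredths of a megabyte.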

Visualization
To illustrate the comparative performance of our model and the PPTSM model, we plotted the prediction accuracies of both models for each category of the UCF24 and HMDB51 datasets. These comparisons are shown in Figures 7 and 8. Figure 7 shows that, on the UCF24 dataset, our model is marginally less accurate for the categories "GolfSwing," "Skiing," and "SoccerJuggling." Nevertheless, it outperforms the PPTSM model in the majority of categories and achieves better overall results. Figure 8 shows a similar improvement on the HMDB51 dataset: apart from minor declines in the "smile" and "walk" categories, our improved model is superior overall. These results indicate that the improvement holds across a diverse range of categories.

Conclusions
In this paper, we studied the detection of passengers' abnormal behavior in elevators. To this end, we collected data on four abnormal behaviors of passengers in elevator cabins (door picking, jumping, kicking, and door blocking) from real-life scenarios and established a proprietary dataset to support this study. Our goal was to provide a robust solution to the limited accuracy of existing models for human behavior detection in elevators. Accordingly, we proposed the TSM-CDPMSDANet model. In this model, the multi-scale dilated attention (MSDA) mechanism captures the subtle properties of various abnormal behaviors and enables the model to identify passengers' behaviors effectively at different scales. The other key module is the gradient flow information aggregation block Col-Depth-Point Convolution (CDPWC), which fuses low-level detail features with high-level semantic features so that the two complement each other.
The experimental results demonstrate that our model achieved 95% accuracy in identifying abnormal behavior on our established dataset. Additionally, it outperforms the advanced models we compared against on this dataset and the PPTSM baseline on the public datasets, indicating robust generalization. Our work thus lays the groundwork for practical applications of human behavior detection with improved accuracy. However, due to practical constraints (the fixed-angle camera systems in the elevators available to us and the insufficient number of passengers to create occlusion), we could not capture images of occluded passengers or multi-angle views inside the elevator. We aim to address this limitation in future work.

Figure 1. Architecture diagram of the Temporal Shift Module Network with Col-Depth-Point Convolution and Multi-Scale Dilated Attention (TSM-CDPMSDANet).

Algorithm 1. TSM-CDPMSDANet training pipeline.
Input: D(x, y): the labeled data, where x denotes the frames of the videos and y denotes the label.
Output: Trained TSM-CDPMSDANet model F(•).

Figure 6. Examples of four types of abnormal behavior.

Table 2. Comparative experiments on the abnormal behavior dataset of passengers in elevators.

Table 3. Comparative experiments on the public behavior recognition datasets.

Table 4. Comparison of the evaluation of each module in the ablation experiment.