Article

Cross-Modal Behavioral Intelligence in Regard to a Ship Bridge: A Rough Set-Driven Framework with Enhanced Spatiotemporal Perception and Object Semantics

1 School of Computer Science and Engineering, Wuhan Institute of Technology, Wuhan 430205, China
2 State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan 430079, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(13), 7220; https://doi.org/10.3390/app15137220
Submission received: 25 May 2025 / Revised: 23 June 2025 / Accepted: 24 June 2025 / Published: 26 June 2025

Abstract

Aberrant or non-standard operations by ship drivers are a leading cause of water traffic accidents, making the development of real-time and reliable behavior detection systems critically important. However, the environment within a ship’s bridge is significantly more complex than typical scenarios, such as vehicle driving or general security monitoring, which results in poor performance when applying generic algorithms. In such settings, both the accuracy and efficiency of existing methods are notably limited. To address these challenges, this paper proposes a cross-modal behavioral intelligence framework designed specifically for a ship’s bridge, integrating multi-target tracking, behavior recognition, and feature object association. For multi-driver tracking, the framework employs ByteTrack, a high-performance multi-object tracker whose association mechanism uses both high- and low-confidence detection boxes to maintain stable tracking even under occlusion or motion blur. Combined with an improved Temporal Shift Module (TSM) algorithm for behavior recognition, which effectively resolves issues concerning target association and action ambiguity in complex environments, the proposed framework achieves a Top-1 accuracy of 82.1% on the SCA dataset. Furthermore, the method incorporates a multi-modal decision optimization strategy based on spatiotemporal correlation rules, leveraging YOLOv7-e6 for simultaneous personnel and small object detection, and introduces the Accuracy of Focused Anomaly Recognition (AFAR) metric to evaluate anomaly detection performance. This approach raises the anomaly detection rate to 81.37%, with an overall accuracy of 80.66%, significantly outperforming single-modality solutions.

1. Introduction

A ship’s bridge, serving as the command center, is a critical workspace, equipped with various navigation and communication tools, such as radars, a Global Positioning System (GPS), electronic chart systems, an Automatic Identification System (AIS), a Very High Frequency (VHF) radio communication system, and more. Multiple ship drivers are typically required to operate the ship and process the large amount of information coming from diverse equipment. Violations of maritime transportation regulations by ship crew members can have severe consequences, including serious injuries, substantial property damage, significant water pollution, and other adverse effects. For instance, on 6 August 2017, a recreational motor cruiser, James 2, and a commercial fishing vessel, Vertrouwen, collided in Sussex Bay, southeast of Shoreham harbor, resulting in James 2 sinking and three of its four crew members drowning, due to ineffective lookouts, a lack of lifejackets, improper navigation lights, and the watchkeeper being distracted by social media and vessel administration [1]. This incident underscores the necessity of the MCA’s MGN 638 [2], which provides essential guidance on maintaining proper lookouts and avoiding distractions to enhance maritime safety. Therefore, it is crucial to enhance ship safety by developing an autonomous recognition system for monitoring crew behavior.
While driver behavior identification in automobiles has made significant progress [3], similar technologies have found limited application within the domain of ship navigation. Compared to automobiles, the environment within a ship’s bridge is more complex. Firstly, the ship’s bridge is notably spacious. In contrast to a car, which typically has only one driver, a ship requires multiple crew members to operate the vessel together. Furthermore, automobile drivers are generally restricted to a seated position when driving. Ship crew may exhibit various postures, such as walking, sitting, or standing, with a significantly broader range of operational areas. Additionally, due to issues related to lighting or occlusion, the accuracy in detecting small objects, such as cigarettes or mobile phones, tends to be lower on a ship. Many algorithms perform excellently in scenarios with well-defined boundaries [4]; however, their efficacy diminishes significantly within the complex environment of a ship’s bridge. The challenges faced in this environment increase the requirements for developing and optimizing behavior recognition algorithms.
Based on the input, behavior recognition algorithms can be classified into two categories: single-frame recognition and multi-frame recognition. Single-frame recognition algorithms attempt to identify actions from a single image. Object detection algorithms can help determine whether a crew member is smoking or playing on a phone by detecting cigarettes, mobile phones, or other items in the crew member’s hand. However, these objects are often small and have ambiguous features, leading to higher chances of false detection, especially in complex environments, like a ship’s bridge. Consequently, object detection algorithms may produce unsatisfactory results, due to their limitations in handling such ambiguity. In contrast, multi-frame behavior recognition algorithms process multiple consecutive frames simultaneously. By analyzing the sequences of frames, these algorithms can effectively extract motion features, such as the act of raising a hand while smoking or turning the steering wheel while driving, along with other temporal dynamics. This capability significantly enhances the feature extraction performance, providing more robust and accurate behavior identification over time. Moreover, the collection and analysis of high-quality data are crucial for maximizing model performance. The use of high-quality data not only enhances the training process, but also ensures that the model can generalize well to new, unseen scenarios, thereby substantially increasing its recognition accuracy.
To address these gaps, this paper introduces a novel cross-modal behavioral intelligence framework, driven by rough set theory, which uniquely integrates spatiotemporal perception and object semantics, for ship bridge environments. Our approach distinguishes itself from prior work in regard to the following three key aspects:
(1) Original dataset construction. We present the first publicly available ship crew behavior dataset, the SCA. This dataset comprises 4788 annotated action sequence samples, captured from video surveillance systems within the ship bridge of passenger vessels operated by Zhoushan Haihua Passenger Transport Co., Ltd. During navigation, these vessels maintain a bridge team consisting of one helmsman, and two lookouts. The SCA dataset covers seven behavior categories: sitting, standing, walking, driving, lying, playing on a phone, and smoking. Each behavior sample contains a minimum of eight consecutive frames to ensure adequate temporal context for action recognition. Additionally, we provide a specialized subset of 3400 images, with detailed object-level annotations (humans, phones, cigarettes), to support feature object detection tasks. By combining comprehensive behavior sequences with precise object annotations in authentic maritime operational contexts, the SCA dataset addresses a significant research gap and enables reproducible studies in regard to maritime computer vision applications.
(2) Hierarchical recognition architecture with enhanced spatiotemporal modeling. We propose a lightweight framework that innovatively combines ByteTrack multi-object tracking with an improved Temporal Shift Module (TSM). Our integration of EfficientNet-B3 as the backbone network, for computational efficiency, and the Coordinate Attention (CA) mechanism, for spatiotemporal feature enhancement, achieves a Top-1 accuracy of 82.1% on the SCA dataset, surpassing existing methods in complex ship environments.
(3) Rough set-driven decision optimization. Departing from conventional unimodal approaches, we introduce a feature object-assisted method that establishes spatiotemporal association rules between personnel and objects via rough set theory. The proposed joint reasoning engine and multi-source fusion strategy increase the Accuracy of Focused Anomaly Recognition (AFAR) by 7.07% and the Comprehensive Classification Accuracy (CCA) by 2.2% above the baseline methods.

2. Literature Review

2.1. Multi-Object Tracking

Multi-object tracking involves identifying moving targets in each frame of a video series and assigning the same identification label to the same object across successive frames, resulting in separate motion trajectories for individual objects [5].
Wojke et al. [6] proposed the DeepSORT algorithm, which utilized a proprietary residual network trained on pedestrian re-identification datasets to extract deep appearance features mapped onto a hypersphere. These features were then used to determine the minimal cosine distance between identified and tracked targets. During the association phase, the Mahalanobis distance and cosine distance, linearly weighted according to the tracking prediction box generated from Kalman filtering and the detection box in the current frame, were used to compute a cost matrix. Cascade matching was then applied to pair the anticipated boxes with the detected boxes. This compact tracking strategy incorporated motion and appearance information, achieving high multi-object tracking accuracy (MOTA) in real time, while efficiently minimizing occlusion concerns. Zhang et al. [7] presented ByteTrack, a simple, efficient, and general method for data association. Instead of focusing only on high-scoring detection boxes, it associates each detection box with a track by leveraging its resemblance to pre-existing trajectories. This approach utilizes the similarities between low-scoring detection boxes and existing tracks to distinguish genuine targets from background detections. The method thus fully utilizes both high- and low-scoring detection boxes and can be directly integrated into standard tracking techniques, such as DeepSORT, resulting in varying degrees of improvement. With the growing popularity of Transformers, new techniques like TrackFormer [8] and TransTrack [9] have emerged, which combine Transformers with multi-object tracking. These techniques enhance tracking performance by drawing inspiration from the principles of Transformer encoders and decoders. However, the computational burden they introduce is significant and cannot be overlooked.
Multi-target tracking algorithms currently have a wide range of applications. Hong et al. [10] combined DeepSORT with YOLOv5 in order to follow Bactrocera minax and recognize its grooming behavior. Although this method provided a viable approach for multi-object behavior identification, the tracker still requires significant improvements. Wang et al. [11] used multi-target tracking for pedestrian detection to gather data for autonomous vehicle driving. They enhanced the YOLOv7 backbone network by incorporating spatial depth convolutional layers and applied the concept of convolutional structural reparameterization to create a comprehensive feature extraction block. This optimized block was integrated into the DeepSORT tracker, resulting in an improved system for pedestrian detection. Sukkar et al. [12] combined YOLOv8 with StrongSORT, systematically tuning the parameters across various dimensions to achieve optimal adaptability. From a vehicular perspective, these two techniques show promising results in addressing pedestrian tracking issues. Zhao et al. [13] introduced a new Intersection over Union (IoU) calculation method and applied it to SORT, DeepSORT, and other methods, achieving excellent results in regard to the multi-object tracking of ships in waterways. However, due to the limited experimental circumstances employed to evaluate this strategy, its generalizability cannot be guaranteed.

2.2. Behavior Recognition

With the development of hardware resources in recent years, behavior recognition has become a hot topic. Many methods have added increasingly deeper network layers to improve recognition efficacy, which has led to an explosion in the number of parameters involved. However, using large models to enhance recognition invariably results in a higher threshold for model usage. Achieving a balance between accuracy and model size is crucial in many application scenarios.
Convolutional neural network (CNN)-based behavior recognition algorithms are typically divided into two types, namely 2D CNN and 3D CNN, depending on whether they employ convolution in the time dimension. While 3D CNN-based methods can achieve promising performance, they are computationally intensive. To address this issue, several studies have aimed to reduce the computational complexity and simplify the structure to enhance the real-time performance of 3D convolutions, with notable examples including C3D [14], I3D [15], P3D [16], and R (2 + 1) D [17]. Although these methods simplify the structure and enhance the performance over typical 3D convolutions, they still struggle with real-time processing.
While 2D convolution is less effective than 3D convolution at capturing temporal features, it requires significantly fewer parameters and computational resources. This makes 2D convolution more suitable for real-time applications [18]. Many studies have proposed methods that use 2D convolution to extract features similar to those achieved with 3D convolution. For instance, Wang et al. [19] introduced the Temporal Segment Network (TSN), a video classification algorithm based on a 2D CNN. TSN employs a sparse temporal sampling strategy and video-level supervision to effectively learn from entire action videos. It consists of spatial stream ConvNets and temporal stream ConvNets that process short, sparsely sampled snippets from the video sequence. Each stream generates an initial action class prediction, and the final video-level prediction is formed by aggregating predictions from these snippets. However, this method does not adequately capture the temporal relationships between snippets, which limits its ability to accurately recognize events defined by precise timing cues.
Lin et al. [20] proposed TSM, a generic and effective module, with high efficiency and performance. TSM shifts part of the channels along the temporal dimension on the feature map, thereby increasing information exchange among neighboring frames. It can be integrated into a 2D CNN to achieve temporal modelling, without the need for additional computation or parameters. TSM performs well in scenarios with clear boundaries. In the context of sharing internal video footage from self-driving cars, Rodrigues et al. [21] proposed a fusion surveillance solution for identifying violent and other unruly behaviors of car occupants, using algorithms such as SlowFast, TSN, and TSM to achieve excellent recognition and real-time performance. Xue et al. [22] suggested a behavior recognition model based on SlowFast and the EPSA attention mechanism to address the issues arising from complex lighting conditions. However, the effectiveness of these methods is significantly hindered in environments with dim lighting and unclear boundaries.

2.3. Feature Extraction Network

Since feature extraction networks form the foundation of image processing algorithms, they are crucial for recognition algorithms. With the advent of ResNet [23], these networks have seen significant improvements in recent years. The primary innovation of ResNet is the introduction of residual connections, which facilitate the training of extremely deep layers, while mitigating the vanishing gradient problem. ResNet has demonstrated impressive performance in tasks such as image classification, object detection, and semantic segmentation. Due to the increased use of embedded systems and mobile devices, there is a growing demand for lightweight 2D feature extraction networks. Traditional deep networks often perform inadequately under resource-constrained conditions, necessitating more efficient solutions. The introduction of the MobileNet [24] series addressed this need. These networks utilize depthwise separable convolutions, which reduce the computational requirements by employing separate depthwise and pointwise convolutions. Consequently, they are well-suited to mobile devices and embedded systems. EfficientNet [25] further advanced this field by proposing automated network structure adjustments. This approach optimizes model performance by balancing the network depth, width, and resolution. Not only does this increase model efficiency, but it also enables automatic adjustments based on the specific requirements of different tasks.
Research on using the EfficientNet network for behavior recognition is currently underway. Luo et al. [26] applied an improved version of the EfficientNet backbone network to classify behavior data collected by LiDAR, albeit based on a limited dataset. Zhou et al. [27] integrated EfficientNet as the feature extraction component in a dual-stream network and enhanced its performance with a multi-head attention mechanism, effectively addressing the challenge of distinguishing similar actions. For detecting car driver behavior, Khan et al. [28] proposed incorporating the EfficientNet-B0 network into a channel attention mechanism. Through the use of comparative analysis with established methodologies, they achieved the highest recognition accuracy with their proposed model. However, it is important to note that channel attention only establishes relationships at the channel level. Incorporating multidimensional attention mechanisms into the backbone network could further improve model performance.

2.4. Rough Set

Rough set (RS) theory, introduced by Zdzislaw Pawlak [29], is a mathematical tool to deal with vagueness and uncertainty. The core concept of RS theory involves utilizing existing knowledge bases to approximate and characterize imprecise or uncertain information through established knowledge repositories. This method provides an objective way to describe and process uncertainties without requiring any prior information beyond the dataset being processed. It relies solely on intrinsic data relationships derived from equivalence classes and indiscernibility relations.
Integrating RS with machine learning algorithms can enhance the accuracy of behavior classification. By learning from historical behavior data, systems can predict behaviors that may occur in the future. Senthilnathan [30] developed a three-level model based on RS to understand complex Indian consumer retail shopping behavior in the grocery segment, treating behavioral traits as rules. Henry [31] incorporated RS with reinforcement learning, which could be applied to a monocular vision system for tracking moving targets and studying swarm behavior. Chu et al. [32] introduced a multi-modal and multi-criteria conflict analysis approach, combining RS with deep learning. This approach used a deep residual network as a feature extractor and classified objects according to decision attributes derived from RS. Sanchez-Riera et al. [33] made use of deep learning to train a set of gestures, which served as a rough estimate of hand pose and orientation. Bell et al. [34] proposed a video mining approach for learning patterns of behavior, granulating detailed action sequences into molecular behavioral units, then applying RS for attribute reduction and extracting decision rules from the resulting feature table, enabling interpretable pattern mining that was validated through preliminary experiments. The generated rule sets are highly explainable and can be directly translated into decision-making criteria for behavior recognition.
The previous discussion highlights that traditional behavior recognition methods struggle to capture the temporal evolution characteristics of actions, typically achieving recognition only in narrow-domain single-person scenarios. Monomodal visual approaches exhibit weak discrimination capabilities for similar behaviors and fail to resolve ambiguities within actions. Consequently, this paper integrates ByteTrack with TSM to address real-time multi-target tracking and behavior recognition challenges in multi-driver scenarios. Additionally, it introduces a decision optimization method based on RS, effectively distinguishing ambiguous actions, such as smoking and playing on a phone.

3. The Proposed Approach

The bridge of a ship is a unique environment, with several drivers operating the vessel simultaneously. As illustrated in Figure 1, this paper proposes a cross-modal behavioral intelligence recognition approach.
The procedure for this approach is as follows: Firstly, the YOLOv7 model is utilized for the detection of individuals and feature objects. Subsequently, the ByteTrack algorithm is introduced for the tracking of multiple drivers, with an improved TSM method applied for the preliminary recognition of target behaviors. To address the challenge of differentiating between two types of abnormal driver behaviors, namely smoking and playing on a phone, an optimization approach aided by feature objects is proposed based on RS. This involves the development of personnel object spatiotemporal correlation rules and the implementation of a joint reasoning engine for actions and feature objects, leading to enhanced identification accuracy for these similar behaviors.
The Multi-person behavior recognition algorithm is given as Algorithm 1:
Algorithm 1 Multi-person behavior recognition algorithm
Input: Real-time video stream from ship bridge cameras
Output: Anomaly alerts with behavior classification

1: Initialize YOLOv7 model, ByteTrack tracker, Improved TSM algorithm, Decision Optimization Engine
2: FOR each frame in video_stream DO
3:     Perform Personnel and Feature Object Detection (frame)
4:     Perform Personnel and Feature Object Localization (frame, YOLOv7)
5:     IF Personnel detected THEN
6:         Perform Personnel Tracking Support (frame, ByteTrack)
7:         Assign Feature Object Scores (frame, YOLOv7)
8:     END IF
9:     Perform Spatiotemporal Alignment Module (frame)
10:    Associate Personnel–Feature Object (frame, tracking results, feature scores)
11:    Perform Behavior Recognition (frame, Improved TSM)
12:    Calculate Action Score (behavior recognition results)
13:    Generate Anomaly Alert (action score, Decision Optimization Engine)
14: END FOR
15: RETURN Anomaly Alerts

3.1. Personnel and Feature Object Detection Based on YOLOv7

The YOLO series of detection algorithms are well-known for their lightweight and efficient performance, making them highly suitable for edge scene requirements. In each frame, YOLOv7 is used to identify all the drivers and abnormal objects, such as mobile phones and cigarettes. For the detected individuals, the ByteTrack algorithm is employed for multi-target tracking.
ByteTrack follows the tracking-by-detection paradigm. It achieves person tracking by associating all the person detection boxes and assigns unique identifiers (IDs) to each individual. Additionally, ByteTrack stores continuous action frames of individuals, providing essential data support for behavior recognition. This combination of YOLOv7 and ByteTrack ensures both accurate detection and robust tracking, laying a solid foundation for subsequent behavior analysis.
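To make the hand-off between these stages concrete, the following Python sketch shows one way to buffer tracked person crops for the downstream behavior recognition step. It is an illustrative sketch rather than the paper's implementation: `yolo_detect`, the tracker's `update` method, and the `track_id`/`box` attributes are assumed placeholder interfaces, and the eight-frame buffer mirrors the minimum sequence length of the SCA dataset.

```python
from collections import defaultdict, deque

SEQ_LEN = 8  # minimum number of consecutive frames per behavior sample (SCA dataset)

def crop(frame, box):
    """Crop a person region from the frame given an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = map(int, box)
    return frame[y1:y2, x1:x2]

def run_tracking_stage(video_stream, yolo_detect, tracker):
    """Detect people per frame, keep identities with ByteTrack-style tracking,
    and buffer the last SEQ_LEN crops of each identity for the TSM stage.

    `yolo_detect(frame)` is assumed to return (person_boxes, object_boxes) with
    confidence scores; `tracker.update(person_boxes)` is assumed to return
    tracks with stable integer IDs. Both are placeholders, not a fixed API.
    """
    buffers = defaultdict(lambda: deque(maxlen=SEQ_LEN))  # track_id -> recent crops
    for frame in video_stream:
        person_boxes, object_boxes = yolo_detect(frame)
        tracks = tracker.update(person_boxes)
        for track in tracks:
            buffers[track.track_id].append(crop(frame, track.box))
            if len(buffers[track.track_id]) == SEQ_LEN:
                # A full clip is available for this driver: hand it to the
                # improved TSM model together with the detected feature objects.
                yield track.track_id, list(buffers[track.track_id]), object_boxes
```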

3.2. Improved TSM for Behavior Recognition

Considering the particularities of a ship’s bridge, this research improves the TSM model, as depicted in Figure 2. The process involves several stages: random sampling, feature extraction, feature augmentation, and classification. Firstly, videos are divided into multiple segments at the random sampling layer, where one frame is randomly selected from each segment to serve as the pre-input for the model. These sampled frames then undergo pre-processing using various image enhancement techniques, such as padding, random cropping, and flipping. This pre-processing step helps maintain relevant feature information, while reducing noise and improving the model’s convergence rate. In the feature extraction layer, the EfficientNet-B3 network is employed as the core feature extraction module; owing to its reduced parameter count, it significantly reduces the computational cost, while enhancing recognition performance. After obtaining the final feature map, we introduce a feature enhancement layer that models the feature map in two dimensions using the CA module. This allows the network to simultaneously extract latent internal connections along both the channel and spatial dimensions. Finally, the feature map is classified using a fully connected layer acting as a classifier. The fused final category score is determined by uniformly averaging the outputs of a set of frames through consensus. The consensus mechanism plays a pivotal role in this framework. Given that the model processes multiple frames sampled from video segments, their respective outputs may exhibit slight variations. To achieve robust result aggregation, the proposed method employs consensus-based averaging of the classification scores across all the frame outputs. This stage effectively utilizes the inherent redundancy in multi-frame information to mitigate individual frame prediction errors, thereby generating more stable and accurate fused category scores.
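For reference, the temporal shift operation at the heart of TSM and the consensus averaging described above can be sketched in PyTorch as follows. This is a minimal sketch of the general TSM idea rather than the paper's exact code; the 1/8 fold ratio is the default commonly used in the original TSM work and is only assumed here.

```python
import torch

def temporal_shift(x, n_segments, fold_div=8):
    """Shift part of the channels along the temporal dimension.

    x: tensor of shape (batch * n_segments, channels, height, width).
    One 1/fold_div slice of channels is shifted one step backward in time,
    another slice one step forward, and the remaining channels are untouched.
    """
    nt, c, h, w = x.shape
    n_batch = nt // n_segments
    x = x.view(n_batch, n_segments, c, h, w)
    fold = c // fold_div

    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift toward the past
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift toward the future
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # no shift
    return out.view(nt, c, h, w)

def segment_consensus(logits, n_segments):
    """Average per-frame classification scores into a video-level prediction."""
    nt, num_classes = logits.shape
    return logits.view(nt // n_segments, n_segments, num_classes).mean(dim=1)
```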
Table 1 outlines the structural parameters of the behavior recognition network, with each row describing a stage $i$ containing $L_i$ layers, an input resolution of $H_i \times W_i$, and $C_i$ output channels.
In the classical TSM, random cropping is employed to augment the training samples. However, the bridge images are typically rectangular and closely match the extent of the person, so a direct random crop of the source image removes most of the person information, while omitting random cropping altogether reduces the robustness of the model. To guarantee data randomization while preventing the loss of significant image features, we pad the image with gray to a 1:1 aspect ratio before random cropping.
Similarly, after the target tracking stage, we apply an expansion coefficient to enlarge the target box before cropping the target image from the original video frame. The reason is that some person movements in the original video overlap with the background; for instance, the driving action usually overlaps with the wheel, so vital details would be missed if the person region were clipped tightly. We therefore crop the expanded target region, pad it with gray, apply random cropping, and then pass it to the model for inference.
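The two pre-processing steps described above, box expansion followed by gray padding to a 1:1 ratio and random cropping, can be sketched as follows. The expansion coefficient of 1.2 and the gray value of 128 are illustrative assumptions, not values reported in this paper.

```python
import random
import numpy as np

def expand_box(box, frame_w, frame_h, coef=1.2):
    """Enlarge a bounding box by `coef` around its center, clamped to the frame."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * coef, (y2 - y1) * coef
    return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
            min(frame_w, int(cx + w / 2)), min(frame_h, int(cy + h / 2)))

def pad_to_square(img, gray=128):
    """Pad an (H, W, 3) crop with gray pixels so its aspect ratio becomes 1:1."""
    h, w, _ = img.shape
    size = max(h, w)
    canvas = np.full((size, size, 3), gray, dtype=img.dtype)
    top, left = (size - h) // 2, (size - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

def random_square_crop(img, crop_size):
    """Randomly crop a square region from the gray-padded image (crop_size <= side)."""
    size = img.shape[0]
    top = random.randint(0, size - crop_size)
    left = random.randint(0, size - crop_size)
    return img[top:top + crop_size, left:left + crop_size]
```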
The core of the model is the feature extraction network, which has a significant influence on recognition performance. Because of its residual structure, the ResNet series typically improves model performance as the number of stacked layers rises. Nevertheless, the model’s size directly affects how difficult it is to deploy, so the balance between scale and effect must be maintained before simply enlarging the model to improve accuracy. Model performance is affected by the width, depth, and resolution of the feature extraction network. EfficientNet introduces a scaling method that uniformly scales depth, width, and resolution using a simple yet highly effective compound coefficient. Google provided eight structures with progressively larger scales, ranging from EfficientNet-B0 to EfficientNet-B7, which have far fewer parameters and computations than the ResNet series of feature extraction networks. Furthermore, EfficientNet models outperform ResNet networks of comparable size on public datasets.
This research compares the performance of different feature extraction networks on the self-constructed SCA dataset, and the results are shown in Figure 3. The horizontal axis reflects the computational complexity of the backbone network, while the vertical axis represents the Top-1 accuracy of the behavior identification. The comparison shows that EfficientNet-B3 offers the best trade-off: it achieves high recognition accuracy with a significant reduction in computational complexity and parameter count compared to the ResNet series of networks. As a result, EfficientNet-B3 is used as the feature extraction network in this study.
As the feature extraction network gets deeper, the feature map becomes smaller, the number of channels grows, and the extracted features become more abstract. As a result, many of the features are redundant and not useful for classification. We therefore insert a feature enhancement layer before the fully connected layer. It primarily employs a CA module to strengthen the network’s representation of useful information, while weakening information that is irrelevant to classification. The CA module embeds positional information into channel attention. Specifically, it exploits two 1D global pooling operations to aggregate the input features along the vertical and horizontal directions, respectively, into two separate direction-aware feature maps. These two feature maps with embedded direction-specific information are then separately encoded into two attention maps, each of which captures the long-range dependencies of the input feature map along one spatial direction. The positional information is thus preserved in the generated attention maps. Both attention maps are then applied to the input feature map via multiplication to emphasize the representations of interest.
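A compact PyTorch rendering of a CA block along these lines is shown below. It follows the published Coordinate Attention design (direction-wise pooling, a shared 1 × 1 encoding, and two sigmoid-gated attention maps); the reduction ratio and the use of ReLU in place of the original h-swish activation are illustrative simplifications rather than the configuration used in this paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: pool along H and W separately, encode jointly,
    then re-weight the input with two direction-aware attention maps."""

    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # (B, C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = self.pool_h(x)                          # (B, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        att_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        att_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * att_h * att_w
```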

3.3. Multi-Source Feature Fusion Based on RS

Following the localization and recognition of individuals and characteristic objects, it is imperative to establish their associations across both temporal and spatial dimensions. From a temporal perspective, consideration must extend beyond the current frame of the drivers being tracked. It is necessary to comprehensively analyze all the feature objects within the temporal span covered by the tracking drivers. This approach ensures a holistic understanding of the drivers’ dynamics over time. Spatially, assessing the intersection between the bounding boxes of the tracking drivers and those of the feature objects is crucial for determining their spatial relationships accurately.
Specifically, for a tracked target $P_i$ in frame $t$, its associated feature object set $O_i^t$ is
$$O_i^t = \left\{ o_j \mid IoU\left(B_p^t, B_o^t\right) > \delta \right\}$$
where $B_p^t$ denotes the bounding box of the tracked driver in frame $t$, while $B_o^t$ represents the bounding box of a feature object appearing in frame $t$. $IoU(B_p^t, B_o^t)$ indicates the IoU between the driver and the abnormal object. The parameter $\delta$ denotes the association threshold, which is set to 0.3.
Within a behavior sampling period, the associated characteristic object set $O_i$ for $P_i$ is defined as the union of the per-frame associated object sets over the sampled frame sequence $\{F_{t-n}, F_{t-n+1}, \ldots, F_t\}$:
$$O_i = \bigcup_{k=t-n}^{t} O_i^k$$
where $O_i^k$ represents the set of associated characteristic objects for $P_i$ in frame $k$, and $n$ denotes the sampling length for behavior recognition. By employing the spatiotemporal alignment module to detect and associate abnormal characteristic objects in each frame, the model can effectively mitigate the issue of insufficient sensitivity to instantaneous behaviors.
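A direct implementation of these two association rules only requires a per-frame IoU test against the threshold $\delta$ and a union over the sampling window, as in the following sketch (the detection record layout is assumed for illustration).

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate_objects(person_box, object_dets, delta=0.3):
    """Feature objects associated with one tracked driver in a single frame."""
    return [obj for obj in object_dets if iou(person_box, obj["box"]) > delta]

def associate_over_window(person_boxes_per_frame, objects_per_frame, delta=0.3):
    """Union of the per-frame associations over the behavior sampling window."""
    associated = []
    for person_box, object_dets in zip(person_boxes_per_frame, objects_per_frame):
        associated.append(associate_objects(person_box, object_dets, delta))
    return associated  # one association list per sampled frame
```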
Table 2 is a decision table formulated based on feature data extracted through the use of spatiotemporal association rules and behavior classification outcomes derived from the improved TSM. The condition attributes ensemble comprises:
(1) The Confidence of Associated Objects. Denoted as $C_o$, this metric represents the average confidence level of associated features of the same category within the sampling time frame, pertinent to the driver under observation.
(2) The Coverage Rate of Anomalous Features. Specifically defined for objects such as cigarettes, $S_o$ denotes the coverage ratio of associated anomalous objects during the behavioral sampling period of the driver under scrutiny. Mathematically, it is expressed as:
$$S_o(\text{cigarette}) = \frac{n_1}{n}$$
where $n_1$ denotes the number of frames in which the associated feature object (e.g., a cigarette) appears, and $n$ represents the total number of sampled frames. An elevated feature coverage rate signifies a higher frequency of interactions between the observed drivers and the feature objects, which is positively correlated with the likelihood of the corresponding behavior’s occurrence.
(3) The Behavior Confidence. $C_b$ refers to the output generated by the improved TSM model concerning the confidence level of the detected behaviors.
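Given the per-frame associations, the object-related condition attributes can be computed as below; the dictionary layout of the detections is an assumption made for illustration.

```python
def condition_attributes(associated_per_frame, category):
    """Compute the object confidence C_o and coverage rate S_o for one object
    category over the n sampled frames of a tracked driver.

    associated_per_frame: list (length n) of per-frame association lists, where
    each detection is a dict such as {"label": "cigarette", "conf": 0.78, "box": (...)}.
    """
    n = len(associated_per_frame)
    confs = []
    frames_with_object = 0
    for dets in associated_per_frame:
        hits = [d["conf"] for d in dets if d["label"] == category]
        if hits:
            frames_with_object += 1
            confs.extend(hits)
    c_o = sum(confs) / len(confs) if confs else 0.0   # average associated confidence
    s_o = frames_with_object / n if n else 0.0        # coverage rate n1 / n
    return c_o, s_o
```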
The Anomalous Behavior Set $D_a$ includes non-driving behaviors (smoking), non-driving behaviors (playing on a phone), distracted driving (smoking), and distracted driving (playing on a phone). The basic mutually exclusive behavior set $D_b$ encompasses four fundamental states: sitting, standing, walking, and lying. These mutually exclusive behaviors are foundational for classifying various activities, providing a clear distinction between different types of behaviors observed within the study context.
Smoking and playing on a phone share significant similarities in their motion characteristics, such as the interaction of a hand-held object with the person’s head and the pronounced arm movements involved in both answering a phone call and bringing a cigarette to the mouth. The TSM model, despite its efficacy in extracting temporal features from adjacent frames, encounters challenges in accurately distinguishing the subtle local differences between these two behaviors. This limitation often results in confusion, suggesting a potential action association between smoking and playing on a phone during the behavior classification process. To address this issue, the confidence levels of the two behaviors are proportionally integrated to generate an action score. This action score is subsequently combined with the object feature score to compute the final abnormal behavior score for classification purposes. The decision rule set is as follows:
$$D_a = \begin{cases} \text{non-driving (playing on a phone)}, & \text{if } C_b(\text{driving}) < \tau \ \text{and} \ M(\text{playing on a phone}) > \varphi \\ \text{non-driving (smoking)}, & \text{if } C_b(\text{driving}) < \tau \ \text{and} \ M(\text{smoking}) > \varphi \\ \text{distracted driving (smoking)}, & \text{if } C_b(\text{driving}) \geq \tau \ \text{and} \ M(\text{smoking}) > \varphi \\ \text{distracted driving (playing on a phone)}, & \text{if } C_b(\text{driving}) \geq \tau \ \text{and} \ M(\text{playing on a phone}) > \varphi \end{cases}$$
where $\varphi$ denotes the threshold score for identifying abnormal behaviors, while $\tau$ represents the threshold score for assessing the driving state (the distracted driving rules apply when the driving confidence reaches $\tau$, and the non-driving rules apply otherwise). Specifically, $M(\text{smoking})$ and $M(\text{playing on a phone})$ refer to the scores for the abnormal behaviors corresponding to smoking and playing on a phone, respectively, which are defined as follows:
$$M(\text{smoking}) = \alpha \cdot S_o(\text{cigarette}) \cdot C_o(\text{cigarette}) + \beta \cdot C_b(\text{smoking}) + \gamma \cdot C_b(\text{playing on a phone})$$
$$M(\text{playing on a phone}) = \alpha \cdot S_o(\text{phone}) \cdot C_o(\text{phone}) + \beta \cdot C_b(\text{playing on a phone}) + \gamma \cdot C_b(\text{smoking})$$
where $\alpha$ denotes the weight coefficient associated with the feature score, whereas $\beta$ and $\gamma$ represent the weight coefficients corresponding to the direct and indirect action scores, respectively. Using a grid search on the SCA dataset, with the Comprehensive Classification Accuracy (CCA) as the evaluation metric, a search range of [0, 1], and a step size of 0.05, it is experimentally determined that the highest CCA of 89.4% is attained with the parameter configuration $(\alpha, \beta, \gamma, \varphi, \tau) = (0.55, 0.25, 0.2, 0.7, 0.7)$.
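Under this parameter configuration, the anomaly scores and the decision rule set can be evaluated as in the following sketch. The function names and input layout are illustrative, and the driving-state split at $\tau$ encodes the non-driving versus distracted driving distinction described above.

```python
ALPHA, BETA, GAMMA, PHI, TAU = 0.55, 0.25, 0.20, 0.70, 0.70

def anomaly_scores(c_b, c_o, s_o):
    """Fuse object and action evidence into M(smoking) and M(playing on a phone).

    c_b: behavior confidences from the improved TSM, e.g. {"smoking": 0.4, ...}.
    c_o, s_o: per-category object confidence and coverage rate dictionaries.
    """
    m_smoking = (ALPHA * s_o["cigarette"] * c_o["cigarette"]
                 + BETA * c_b["smoking"] + GAMMA * c_b["playing_on_a_phone"])
    m_phone = (ALPHA * s_o["phone"] * c_o["phone"]
               + BETA * c_b["playing_on_a_phone"] + GAMMA * c_b["smoking"])
    return m_smoking, m_phone

def classify_anomaly(c_b, c_o, s_o):
    """Apply the decision rule set D_a; returns the triggered anomaly labels."""
    m_smoking, m_phone = anomaly_scores(c_b, c_o, s_o)
    driving = "distracted driving" if c_b["driving"] >= TAU else "non-driving"
    alerts = []
    if m_smoking > PHI:
        alerts.append(f"{driving} (smoking)")
    if m_phone > PHI:
        alerts.append(f"{driving} (playing on a phone)")
    return alerts
```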
As discussed above, the basic mutually exclusive behavior set $D_b$ comprises four behaviors: sitting, standing, walking, and lying. Given the mutual exclusivity of these behaviors, which dictates that only one behavior can occur at any given time, the final classification of an observed behavior is determined by identifying the behavior associated with the highest confidence score $C_b$.
$$D_b = \arg\max \left\{ C_b(\text{sitting}),\ C_b(\text{standing}),\ C_b(\text{walking}),\ C_b(\text{lying}) \right\}$$
Therefore, if $C_b(\text{sitting})$ is the highest, $D_b$ is classified as sitting; if $C_b(\text{standing})$ is the highest, $D_b$ is classified as standing; and similarly for each behavior within the set.
The Top-1 accuracy is effective for evaluating the recognition of mutually exclusive behaviors. However, it inadequately assesses the detection effectiveness for anomalous behaviors in scenarios where basic behaviors and abnormal behaviors overlap. To more comprehensively evaluate the anomaly alerting capability of the methods, the Accuracy of Focused Anomaly Recognition (AFAR) metric has been designed based on the following principles:
(1) Decoupling the influence of the driving states. This involves disregarding the accuracy of distinguishing between driving states (non-driving vs. distracted driving) in regard to the classification of anomalous behaviors, focusing instead solely on the identification of specific anomalous behavior categories (e.g., smoking, playing on a phone).
(2) The differential penalty mechanism. Considering the practical application context, this mechanism imposes differential penalties for cross-category misjudgments (e.g., misclassifying smoking as playing on a phone) and missed detections (the failure to detect anomalous behavior), based on the severity of the misjudgment consequences. This ensures that the evaluation not only considers correct identifications, but also appropriately penalizes errors, according to their impact.
The AFAR is defined as follows:
$$AFAR = \frac{N_{correct}}{N_{correct} + N_{confusion} + 2 N_{miss}}$$
where $N_{correct}$ represents the number of samples correctly classified into the anomaly behavior categories, $N_{confusion}$ denotes the number of misclassifications, and $N_{miss}$ indicates the number of anomaly samples not detected. Given that missed detections generally result in more severe consequences than misclassifications in real-world scenarios, the formula imposes a heavier penalty on misses to reflect their impact. This AFAR metric only evaluates the detection effectiveness on anomaly samples.
In order to prevent an overemphasis on anomaly classification at the expense of normal sample detection, an evaluation metric CCA that comprehensively considers both anomaly and normal behavior detection accuracies is introduced, which is given by:
$$CCA = \frac{N_{correct\_normal} + N_{correct\_anomaly}}{N_{total}}$$
where $N_{correct\_normal}$ represents the number of normal samples correctly identified as not exhibiting any anomalous behavior, $N_{correct\_anomaly}$ denotes the number of anomaly samples that are correctly classified, and $N_{total}$ is the total number of samples.
This approach ensures that the evaluation framework not only emphasizes the critical aspect of detecting anomalous behaviors accurately, but also maintains high standards for the correct identification of normal behaviors, thereby providing a comprehensive assessment of the model’s performance.
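Both metrics reduce to simple counts and can be computed as follows; the counts in the usage comment are made up purely for illustration.

```python
def afar(n_correct, n_confusion, n_miss):
    """Accuracy of Focused Anomaly Recognition: missed detections are penalized
    twice as heavily as cross-category confusions."""
    return n_correct / (n_correct + n_confusion + 2 * n_miss)

def cca(n_correct_normal, n_correct_anomaly, n_total):
    """Comprehensive Classification Accuracy over normal and anomaly samples."""
    return (n_correct_normal + n_correct_anomaly) / n_total

# Example with made-up counts, for illustration only:
# afar(250, 30, 26) ≈ 0.753 and cca(600, 250, 954) ≈ 0.891
```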

4. Experiments

4.1. Datasets

To validate the proposed approach, a dataset of footage from CCTV surveillance systems installed on ship bridges is utilized. This dataset, designed to enhance model robustness, includes a diverse array of bridge environments. The video frames are meticulously cropped and categorized with behavior labels to create a specialized dataset for driver behavior detection.
We have created a ship bridge behavior SCA dataset. The dataset encompasses seven distinct behaviors: sitting, standing, walking, lying, driving, smoking, and playing on a phone. It comprises approximately 4788 samples in total, with each sample containing at least eight consecutive action frames. The samples are evenly distributed across the seven behavioral classes to ensure balanced representation. Table 3 shows the sample distribution of this dataset, while Figure 4 shows some of the samples.
Additionally, from the SCA dataset, frames containing feature objects are extracted and annotated to construct a subset for personnel and object detection. This subset consists of 3400 images, covering various cockpit scenarios and includes three object categories: humans, mobile phones, and cigarettes.

4.2. Implementation Details

An experimental platform is established based on the Windows 10 operating system, leveraging an Nvidia GeForce RTX 3060 GPU with 12 GB of memory and an Intel Core i5-12400F CPU for computational tasks. The framework utilizes PyTorch version 1.13+cu116. The SCA dataset is employed for testing purposes, with the relevant training parameters detailed in Table 4. Considering the limited size of the SCA dataset, transfer learning is adopted in each experiment to mitigate the risk of model overfitting. Specifically, the feature extraction networks ResNet and EfficientNet are utilized, with weights pretrained on the ImageNet [35] dataset, followed by fine-tuning on the SCA dataset. This methodology ensures the enhanced generalization capabilities of the model, despite the constrained dataset size.
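The transfer-learning setup can be reproduced along these lines with torchvision's ImageNet-pretrained EfficientNet-B3; the optimizer and hyperparameters shown are placeholders rather than the values listed in Table 4.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 7  # sitting, standing, walking, lying, driving, smoking, playing on a phone

# Load an ImageNet-pretrained EfficientNet-B3 backbone and replace its classifier
# head with one matching the SCA behavior categories.
model = models.efficientnet_b3(weights=models.EfficientNet_B3_Weights.IMAGENET1K_V1)
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, NUM_CLASSES)

# Fine-tune the whole network on the SCA dataset (placeholder hyperparameters).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
```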

4.3. Results

4.3.1. Results of Personnel and Feature Object Detection

The subset is partitioned into training and validation sets, with a ratio of 8:2. The detection performance of various YOLOv7 models is evaluated on this dataset, as shown in Table 5. The experimental results presented in Table 5 reveal an important trade-off in model performance. The baseline YOLOv7 model achieves the fastest processing speed, with a latency of just 12.5 milliseconds, owing to its smaller 640 px × 640 px input resolution. However, this advantage comes at the cost of a reduced detection capability, particularly for small objects like cigarettes and mobile phones that appear at low resolution in the input frames. The YOLOv7-e6 variant demonstrates superior performance, with a recall rate of 0.745, and achieves the highest mAP@0.5:0.95 score of 0.514. This model exhibits particularly impressive efficiency, delivering comparable mAP@0.5 accuracy, while achieving an 18.6% reduction in inference latency relative to the computationally heavier YOLOv7-d6 architecture. The YOLOv7-e6e model, while attaining peak precision of 0.838 among all the tested models, suffers from a substantially lower recall rate of 0.703. This performance imbalance raises significant concerns regarding potential missed detections of critical feature objects, which could severely compromise the system’s ability to accurately identify and classify anomalous behaviors in practical deployment scenarios.
After a comprehensive evaluation of the detection accuracy, small-object sensitivity, and deployment feasibility, YOLOv7-e6 is selected as the core detection model throughout this research. Its superior recall rate and robust feature extraction capabilities provide a stable and reliable foundation for target localization, which is critical for subsequent behavior recognition tasks, and the architecture demonstrates an optimal balance between detection performance on challenging small targets and practical implementation requirements.

4.3.2. Results of Driver Tracking

The tracking efficacy is affected by the confidence threshold and the IoU threshold. Therefore, experiments are conducted to evaluate the tracking performance under various configurations of these parameters. The assessment is based on three metrics: MOTA, precision, and recall. As shown in Figure 5, the optimal performance is achieved when both the confidence threshold and the IoU threshold are set to 0.4, resulting in an MOTA score of 0.957, a precision of 0.97, and a recall of 0.991.

4.3.3. Results of Behavior Recognition

In regard to the behavior recognition module, a comparative analysis is conducted among the classical TSM, the improved TSM, and the popular SlowFast algorithm, using the SCA dataset. The comparison is based on three dimensions: GMAC, the number of parameters, and Top-1 accuracy. As can be observed from Table 6, the improved TSM algorithm demonstrates superior performance compared to both the original TSM model and the SlowFast model.
To further validate the effectiveness of the proposed improved TSM method, ablation experiments are conducted, and the results are presented in Table 7.
The baseline model M1 is based on the TSM architecture, with the EfficientNet-B3 backbone. M2 improves upon M1 by incorporating image augmentation, utilizing a super-resolution reconstructed dataset, and refining the pre-processing strategy. As a result, the recognition accuracy increases by 1.29 percentage points. Building on M2, M3 introduces the CA attention module into the feature extraction layers. With only a marginal increase in the computational cost, this modification effectively enhances the model’s representational capability, achieving a Top-1 accuracy of 82.1%.
To comprehensively evaluate the recognition performance of the improved TSM model, comparative experiments are conducted on two widely used benchmark datasets, UCF101 and HMDB51, which contain more diverse action categories than typical evaluation datasets. The Top-1 accuracy serves as the primary evaluation metric, with the detailed comparative results systematically presented in Table 8.
The improved TSM is benchmarked against state-of-the-art approaches to action recognition, including the 2D baseline TSN, I3D, and Vision Transformer-based methods [38]. The experimental results demonstrate that the improved TSM model achieves a competitive Top-1 accuracy of 96.7% and 75.1% based on the UCF101 and HMDB51 datasets, respectively. While slightly lower than the ViT-based AMD method, it exhibits significant advantages in regard to its lightweight design, with substantially fewer parameters, namely 10.93 M, compared to Transformer-based architectures.
The feature extraction capability is a critical determinant of a model’s performance. By comparing the shallow feature maps of identical input samples across different model architectures, deeper insights into the initial stages of feature recognition can be gained, as illustrated in Figure 6.
In regard to the TSM–ResNet series of models, the shallow feature maps exhibit a notable homogenization tendency. As shown in Figure 6b,c, the feature responses at different locations show a high level of similarity. This phenomenon indicates that the model fails to effectively capture the multidimensional characteristics of the input data in regard to the initial layers, thereby limiting the richness and expressiveness of subsequent deep features. Figure 6 demonstrates that the improved model has solved this problem, while simultaneously increasing the feature map diversity.
To evaluate the real inference effect, comparisons are made between the TSM–ResNet50 model and the improved TSM model. For an intuitive comparison, the behaviors are sorted in descending order of likelihood and displayed within the corresponding detection boxes, as shown in Figure 7. The probability of the driving behavior is represented by the action probability, which uses a sigmoid function to map behavior classification scores in the range (0,1). When the action probability is less than 0.5, the model predicts that the target does not exhibit the behavior, indicating a negative confidence score. As the action’s probability increases, the model interprets this as a higher likelihood of the target exhibiting the behavior.
From Figure 7, it is evident that the TSM–ResNet50 model inaccurately recognizes the left driver’s sitting behavior (indicated by the red box), suggesting instead that the driver is more likely standing. In contrast, the improved TSM model demonstrates a greater likelihood of correctly identifying the sitting behavior compared to standing. Although the standing confidence level remains relatively high, this suggests that the improved model is moving in the right direction, towards more accurate behavior recognition.
The experiments also involved a class-wise accuracy comparison across different behavior categories, as shown in Table 9.
The improved TSM model achieves performance gains in all the categories, with the most notable improvement in the driving category, where the accuracy reaches 91.35%, representing an increase of 9.21 percentage points. This improvement is primarily attributed to the collaborative modeling of the bounding box expansion and the CA attention mechanism, which effectively captures contextual information from the dashboard background. In contrast, the model performs relatively less well in regard to abnormal behaviors, such as playing on a phone and smoking, suggesting that further optimization is needed to better address the unique characteristics of these behaviors.

4.3.4. Multi-Source Feature Fusion Inference Results

To validate the effectiveness of the proposed multi-source feature fusion inference method, a comparison is conducted between the pre-optimization and post-optimization decision-making processes. Prior to decision optimization, the output of the method consists of normalized confidence scores for behaviors, which are not directly suitable for anomaly detection. Consequently, a decision function must be applied to transform these confidence scores into prediction results.
$$D_{pre} = \begin{cases} \text{smoking}, & \text{if } C_b(\text{smoking}) > C_b(\text{playing on a phone}) \ \text{and} \ C_b(\text{smoking}) > \varphi_{pre} \\ \text{playing on a phone}, & \text{if } C_b(\text{playing on a phone}) > C_b(\text{smoking}) \ \text{and} \ C_b(\text{playing on a phone}) > \varphi_{pre} \end{cases}$$
where $D_{pre}$ represents the anomaly behavior classification without employing multi-source feature fusion. The optimal parameter $\varphi_{pre} = 0.65$ is determined through the use of a grid search method. In contrast, the method utilizing multi-source feature fusion initially decouples the driving state and, subsequently, evaluates it using the AFAR. The test data are derived from the SCA dataset, comprising a total of 954 samples, including 306 anomaly samples. Among these anomaly samples, 184 instances involve a person smoking and 122 instances involve a person playing on a phone. The relevant evaluation metrics and detailed test data are summarized in Table 10.
To evaluate the reasoning effectiveness of the optimized decision-making method, two surveillance video segments featuring a person smoking and a person playing on a phone are employed for testing. As illustrated in Figure 8a, a driver is observed holding a cigarette. However, due to the absence of critical smoking action features, the action score remains low. Consequently, after feature fusion, the behavior is correctly classified as normal. In Figure 8b, a driver exhibits key action characteristics indicative of smoking, which increases the action score, thereby elevating the overall behavioral score beyond the predefined threshold, resulting in successful anomaly detection. Moreover, fundamental mutually exclusive behaviors, such as sitting and non-driving, are accurately identified. Prior to the application of feature fusion, the lack of auxiliary judgment based on object features often led to inaccurate behavior categorization by the model. Figure 8c depicts a typical ship’s bridge scene filled with electronic instruments, where a driver is partially obscured by another passing individual, while positioned at the control panel. The proposed feature fusion approach accurately identifies both the driver and the mobile phone by considering the driver’s actions and their association with object features comprehensively. Consequently, the driver’s behavior is successfully categorized as standing, distracted driving (playing on a phone).
Furthermore, a surveillance video containing multiple drivers is selected for analysis to evaluate the effectiveness of the proposed method in crowded scenarios. As illustrated in Figure 9, the method demonstrates strong capabilities in regard to person localization and behavior recognition, even under conditions involving severe occlusion. In Figure 9a, despite the mobile phone being nearly indistinguishable from the background, it is successfully detected. By combining this detection with the distinct postural cue of the driver bowing their head, the system accurately identifies two drivers playing on a phone. Similarly, in Figure 9b, the method detects a cigarette near the crew member’s mouth. By incorporating contextual information from the control panel, the model correctly classifies the behavior as standing–distracted driving (smoking).
These results demonstrate the robustness and practical applicability of the proposed approach in complex operational environments. The multi-level feature fusion strategy effectively addresses key challenges, such as target occlusion and overlap and weakened feature representation in dense scenes.

5. Conclusions

This paper presents a novel rough set-driven cross-modal framework for the intelligent recognition of multi-driver behaviors in ship bridge scenarios. Our work makes three foundational contributions to maritime behavior analysis: (1) Original algorithmic integration. We propose an architecture combining ByteTrack multi-driver tracking with an improved TSM, enhanced by a spatiotemporal-aware CA mechanism. This design achieves an 81.37% AFAR and an 80.66% CCA, significantly outperforming methods relying solely on the visual modality. (2) Pioneering feature object association. We introduce a rough set-based joint inference model that establishes spatiotemporal correlation rules between personnel and objects. This innovation enables interpretable reasoning for abnormal behaviors like smoking or playing on a phone, addressing a critical gap in maritime safety systems. (3) A benchmark dataset and metrics. The SCA dataset, the first publicly available ship bridge behavior dataset, with 4788 annotated sequences and 3400 feature object images, provides a foundation for reproducible research. The proposed AFAR metric offers a novel evaluation protocol for anomaly-focused maritime applications. By unifying rough set theory, spatiotemporal perception, and object semantics, this work establishes a new paradigm for intelligent ship bridge monitoring, with implications for autonomous navigation and human–AI collaboration in maritime environments.
Future research will focus on the following aspects: (1) Given the critical importance of datasets for model training, the current sample size of the SCA dataset is insufficient. In the future, it is necessary to collect more data under extreme weather conditions and in nighttime scenarios, and introducing Generative Adversarial Networks (GANs) for data augmentation could enhance the generalization capabilities. Additionally, this research utilizes a default frame sampling rate; the experimental findings indicate that increasing the sampling frame rate can effectively improve the temporal modeling performance. Therefore, future dataset expansions should prioritize samples with higher frame rates. (2) The existing system primarily relies on visual modalities and could benefit from the integration of multi-modal data, such as millimeter-wave radar and cabin sensors. Exploring the application of voiceprint recognition technology to detect abnormal sounds and incorporating eye-tracking techniques to assess driver distraction levels are promising avenues for enhancing system robustness and reliability. (3) Considering the characteristics of embedded devices onboard ships, future research should focus on Neural Architecture Search (NAS) technologies to design lightweight models specifically tailored for maritime applications. Investigating model quantization and knowledge distillation strategies could further compress models, while maintaining high accuracy, making them suitable for deployment in resource-constrained environments.

Author Contributions

Conceptualization, C.C. and Y.W.; methodology, Y.W.; software, F.M.; validation, Z.S. and Y.W.; formal analysis, C.C.; investigation, F.M.; resources, F.M.; data curation, Y.W.; writing—original draft preparation, C.C. and Y.W.; writing—review and editing, C.C., F.M., and Y.W.; visualization, Z.S.; supervision, F.M.; project administration, C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 52201415), the Fund of State Key Laboratory of Maritime Technology and Safety (No. 16-10-1), and the National Key R&D Program of China (Grant No. 2023YFB4302300).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the School of Computer Science and Engineering, Wuhan Institute of Technology, and the State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

References

  1. Marine Accident Investigation Branch. Report on the Collision Between James 2 and Vertrouwen, Sussex Bay, 6 August 2017 (Report No. 3/2018). UK Government. 2018. Available online: https://www.gov.uk/maib-reports/collision-between-fishing-vessel-vertrouwen-and-motor-cruiser-james-2-resulting-in-motor-cruiser-sinking-with-loss-of-3-lives (accessed on 7 March 2018).
  2. Maritime and Coastguard Agency (MCA). MGN 638 (M+F): Human Element Guidance—Part 3: Distraction—The Fatal Dangers of Mobile Phones and Other Personal Devices When Working. UK Government. 2023. Available online: https://www.gov.uk/government/publications/mgn-638-mf-amendment-1-human-element-guidance-part-3-distraction/mgn-638-mf-amendment-1-human-element-guidance-part-3-distraction-the-fatal-dangers-of-mobile-phones-and-other-personal-devices-when-working (accessed on 13 March 2024).
  3. Zhang, J.; Wu, Z.; Li, F.; Luo, J.; Ren, T.; Hu, S.; Li, W.; Li, W. Attention-Based Convolutional and Recurrent Neural Networks for Driving Behavior Recognition Using Smartphone Sensor Data. IEEE Access 2019, 7, 148031–148046. [Google Scholar] [CrossRef]
  4. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. SlowFast Networks for Video Recognition. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6201–6210. [Google Scholar]
  5. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Kim, T.-K. Multiple Object Tracking: A Literature Review. Artif. Intell. 2021, 293, 103448. [Google Scholar] [CrossRef]
  6. Wojke, N.; Bewley, A.; Paulus, D. Simple Online and Realtime Tracking with a Deep Association Metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  7. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. In Proceedings of the Computer Vision—ECCV 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  8. Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8834–8844. [Google Scholar]
  9. Sun, P.; Jiang, Y.; Zhang, R.; Xie, E.; Cao, J.; Hu, X.; Kong, T.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple-Object Tracking with Transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  10. Hong, S.; Zhan, W.; Dong, T.; She, J.; Min, C.; Huang, H.; Sun, Y. A Recognition Method of Bactrocera Minax (Diptera: Tephritidae) Grooming Behavior via a Multi-Object Tracking and Spatio-Temporal Feature Detection Model. J. Insect Behav. 2022, 35, 67–81. [Google Scholar] [CrossRef]
  11. Wang, H.; Jin, L.; He, Y.; Huo, Z.; Wang, G.; Sun, X. Detector–Tracker Integration Framework for Autonomous Vehicles Pedestrian Tracking. Remote Sens. 2023, 15, 208. [Google Scholar] [CrossRef]
  12. Sukkar, M.; Shukla, M.; Kumar, D.; Gerogiannis, V.C.; Kanavos, A.; Acharya, B. Enhancing Pedestrian Tracking in Autonomous Vehicles by Using Advanced Deep Learning Techniques. Information 2024, 15, 104. [Google Scholar] [CrossRef]
  13. Zhao, H.; Wei, G.; Xiao, Y.; Xing, X. Multi-Ship Tracking by Robust Similarity Metric. In Proceedings of the 2023 IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023; pp. 2151–2156. [Google Scholar]
  14. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]
  15. Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar]
  16. Qiu, Z.; Yao, T.; Mei, T. Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542. [Google Scholar]
  17. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 July 2018; pp. 6450–6459. [Google Scholar]
  18. Hao, F.-F.; Liu, J.; Chen, X.-D. A Review of Human Behavior Recognition Based on Deep Learning. In Proceedings of the 2020 International Conference on Artificial Intelligence and Education (ICAIE), Tianjin, China, 26–28 June 2020; pp. 19–23. [Google Scholar]
  19. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Van Gool, L. Temporal Segment Networks for Action Recognition in Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2740–2755. [Google Scholar] [CrossRef] [PubMed]
  20. Lin, J.; Gan, C.; Han, S. TSM: Temporal Shift Module for Efficient Video Understanding. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7082–7092. [Google Scholar]
  21. Rodrigues, N.R.P.; Da Costa, N.M.C.; Melo, C.; Abbasi, A.; Fonseca, J.C.; Cardoso, P.; Borges, J. Fusion Object Detection and Action Recognition to Predict Violent Action. Sensors 2023, 23, 5610. [Google Scholar] [CrossRef] [PubMed]
  22. Xue, H.; Chen, D.; Fan, L.; Mao, Z.; Zhao, J.; Chen, Z. A Ship Pilot On-Duty State Recognition Model Based SlowFast Network and Attention Mechanism. In Proceedings of the 2023 7th International Conference on Transportation Information and Safety (ICTIS), Xi’an, China, 4–6 August 2023; pp. 147–151. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
  25. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2019, arXiv:1905.11946. [Google Scholar] [CrossRef]
  26. Luo, C.-Y.; Cheng, S.-Y.; Xu, H.; Li, P. Human Behavior Recognition Model Based on Improved EfficientNet. Procedia Comput. Sci. 2022, 199, 369–376. [Google Scholar] [CrossRef]
  27. Zhou, A.; Ma, Y.; Ji, W.; Zong, M.; Yang, P.; Wu, M.; Liu, M. Multi-Head Attention-Based Two-Stream EfficientNet for Action Recognition. Multimed. Syst. 2023, 29, 487–498. [Google Scholar] [CrossRef]
  28. Khan, T.; Choi, G.; Lee, S. EFFNet-CA: An Efficient Driver Distraction Detection Based on Multiscale Features Extractions and Channel Attention Mechanism. Sensors 2023, 23, 3835. [Google Scholar] [CrossRef] [PubMed]
  29. Pawlak, Z. Rough Set Theory and Its Applications to Data Analysis. Cybern. Syst. 1998, 29, 661–688. [Google Scholar] [CrossRef]
  30. Senthilnathan, C.R. Understanding Retail Consumer Shopping Behaviour Using Rough Set Approach. Int. J. Rough Sets Data Anal. 2016, 3, 38–50. [Google Scholar] [CrossRef]
  31. Henry, C. Reinforcement Learning in Biologically-Inspired Collective Robotics: A Rough Set Approach. Masters Abstr. Int. 2006, 45, 1663. [Google Scholar]
  32. Chu, X.; Sun, B.; Chu, X.; Wang, L.; Bao, K.; Chen, N. Multi-modal and Multi-criteria Conflict Analysis Model Based on Deep Learning and Dominance-based Rough Sets: Application to Clinical Non-parallel Decision Problems. Inf. Fusion 2025, 113, 102636. [Google Scholar] [CrossRef]
  33. Sanchez-Riera, J.; Hsiao, Y.; Lim, T.; Hua, K.-L.; Cheng, W.-H. A Robust Tracking Algorithm for 3D Hand Gesture with Rapid Hand Motion Through Deep Learning. In Proceedings of the 2014 IEEE International Conference on Multimedia & Expo Workshops, Chengdu, China, 14–18 July 2014. [Google Scholar] [CrossRef]
  34. Bell, D.A.; Beck, A.; Miller, P.; Wu, Q.X.; Herrera, A. Video Mining-Learning Patterns of Behaviour via an Intelligent Image Analysis System. In Proceedings of the Seventh International Conference on Intelligent Systems Design and Applications (ISDA 2007), Rio de Janeiro, Brazil, 20–24 October 2007; pp. 460–464. [Google Scholar] [CrossRef]
  35. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Li, F.-F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  36. Zhang, Y.; Li, X.; Liu, C.; Shuai, B.; Zhu, Y.; Brattoli, B.; Chen, H.; Marsic, I.; Tighe, J. VidTr: Video Transformer without Convolutions. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 13557–13567. [Google Scholar]
  37. Zhao, Z.; Huang, B.; Xing, S.; Wu, G.; Qiao, Y.; Wang, L. Asymmetric Masked Distillation for Pre-training Small Foundation Models. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 18516–18526. [Google Scholar]
  38. Zhang, K.; Lyu, M.; Guo, X.; Zhang, L.; Liu, C. Temporal Shift Module-Based Vision Transformer Network for Action Recognition. IEEE Access 2024, 12, 47246–47257. [Google Scholar] [CrossRef]
Figure 1. The framework of the multi-person behavior recognition algorithm.
Figure 2. The network structure of an improved target behavior recognition module.
Figure 3. The performance of different backbone networks based on the SCA dataset. GMAC (Giga Multiply–Accumulate Operations) serves as a metric for quantifying the computational complexity, while Top-1 Accuracy represents the percentage of instances when the model’s primary (highest probability) prediction correctly matches the actual label.
Figure 4. Examples from the SCA dataset: (a) the left side shows the ship bridge, and the right side displays samples of driver behavior; and (b) a sequence of frames from a walking sample.
Figure 5. Variation of metrics with parameter changes: (a) impact on metrics with the confidence threshold fixed at 0.2, while varying the IoU threshold; and (b) impact on metrics with the IoU threshold fixed at 0.2, while varying the confidence threshold.
Figure 6. Comparison of various networks’ shallow feature maps: (a) original samples; (b) TSM–ResNet50; (c) TSM–ResNet101; and (d) the improved TSM.
Figure 7. Comparison of video inference results.
Figure 8. Behavior inference results: (a) standing, normal; (b) sitting, non-driving (smoking); and (c) standing, distracted driving (playing on a phone).
Figure 9. Inference results in crowded driving scenarios: (a) playing on a phone; and (b) smoking.
Table 1. The backbone network.

Stage (i) | Operator (F_i)  | Resolution (H_i × W_i) | Channels (C_i) | Layers (L_i)
1         | Conv3 × 3       | 448 × 448              | 32             | 1
2         | MBConv1, k3 × 3 | 224 × 224              | 16             | 2
3         | MBConv6, k3 × 3 | 224 × 224              | 24             | 3
4         | MBConv6, k5 × 5 | 112 × 112              | 40             | 3
5         | MBConv6, k3 × 3 | 56 × 56                | 80             | 4
6         | MBConv6, k5 × 5 | 28 × 28                | 112            | 4
7         | MBConv6, k5 × 5 | 28 × 28                | 192            | 5
8         | MBConv6, k3 × 3 | 14 × 14                | 320            | 2
9         | Conv1 × 1       | 14 × 14                | 1280           | 1
10        | CA_Block        | 14 × 14                | 1280           | 1
11        | Pooling & FC    | 14 × 14                | 7              | 1
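As a concrete illustration of stage 10 in Table 1, the following PyTorch sketch shows how a coordinate-attention (CA) block can be appended to the 14 × 14 × 1280 feature map produced by stage 9. The reduction ratio and internal layer composition are assumptions for illustration; only the placement of the block between the 1 × 1 convolution and the pooling/FC head follows the table.

```python
# Minimal coordinate-attention block sketch, placed after the 1x1 conv of an
# EfficientNet-style backbone as in Table 1 (stage 10). Internal details are
# illustrative assumptions.
import torch
import torch.nn as nn

class CABlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        # Direction-aware pooling: one descriptor per row and per column.
        x_h = x.mean(dim=3, keepdim=True)                        # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (n, c, w, 1)
        y = self.act(self.bn(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (n, c, 1, w)
        return x * a_h * a_w                                         # attention-weighted features

# Shape check against Table 1: stage 10 keeps the 14 x 14 x 1280 resolution.
feat = torch.randn(1, 1280, 14, 14)
print(CABlock(1280)(feat).shape)   # torch.Size([1, 1280, 14, 14])
```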
Table 2. The decision table.

Condition attributes:
  Confidence of associated feature objects: C_o^cigarette ∈ [0, 1]; C_o^phone ∈ [0, 1]
  Behavior confidence: C_b^driving, C_b^smoking, C_b^(playing on a phone), C_b^sitting, C_b^standing, C_b^walking, C_b^lying, each ∈ [0, 1]
  Coverage rate of anomalous features: S_o^cigarette ∈ [0, 1]; S_o^phone ∈ [0, 1]
Decision attributes:
  Anomalous behavior: D_a ∈ {non-driving (smoking), non-driving (playing on a phone), distracted driving (smoking), distracted driving (playing on a phone)}
  Basic mutually exclusive behavior: D_b ∈ {sitting, standing, walking, lying}
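The decision table above can be read as a set of if–then rules mapping condition attributes to decision attributes. The short Python sketch below evaluates two such rules; the thresholds and the specific rules shown are hypothetical examples, whereas the rules actually used by the framework are induced from the decision table through rough set reduction.

```python
# Illustrative evaluation of decision rules over the attributes in Table 2.
# Thresholds and rule bodies are hypothetical examples for explanation only.
def decide(cond: dict) -> dict:
    """cond holds condition attributes: Co_*, Cb_*, So_*, all in [0, 1]."""
    # Basic mutually exclusive behavior D_b: argmax over posture confidences.
    basic = max(("sitting", "standing", "walking", "lying"),
                key=lambda b: cond.get(f"Cb_{b}", 0.0))

    anomalous = None
    driving = cond.get("Cb_driving", 0.0) >= 0.5
    # Example rule: a cigarette seen with sufficient confidence and coverage
    # supports a smoking decision, qualified by whether the person is driving.
    if cond.get("Co_cigarette", 0.0) >= 0.2 and cond.get("So_cigarette", 0.0) >= 0.5:
        anomalous = "distracted driving (smoking)" if driving else "non-driving (smoking)"
    elif cond.get("Co_phone", 0.0) >= 0.2 and cond.get("So_phone", 0.0) >= 0.5:
        anomalous = ("distracted driving (playing on a phone)" if driving
                     else "non-driving (playing on a phone)")

    return {"D_b": basic, "D_a": anomalous}

print(decide({"Cb_standing": 0.82, "Cb_driving": 0.61,
              "Co_phone": 0.31, "So_phone": 0.7}))
```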
Table 3. Sample distribution of the SCA dataset.

Behavior           | Number of Samples
Sitting            | 764
Standing           | 728
Walking            | 652
Lying              | 660
Driving            | 452
Smoking            | 920
Playing on a phone | 612
Table 4. Parameter settings.

Category              | Parameter     | Value
Multi-object tracking | Conf_thres    | 0.1–0.7
                      | IOU_thres     | 0.1–0.7
Behavior recognition  | Input_size    | 448 px × 448 px
                      | Num_segment   | 8
                      | Learning_rate | 2 × 10^−5
                      | Batch_size    | 1
                      | Dropout       | 0.8
                      | Epoch         | 50
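For reference, the settings in Table 4 can be collected into a configuration structure such as the one sketched below. The grouping and key names are assumptions for illustration only; the fixed threshold values shown correspond to the operating points used in Figure 5.

```python
# A compact configuration record mirroring Table 4. Key names and grouping
# are illustrative assumptions, not the authors' actual configuration schema.
CONFIG = {
    "tracking": {                 # ByteTrack association thresholds (swept over 0.1-0.7)
        "conf_thres": 0.2,        # fixed value used while sweeping IoU in Figure 5a
        "iou_thres": 0.2,         # fixed value used while sweeping confidence in Figure 5b
    },
    "behavior": {                 # improved-TSM training hyperparameters
        "input_size": (448, 448),
        "num_segments": 8,
        "learning_rate": 2e-5,
        "batch_size": 1,
        "dropout": 0.8,
        "epochs": 50,
    },
}
```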
Table 5. Comparative results of personnel and feature object detection using YOLOv7 series models.

Model      | Image Size        | Precision | Recall | mAP@0.5 | mAP@0.5:0.95 | Latency (ms)
YOLOv7     | 640 px × 640 px   | 0.7454    | 0.6806 | 0.7029  | 0.4086       | 12.5
YOLOv7-w6  | 1280 px × 1280 px | 0.812     | 0.726  | 0.78    | 0.503        | 30.9
YOLOv7-e6  | 1280 px × 1280 px | 0.822     | 0.745  | 0.781   | 0.514        | 49.3
YOLOv7-d6  | 1280 px × 1280 px | 0.827     | 0.7434 | 0.7964  | 0.5101       | 60.6
YOLOv7-e6e | 1280 px × 1280 px | 0.838     | 0.703  | 0.776   | 0.513        | 71.6
Table 6. Comparison of different models. GMAC values represent billions of multiply–accumulate operations.

Model        | Backbone        | GMAC   | Parameters/M | Top-1 Accuracy/%
TSM          | ResNet50        | 132.17 | 23.52        | 77.79
             | ResNet101       | 251.62 | 42.51        | 81.4
             | EfficientNet-B0 | 13.07  | 4.02         | 76.34
             | EfficientNet-B1 | 19.26  | 6.52         | 80.65
             | EfficientNet-B2 | 22.16  | 7.71         | 78.50
             | EfficientNet-B3 | 32.23  | 10.71        | 80.11
             | EfficientNet-B4 | 50     | 17.56        | 77.42
SlowFast     | ResNet50        | 101.16 | 33.66        | 75.1
             | ResNet101       | 163.88 | 52.87        | 76.66
Improved TSM | EfficientNet-B3 | 32.28  | 10.93        | 82.1
Table 7. Ablation study results. A checkmark symbol (√) denotes the inclusion of a component in the model configuration, while empty cells indicate its exclusion.

Component                 | M1    | M2    | M3
TSM–EfficientNet-B3       | √     | √     | √
+Image Augmentation       |       | √     | √
+CA Attention             |       |       | √
Computational Cost (GMAC) | 32.23 | 32.23 | 32.26
Top-1 Accuracy (%)        | 79.03 | 80.32 | 82.1
Table 8. Comparative results based on public datasets.

Method       | Backbone Network         | Parameters/M | UCF101/% | HMDB51/%
TSN [19]     | InceptionV2              | 10.7         | 84.4     | 53.3
I3D [15]     | InceptionV2              | 25           | 95.6     | 74.8
TSM [20]     | ResNet50                 | 23.52        | 96.0     | 73.2
VidTr [36]   | VidTr-L                  | 86           | 96.7     | 74.4
AMD [37]     | ViT-B                    | 87           | 97.1     | 79.6
Improved TSM | Improved EfficientNet-B3 | 10.93        | 96.7     | 75.1
Table 9. Comparison of class-wise accuracy for different models.

Behavior           | SlowFast–ResNet50 (%) | TSM–ResNet50 (%) | Improved TSM (%)
Sitting            | 83.24                 | 84.53            | 89.27
Standing           | 82.45                 | 85.08            | 88.63
Walking            | 81.03                 | 83.47            | 87.15
Lying              | 84.12                 | 85.62            | 88.29
Driving            | 80.58                 | 82.14            | 91.35
Playing on a phone | 58.31                 | 62.08            | 65.42
Smoking            | 60.17                 | 64.25            | 68.09
Top-1 accuracy     | 75.10                 | 77.79            | 82.10
Table 10. Comparison of anomalous behavior detection results using different methods.

Method                      | AFAR/% (All) | AFAR/% (Smoking) | AFAR/% (Playing on a Phone) | CCA/%
TSM–ResNet50                | 74.46        | 76.55            | 71.3                        | 80.36
Improved TSM                | 76.3         | 77.57            | 74.38                       | 81.84
Multi-Source Feature Fusion | 83.37        | 84.47            | 81.69                       | 83.66
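For reproducibility, the following sketch shows how anomaly-focused metrics of the kind reported in Table 10 could be computed, assuming AFAR is the classification accuracy restricted to the anomalous classes (smoking and playing on a phone) and CCA is the overall correct classification accuracy; both assumptions should be checked against the definitions given earlier in the paper.

```python
# Illustrative metric computation under the stated assumptions about AFAR and CCA.
ANOMALOUS = {"smoking", "playing on a phone"}

def afar(y_true: list, y_pred: list) -> float:
    """Accuracy restricted to samples whose ground truth is an anomalous class."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t in ANOMALOUS]
    return sum(t == p for t, p in pairs) / max(1, len(pairs))

def cca(y_true: list, y_pred: list) -> float:
    """Overall correct classification accuracy across all samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / max(1, len(y_true))

y_true = ["smoking", "sitting", "playing on a phone", "standing"]
y_pred = ["smoking", "sitting", "standing", "standing"]
print(f"AFAR = {afar(y_true, y_pred):.2f}, CCA = {cca(y_true, y_pred):.2f}")
```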