CDPA-Net: An Indoor Work Sites Smoking Detection Framework Based on Contour-Driven Pose-Aware Feature Learning

Wang, Meng; Li, Mei; He, Chao

doi:10.3390/math14091462

by Meng Wang ^1,2,3, Mei Li ^1,* and Chao He ^4

^1 School of Artificial Intelligence, China University of Geosciences (Beijing), Beijing 100083, China
^2 Hebei Technology Innovation Center for Key Components of Climbing Robots, Tangshan Polytechnic University, Tangshan 063299, China
^3 Beijing Qishan Chuangzhi Technology Co., Ltd., Beijing 100192, China
^4 State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing 100044, China
^* Author to whom correspondence should be addressed.

Mathematics 2026, 14(9), 1462; https://doi.org/10.3390/math14091462

Submission received: 29 March 2026 / Revised: 19 April 2026 / Accepted: 23 April 2026 / Published: 26 April 2026

Abstract

Smoking detection in indoor work sites is challenging due to posture variability, object occlusion, poor lighting, and the small size of cigarettes. These factors hinder the extraction of reliable pose-aware features, such as hand–cigarette orientation and contours, which are critical for smoking detection. Current mainstream detectors, such as YOLO-based methods, often fail to capture pose-aware features under cluttered and low-visibility conditions. To address this, we propose the Contour-Driven Pose-Aware Network (CDPA-Net), which explicitly captures contour orientation and high-frequency appearance cues for robust smoking detection. Specifically, the Orientation-Driven Contour Extractor (ODCE) employs a Nonsubsampled Contourlet Transform to capture direction-sensitive posture and contour features, effectively suppressing background clutter. Additionally, the Frequency-Sensitive Attention Block (FSAB) highlights high-frequency discriminative signals under dim light via frequency-domain self-attention. Moreover, the Multi-Scale Frequency Integration Module (MFIM) fuses structural and spectral cues across scales to reinforce pose-aware representation. Experiments on both a public and a custom industrial dataset show that CDPA-Net achieves 89.2% mAP50 at 112 FPS. This work provides a lightweight, interpretable, and accurate solution for smoking detection in industrial monitoring applications.

Keywords:

small object detection; contourlet transform; pose-aware feature; feature fusion; smoking detection

MSC:

68U10; 68T45; 68T07

1. Introduction

Smoking in indoor work sites poses serious safety risks, particularly in areas with flammable materials [1]. Timely and accurate detection of smoking behavior is essential for ensuring workplace safety [2]. However, manual inspections are inefficient and limited. Conventional object detectors, including YOLO-based approaches, often struggle to meet the demands of dynamic and cluttered monitoring environments [3]. These limitations highlight the urgent need for automated and intelligent surveillance solutions in industrial settings [4].

Indoor work sites present unique visual challenges for smoking detection. As illustrated in Figure 1, the environment suffers from complex backgrounds, varying lighting conditions, and non-uniform smoking postures [5]. Smoking targets are typically minute, visually similar to background elements, and frequently obscured by hands or industrial tools [1,6]. Although YOLO and Transformer-based models have achieved success in controlled settings [2,7], and relational modeling has been introduced via human–object interaction (HOI) methods like HOLT-Net [8], their performance often degrades in industrial scenarios. Most models are optimized for daily-life datasets, whereas indoor work sites present extreme occlusion and dim lighting. Under these conditions, such models struggle to extract robust and reliable pose-aware features [9].

Crucially, the absence of frequency-aware modeling limits conventional detectors in capturing fine-grained pose features, such as cigarette orientation and finger contours, which are often obscured by noise and lighting fluctuations in cluttered sites [10]. Frequency-domain cues are vital for describing these structural patterns. These include edge continuity and localized high-frequency energy. However, spatial-domain methods fail to exploit them sufficiently [11]. As shown in Figure 2, frequency decomposition uncovers discriminative directional textures aligned with smoking postures. These cues are typically suppressed in the spatial domain, yet they are critical for robust detection under the multi-source interference prevalent in industrial environments [5,12].

We summarize the technical challenges of indoor work site smoking detection as follows:

Pose-Aware Interference from Complex Backgrounds: Smoking actions often involve subtle gestures that are easily obscured by tools or machinery in complex backgrounds. Most existing detectors rely on coarse appearance features. They lack directional or contour-based modeling. This makes it hard to tell smoking postures apart from structurally similar background noise [13].
Low Visibility in Dim Lighting: Poor lighting degrades edge clarity and contrast. Most existing detectors operate in the spatial domain and fail to extract meaningful features when low-level textures are weak. They lack frequency-domain mechanisms that can capture informative high-frequency cues under dim conditions [14].
Ineffective Multi-Scale Pose Feature Integration: Smoking detection depends on both local (cigarette contour) and global (hand–mouth alignment) cues. Traditional models lack specialized modules for fusing pose-aware features across scales, leading to fragmented representations and poor small-object localization [15].

To address these critical challenges, we propose the Contour-Driven Pose-Aware Network (CDPA-Net), a specialized detection framework designed for smoking behavior recognition in indoor work sites. CDPA-Net introduces a novel combination of spatial and frequency-domain cues. It focuses on extracting pose-aware features that are robust to cluttered backgrounds, lighting variation, and object occlusion. In contrast to conventional detectors that rely solely on spatial appearance, CDPA-Net employs contour-oriented transforms and frequency-aware attention to capture hand posture, cigarette orientation, and boundary continuity with higher fidelity.

Specifically, CDPA-Net incorporates three task-driven modules. The Orientation-Driven Contour Extractor (ODCE) applies a Nonsubsampled Contourlet Transform (NSCT) to extract shape-aware and direction-sensitive features. This enables the model to isolate smoking-related contours and directional cues from surrounding noise, addressing posture ambiguity and object overlap. The Frequency-Sensitive Attention Block (FSAB) transforms features into the frequency domain via discrete cosine transform (DCT) and enhances high-frequency responses by channel-wise attention. This module compensates for degraded visibility in dim environments and reinforces discriminative details in noisy conditions. The Multi-Scale Frequency Integration Module (MFIM) fuses the outputs of ODCE and FSAB across multiple scales. It integrates structural and frequency-based features into a unified representation. This enhances the model’s ability to detect small smoking-related targets under multi-scale variations. These components work synergistically. ODCE captures posture-aligned structure, FSAB extracts discriminative frequency cues under poor lighting, and MFIM bridges both to achieve robust pose-aware fusion. This integrated design allows CDPA-Net to outperform previous methods in two key aspects. It excels at detecting small objects and resists environmental interference effectively. The main contributions of this work are:

Pose-Aware Detection Architecture: We propose CDPA-Net, a contour-driven spectral–spatial model that effectively extracts pose-aware features from cluttered industrial environments.
Direction-Sensitive Contour Extraction: We design the ODCE module leveraging NSCT to capture non-grid-aligned smoking contours. It is more interpretable than standard convolutions.
Frequency-Aware Enhancement and Fusion: We introduce FSAB and MFIM to adaptively amplify high-frequency signals and bridge the semantic gap across resolutions. This ensures stable detection under poor lighting.

The workflow of CDPA-Net is shown in Figure 3. The rest of this paper is organized as follows: Section 2 reviews existing research algorithms and highlights the primary challenges in smoking detection. Section 3 elaborates on the structure of the proposed CDPA-Net model, detailing its modular design and implementation. Section 4 presents the experimental setup and compares CDPA-Net with state-of-the-art (SOTA) methods. Section 5 presents the experimental findings and outlines the deployment strategy for cloud-edge engineering applications. Finally, Section 6 concludes this research and offers future directions for indoor work site smoking detection.

2. Related Works

2.1. Small Object Detection

Small object detection remains one of the most persistent challenges in computer vision, particularly in cluttered and dynamic environments like indoor work sites [10]. Smoking targets in indoor work sites are often smaller than 50 × 50 pixels, so detecting them is a form of small object detection [16]. These targets are often partially occluded by fingers or tools and are visually similar to background textures. Such conditions complicate the extraction of discriminative features for traditional detectors, frequently leading to missed detections or false alarms. The YOLO series has long been a benchmark for real-time object detection [17]. From YOLOv8 to YOLOv12, continuous improvements have been made in terms of accuracy, speed, and deployment efficiency across diverse domains [18,19,20,21,22]. To better detect small-scale features in noisy scenes, ref. [23] incorporated the Mamba structure into YOLO, which expands the receptive field and enhances the perception of fine-grained details.

Transformer-based models have significantly advanced small object detection and behavior recognition [24]. Combining Transformers with wavelet decomposition improves the extraction of structural and shape features [25]. For instance, wavelet-based downsampling preserves critical details during attention learning [26]. Similarly, contourlet-based compression allows feature maps to be reconstructed without losing data integrity [13]. These designs are highly effective at retaining contour edges and textural variations [27]. Recent studies also improve detection through modularized heads and multi-level fusion [28,29,30]. Training strategies like dynamic label assignment and denoised distillation further boost precision [31,32]. Adaptive sampling in deformable attention manages multi-scale information efficiently without slowing inference [33]. Industrial applications specifically require models to be both lightweight and robust [12,34].

Despite these gains, most methods struggle in unstructured indoor work sites. These environments feature dim lighting, heavy clutter, and frequent occlusions [35]. Such factors severely degrade the visibility and saliency of small targets like cigarettes [36]. Subtle smoking gestures remain difficult to detect under these conditions. Furthermore, standard models often fail to generalize due to training on clean public datasets [14]. There is an urgent need for task-specific, domain-adaptive models. These systems must maintain high precision across diverse and complex industrial conditions [37].

2.2. Smoking Detection

Smoking detection, as a specialized subtask of behavioral recognition, has received growing attention due to its critical importance in public safety and industrial regulation [15]. To address these requirements, various deep learning models have been developed for diverse scenarios. For instance, in industrial environments such as chemical plants, Smoking-YOLOv8 extends the YOLOv8 architecture by incorporating an activity attention mechanism to better detect smoking behavior in dynamic settings [1]. Another study introduced SmokerViT, a model inspired by the Vision Transformer, which utilizes self-attention mechanisms to achieve high performance on a 1120 image dataset [7]. Similarly, Pandey et al. utilized Haar Cascade technology to localize the driver’s face before applying YOLO-NAS to detect mouth and eye regions [4]. Beyond general detection methods, interactive smoking behavior recognition has also gained prominence. In public settings, Ling et al. proposed HOLT-Net, a framework that identifies smoker–cigarette interactions from single images by integrating a HOI model and utilizing the SCAU-SD benchmark dataset [8]. However, HOI-based models often struggle in industrial work sites where cigarettes typically appear as small, occluded targets. These challenging environmental characteristics significantly weaken visual cues and limit the precision of interaction-based detection methods [38]. Furthermore, the BEC-YOLO algorithm enhances the detection of smoking targets by addressing small-object challenges in complex environments through targeted optimization strategies [39]. Overall, Transformer-based models like SmokerViT offer promising solutions via global attention, supporting high-accuracy safety monitoring in specialized scenarios [7].

Existing methods struggle in indoor work sites due to dim lighting, background clutter, and severe occlusion [40]. Models trained on daily-life datasets often fail to generalize to complex industrial environments [11]. Furthermore, pose-aware cues like hand–cigarette alignment and directional contours remain under-explored [9]. Such features are critical for distinguishing smoking from background noise and similar-looking tools [15]. These gaps highlight the need for domain-adaptive features capable of robust detection under multi-source interference [2,41]. Our work addresses this gap by explicitly modeling posture-aligned structure and frequency-domain cues for reliable smoking detection in complex indoor work sites.

3. Methodology

3.1. Dataset Construction and Analysis

We introduce the Indoor Construction Work Sites Smoking Detection (ICSSD) dataset, which contains 6890 images collected from real-world industrial scenes. The initial annotations were conducted in the VOC format. The labeling process specifically targets the cigarette region. By focusing on this localized area, the model can effectively capture critical features, including the directional orientation of the cigarette and its alignment with the mouth contour. These cues are essential for distinguishing subtle smoking behaviors from complex background clutter and tools. Each image underwent a manual annotation and cross-verification process to ensure labeling precision. The dataset was divided into training–validation and test sets in a 9:1 ratio. The training–validation set was further split into training and validation sets in a 3:1 ratio. We compared our dataset with SCAU-SD, the only public dataset in current research. The SCAU-SD dataset mainly consists of images from daily life, featuring simpler backgrounds and close-up shots. In contrast, the ICSSD dataset is entirely derived from indoor work sites, featuring highly complex and variable backgrounds, as well as difficult-to-recognize features.
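The two-stage split described above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `split_counts` and the rounding rule are assumptions, since the text gives the ratios but not the exact per-subset counts.

```python
def split_counts(total, ratio):
    """Split `total` items by an a:b ratio; rounding the first part is an assumption."""
    a, b = ratio
    first = round(total * a / (a + b))
    return first, total - first


TOTAL_IMAGES = 6890  # ICSSD size reported in the text

# Stage 1: training-validation vs. test at 9:1.
trainval, test = split_counts(TOTAL_IMAGES, (9, 1))
# Stage 2: training vs. validation at 3:1 within the training-validation set.
train, val = split_counts(trainval, (3, 1))

print(train, val, test)
```

Under this rounding rule, the three subsets hold roughly 67.5%, 22.5%, and 10% of the images; a different rounding convention would shift the training and validation counts by one image.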

The SCAU-SD dataset comprises 1200 training–validation images and 360 test images, which were primarily captured in daily life settings with clean backgrounds and close-range perspectives. In contrast, our ICSSD dataset is exclusively collected from indoor work sites. ICSSD covers highly variable and cluttered scenes. This makes ICSSD more representative of real-world industrial conditions, particularly under challenging visual interference. Moreover, the ICSSD dataset is well-suited for small object detection tasks, which are frequently encountered in work site monitoring. To illustrate this contrast, Figure 4 presents side-by-side visual comparisons. As shown, ICSSD contains more complex and dynamic backgrounds, whereas SCAU-SD is largely composed of simpler scenes with larger and more salient smoking targets.

The absolute and relative sizes of smokers in the ICSSD dataset are considerably smaller than those in the SCAU-SD dataset. The absolute size of ICSSD smokers constitutes only 3.823% of the corresponding value in SCAU-SD, while the relative size is just 2.143%. Table 1 compares the dimensions of the two datasets. The ICSSD dataset’s distribution is highly consistent with the characteristics of indoor work site scenarios.

The ICSSD dataset contains a substantial number of small objects, which are essential for effective smoking detection. Due to their limited pixel footprint, these small targets present challenges for accurate feature extraction. Figure 5 illustrates the distribution comparison between the ICSSD and SCAU-SD datasets. ICSSD features a larger data volume, with most annotated targets classified as small objects. Most of these samples are smaller than 50 × 50 pixels. It is a typical small object detection problem [16]. This distribution closely aligns with the conditions observed in real-world indoor work sites. The dataset features complex backgrounds and frequent visual ambiguities, such as those caused by work uniforms and tools. These pose significant challenges for practical deployment.

3.2. Overall Architecture of CDPA-Net

CDPA-Net is a specialized deep learning framework tailored for smoking behavior detection in indoor work sites. It is designed to tackle three core challenges in this domain: (1) severe background interference, (2) difficulty in identifying small smoking-related targets under dim light, and (3) the challenge of effectively fusing multi-scale features in dynamic environments. To address these issues, CDPA-Net adopts a multi-branch modular architecture composed of three dedicated modules, including ODCE, FSAB, and MFIM. These modules are integrated within a lightweight backbone that balances feature extraction efficiency and detection accuracy [27,42]. The structure of CDPA-Net is shown in Figure 6.

Firstly, ODCE is introduced early in the backbone to extract directional and structural features relevant to smoking behaviors. It utilizes NSCT to retain spatial resolution and applies directional filters to isolate contour edges and orientation cues from cluttered scenes. This ensures that critical hand–mouth–cigarette interactions are preserved even under occlusion or noise.

Next, FSAB enhances the discriminability of mid-level features by transforming them into the frequency domain via DCT. This operation emphasizes high-frequency components, such as cigarette edges and smoker contours. Through a channel-wise attention mechanism in the frequency domain, FSAB adaptively strengthens salient frequency components before transforming the features back to the spatial domain.

Finally, MFIM operates at the feature fusion stage in the head. It simultaneously receives high-resolution features that capture local detail and low-resolution features that encode global context. Both are compressed, aligned, and projected into the frequency domain. MFIM uses attention-based weights to integrate multi-scale frequency features and restores them via the inverse DCT. This allows the model to maintain robustness across diverse spatial resolutions. The changes in the feature maps are shown in Table 2.

3.3. ODCE Module

The extraction of pose-aware features is the cornerstone of robust smoking detection in unstructured industrial environments. However, smoking behaviors are characterized by subtle geometric structures, such as the slender cylindrical shape of cigarettes and the specific curvature of finger-holding gestures. Traditional Convolutional Neural Networks struggle to capture fine-grained directional cues. Their standard square kernels are suboptimal for representing non-grid-aligned contours in cluttered backgrounds [27]. To address this, we propose the ODCE module, which is specifically designed to isolate posture-aligned structural features from environmental noise [9].

This module leverages the NSCT to capture multi-scale and multi-directional contour information. In our implementation, NSCT utilizes fixed Maxflat filters to maintain shift invariance and provide stable geometric priors, while the differentiability of the overall module is preserved through the encompassing trainable layers. The rationale for choosing NSCT involves two critical properties: multidirectionality and shift invariance. Multidirectionality allows the model to decompose the scene into multiple precise orientations, matching the arbitrary angles of cigarette placement. Shift-invariance, achieved by avoiding downsampling, ensures that the precise spatial localization of small smoking-related targets is preserved across the feature maps.

The decomposition process of NSCT consists of two main components: the Nonsubsampled Pyramid Transform (NSP) and the Nonsubsampled Directional Filter Bank (NSDFB). The NSP decomposes the input feature $I(x, y)$ into a multiscale low-frequency subband $L_{k}(x, y)$ and high-frequency subbands $H_{k}(x, y)$:

$$I(x, y) = L_{K}(x, y) + \sum_{k = 1}^{K} H_{k}(x, y) \tag{1}$$

where K represents the number of decomposition levels. In this work, we set $K = 2$ to balance feature granularity and computational overhead.

Subsequently, the NSDFB is applied to decompose each high-frequency subband $H_{k}(x, y)$ into $D_{k}$ directional subbands $H_{k}^{(d)}(x, y)$ to capture the multi-angled edges of the hand–cigarette interaction:

$$H_{k}(x, y) = \sum_{d = 1}^{D_{k}} H_{k}^{(d)}(x, y) \tag{2}$$

where d is the directional index and $D_{k}$ denotes the number of directions at level k. By combining these processes, the input feature is represented as follows:

$$I(x, y) = L_{K}(x, y) + \sum_{k = 1}^{K} \sum_{d = 1}^{D_{k}} H_{k}^{(d)}(x, y) \tag{3}$$
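The additive structure of Equations (1)–(3) can be illustrated with a toy decomposition. The sketch below is a simplified stand-in, not the Maxflat-filter NSCT described in the text: it uses a dilated 3-tap average for the undecimated pyramid and a crude four-direction split built from line averages. All function names (`shift`, `combine`, `line_average`, `nsp`, `directional_split`, `reconstruct`) and the `DIRECTIONS` list are hypothetical. It only demonstrates that, without downsampling, the subbands keep the input size and sum back to the input.

```python
# Toy stand-in for Eqs. (1)-(3); NOT the Maxflat-filter NSCT used in the paper.

def shift(img, dy, dx):
    """Sample img at (y + dy, x + dx), clamping indices at the border."""
    h, w = len(img), len(img[0])
    return [[img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
             for x in range(w)] for y in range(h)]


def combine(terms):
    """Weighted sum of equally sized images given as (weight, image) pairs."""
    h, w = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(wt * im[y][x] for wt, im in terms) for x in range(w)]
            for y in range(h)]


def line_average(img, dy, dx):
    """3-tap average along direction (dy, dx); no resampling, so size is kept."""
    return combine([(1 / 3, shift(img, -dy, -dx)), (1 / 3, img),
                    (1 / 3, shift(img, dy, dx))])


def nsp(img, levels):
    """Undecimated pyramid: returns L_K and [H_1, ..., H_K] as in Eq. (1)."""
    low, highs = img, []
    for k in range(levels):
        step = 2 ** k  # dilate the filter instead of downsampling the image
        blurred = line_average(line_average(low, 0, step), step, 0)
        highs.append(combine([(1.0, low), (-1.0, blurred)]))  # H_k = L_{k-1} - L_k
        low = blurred
    return low, highs


DIRECTIONS = [(0, 1), (1, 1), (1, 0), (1, -1)]  # hypothetical choice, D_k = 4


def directional_split(high):
    """Split H_k into D_k oriented parts that sum back to H_k, as in Eq. (2)."""
    avgs = [line_average(high, dy, dx) for dy, dx in DIRECTIONS]
    d = len(avgs)
    mean_avg = combine([(1 / d, a) for a in avgs])
    return [combine([(1.0, a), (-1.0, mean_avg), (1 / d, high)]) for a in avgs]


def reconstruct(low, subbands):
    """Eq. (3): sum the low-frequency band and every directional subband."""
    return combine([(1.0, low)] + [(1.0, s) for level in subbands for s in level])


# 8 x 8 test image with a diagonal stroke, loosely mimicking a thin cigarette edge.
image = [[1.0 if x == y else 0.0 for x in range(8)] for y in range(8)]
low, highs = nsp(image, levels=2)  # K = 2 as in the text
subbands = [directional_split(h) for h in highs]
restored = reconstruct(low, subbands)
max_error = max(abs(restored[y][x] - image[y][x])
                for y in range(8) for x in range(8))
```

Because every subband has the same shape as the input, a small target keeps its pixel coordinates at every level, which is the shift-invariance property the text relies on. A faithful implementation would replace both filters with the pyramid and directional filter banks of the actual NSCT.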

In ODCE, low-frequency and high-frequency directional subbands are processed separately to extract complementary features. The low-frequency subband $L_{k}(x, y)$ describes the global feature distribution, e.g., the shape of the smoker. It is enhanced by a standard 2D convolution:

$$L_{k}^{\prime}(x, y) = \mathrm{Conv}(L_{k}(x, y)) \tag{4}$$

For the high-frequency subbands $H_{k}^{(d)}(x, y)$, which capture local textures and edges, directional convolution DirConv is applied to emphasize directional saliency:

$$H_{k}^{(d)\prime}(x, y) = \mathrm{DirConv}(H_{k}^{(d)}(x, y)) \tag{5}$$

Finally, the module integrates these components via the inverse wavelet transform (IWT) to generate an enhanced feature map $I^{\prime}(x, y)$ that retains spatial resolution and orientation cues:

$$I^{\prime}(x, y) = \mathrm{IWT}\left(L_{k}^{\prime}(x, y), \{H_{k}^{(d)\prime}(x, y)\}_{k, d}\right) \tag{6}$$

The workflow of ODCE is depicted in Figure 7, and the detailed structure is shown in Figure 8. The pseudo-code of the module is presented in Algorithm 1. This integrated design ensures that even under severe occlusion, the geometric “signature” of a smoking posture remains distinguishable from background noise.

Algorithm 1 Orientation-driven contour extractor (ODCE)

Require: Feature map $I(x, y) \in \mathbb{R}^{B \times C \times H \times W}$, number of levels K, number of directions D.
Ensure: Enhanced pose-aware feature map $I^{\prime}(x, y) \in \mathbb{R}^{B \times C \times H \times W}$

$I_{1}(x, y) \leftarrow I(x, y)$
for $k = 1$ to K do
    NSCT Decomposition:
        Perform NSP to obtain $\{L_{k}(x, y), H_{k}(x, y)\}$
        Apply NSDFB on $H_{k}$ to obtain $\{H_{k}^{(1)}(x, y), \dots, H_{k}^{(D)}(x, y)\}$
    Low-frequency Enhancement:
        $L_{k}^{\prime}(x, y) \leftarrow \mathrm{Conv}(L_{k}(x, y))$
    High-frequency Directional Enhancement:
        for $d = 1$ to D do
            $H_{k}^{(d)\prime}(x, y) \leftarrow \mathrm{DirConv}(H_{k}^{(d)}(x, y))$
        end for
    Feature Reconstruction:
        $I_{k + 1}(x, y) \leftarrow \mathrm{IWT}(L_{k}^{\prime}, H_{k}^{(1)\prime}, \dots, H_{k}^{(D)\prime})$
end for
Final Global Refinement:
    $I^{\prime}(x, y) \leftarrow \mathrm{Conv}(I_{K + 1}(x, y))$
return $I^{\prime}(x, y)$

3.4. FSAB Module

The visibility of smoking-related targets, such as cigarettes and hand–mouth silhouettes, is often severely compromised in indoor work sites due to dim lighting and heavy industrial dust. In these low-contrast environments, conventional spatial-domain convolutions often fail to distinguish weak object edges from background noise, as they rely on local pixel intensities that are easily corrupted by lighting fluctuations [24]. We propose FSAB to overcome this limitation. The core rationale behind FSAB is that while lighting variations predominantly affect low-frequency components, the structural details of smoking gestures are encoded in specific high-frequency spectral components. By shifting the feature learning process from the spatial domain to the frequency domain, we can implement a more robust spectral filtering mechanism that adaptively amplifies discriminative signals while suppressing environmental noise. FSAB utilizes DCT to achieve this spectral decomposition. Compared to the Fourier Transform, DCT concentrates the feature energy into a more compact set of coefficients, which facilitates the modeling of global texture patterns and structural periodicity with lower computational overhead.

For an input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, where B is the batch size, C represents the number of channels, and H and W denote the spatial dimensions, FSAB first partitions the feature map into multiple $P \times P$ sub-blocks. In this work, we set the block size to $P = 8$ to balance frequency resolution and computational efficiency. This partitioning allows the model to capture localized frequency responses while maintaining a degree of spatial context. For each sub-block $X(x, y)$, the 2D DCT is formulated as:

$$Y(u, v) = \alpha(u)\,\alpha(v) \sum_{x = 0}^{P - 1} \sum_{y = 0}^{P - 1} X(x, y) \cos\left[\frac{(2x + 1) u \pi}{2P}\right] \cos\left[\frac{(2y + 1) v \pi}{2P}\right] \tag{7}$$

In Equation (7), P denotes the block size, while u and v represent the horizontal and vertical frequency indices, respectively. The normalization scaling factor $\alpha(u)$ is defined as:

$$\alpha(u) = \begin{cases} \sqrt{\dfrac{1}{P}}, & \text{if } u = 0, \\ \sqrt{\dfrac{2}{P}}, & \text{otherwise.} \end{cases}$$

The resulting frequency components $Y(u, v)$ provide an explicit representation of the image content: low-frequency components describe the global structure and illumination, whereas high-frequency components capture fine-grained textures and edges critical for small-object detection. To achieve frequency-aware enhancement, FSAB employs a channel-wise attention mechanism in this spectral space. Specifically, global frequency statistics are extracted, and channel weights are generated through a fully connected (FC) layer followed by a Sigmoid activation $\sigma$:

$$Y^{\prime}(u, v) = \sigma(\mathrm{FC}(Y)) \odot Y(u, v) \tag{8}$$

where ⊙ denotes element-wise multiplication. This operation enables the model to dynamically “re-weight” different frequency bands. In dim-light scenarios, the attention mechanism is trained to assign higher weights to high-frequency components that represent cigarette boundaries, effectively “boosting” the signal that would otherwise be lost in the spatial domain.

Finally, the enhanced frequency features $Y^{\prime}(u, v)$ are mapped back to the spatial domain using the Inverse Discrete Cosine Transform (IDCT), which applies the same normalization factors as Equation (7):

$$X^{\prime}(x, y) = \sum_{u = 0}^{P - 1} \sum_{v = 0}^{P - 1} \alpha(u)\,\alpha(v)\, Y^{\prime}(u, v) \cos\left[\frac{(2x + 1) u \pi}{2P}\right] \cos\left[\frac{(2y + 1) v \pi}{2P}\right] \tag{9}$$

By integrating FSAB, CDPA-Net reinforces its pose-aware representation in environments where spatial cues alone are unreliable. This frequency-domain approach offers two distinct advantages over traditional spatial attention. First, it captures long-range structural dependencies more efficiently through spectral modeling. Second, it provides better interpretability, as the attention weights can be understood as an adaptive filter tuned to specific smoking-related textures. Figure 9 shows the structure of FSAB.
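The transform, re-weighting, and inverse-transform steps of FSAB can be sketched on a single 8 × 8 block. This is a minimal illustration under stated assumptions: it handles one channel, the gate acts per frequency index rather than per channel, and the `logits` values are hand-set stand-ins for the output of the learned FC layer. The function names are hypothetical.

```python
import math

P = 8  # block size used in the text


def alpha(k):
    """Normalization factor of Eq. (7)."""
    return math.sqrt((1.0 if k == 0 else 2.0) / P)


def cos_term(i, k):
    """Cosine basis value for spatial index i and frequency index k."""
    return math.cos((2 * i + 1) * k * math.pi / (2 * P))


def dct2(block):
    """Eq. (7): orthonormal 2D DCT of one P x P block."""
    return [[alpha(u) * alpha(v)
             * sum(block[x][y] * cos_term(x, u) * cos_term(y, v)
                   for x in range(P) for y in range(P))
             for v in range(P)] for u in range(P)]


def idct2(coeffs):
    """Inverse of dct2; the alpha factors sit inside the sum (orthonormal form)."""
    return [[sum(alpha(u) * alpha(v) * coeffs[u][v]
                 * cos_term(x, u) * cos_term(y, v)
                 for u in range(P) for v in range(P))
             for y in range(P)] for x in range(P)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reweight(coeffs, logits):
    """Eq. (8)-style gating: scale each coefficient by a sigmoid weight."""
    return [[sigmoid(logits[u][v]) * coeffs[u][v] for v in range(P)]
            for u in range(P)]


# Dim block: low mean intensity plus a faint vertical edge between columns 3 and 4.
block = [[0.10 if y < 4 else 0.12 for y in range(P)] for x in range(P)]
coeffs = dct2(block)

# Hypothetical logits favoring higher frequencies (stand-in for the FC output).
logits = [[float(u + v - 2) for v in range(P)] for u in range(P)]
enhanced = idct2(reweight(coeffs, logits))

# Without gating, the transform pair is an identity up to rounding.
round_trip = idct2(coeffs)
round_trip_error = max(abs(round_trip[x][y] - block[x][y])
                       for x in range(P) for y in range(P))
```

With these logits the DC term, which carries the block's mean brightness, is attenuated more strongly than the coefficients that encode the edge, so the contrast of the edge relative to the background increases. In a trained model the weights would come from the attention branch instead of a fixed rule.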

3.5. MFIM Module

Effective smoking detection requires the seamless integration of local fine-grained details (e.g., cigarette contours) and global semantic context (e.g., smoker posture and hand–mouth alignment). However, traditional spatial-domain fusion methods, such as element-wise addition or concatenation, often suffer from the semantic gap between features of varying resolutions. In cluttered industrial environments, direct spatial fusion can lead to feature misalignment and the propagation of background noise, particularly when merging high-resolution structural cues from the ODCE module with mid-level frequency signals from FSAB. To address these challenges, we propose MFIM. The rationale behind MFIM is to perform cross-scale integration within a unified spectral space, where structural and semantic cues can be effectively aligned and enhanced while suppressing redundant spatial interference.

MFIM utilizes DCT to bridge features across resolutions with frequency awareness. High-resolution features preserve critical local structures, such as finger contours and cigarette edges, whereas low-resolution features provide the necessary semantic context and shape continuity. To initiate fusion, the low-resolution feature map is first spatially interpolated to match the dimensions of the high-resolution counterpart. Subsequently, $1 \times 1$ convolutions are employed for channel compression and initial alignment:

$$X_{HR\text{-}compressed} = W_{HR} \times X_{HR} \tag{10}$$

$$X_{LR\text{-}compressed} = W_{LR} \times X_{LR\text{-}upsampled} \tag{11}$$

where $W_{HR}$ and $W_{LR}$ denote the learnable convolution kernels. Following compression, the high-resolution features are projected into the frequency domain via 2D DCT to extract global spectral statistics:

$$Y(u, v) = \alpha(u)\,\alpha(v) \sum_{x = 0}^{H - 1} \sum_{y = 0}^{W - 1} X_{HR\text{-}compressed}(x, y) \cos\left[\frac{(2x + 1) u \pi}{2H}\right] \cos\left[\frac{(2y + 1) v \pi}{2W}\right] \tag{12}$$

To dynamically modulate the fusion process, MFIM incorporates a frequency-aware channel attention mechanism. By computing a spectral descriptor $Y_{mean}$, the model generates weights that prioritize frequency bands representing salient smoking-related structures:

$$Y^{\prime}(u, v) = \sigma(\mathrm{FC}(Y_{mean})) \odot Y(u, v) \tag{13}$$

After spectral enhancement, the features are restored to the spatial domain via IDCT, using the same normalization factors as Equation (12):

$$X_{HR}^{\prime}(x, y) = \sum_{u = 0}^{H - 1} \sum_{v = 0}^{W - 1} \alpha(u)\,\alpha(v)\, Y^{\prime}(u, v) \cos\left[\frac{(2x + 1) u \pi}{2H}\right] \cos\left[\frac{(2y + 1) v \pi}{2W}\right] \tag{14}$$

The final pose-aware representation $X_{fused}$ is obtained by concatenating the frequency-aligned high-resolution features $X_{HR}^{\prime}$ with the compressed low-resolution context $X_{LR\text{-}compressed}$ along the channel dimension:

$$X_{fused} = \mathrm{FusionLayer}(X_{HR}^{\prime}, X_{LR\text{-}compressed}) \tag{15}$$

This integrated design enables CDPA-Net to maintain robust detection across diverse spatial resolutions, effectively filtering out environmental noise while reinforcing the directional awareness of small smoking targets. Figure 10 shows the structure of MFIM.
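The data flow of Equations (10)–(15) can be traced on tiny nested-list tensors. The following is a rough sketch under several assumptions: nearest-neighbour interpolation for the upsampling step, the per-channel mean of the DCT coefficients as $Y_{mean}$, plain channel concatenation as the fusion layer, and fixed rather than learned weights. All function names are hypothetical.

```python
import math


def basis(n):
    """Orthonormal DCT-II matrix: B[k][i] = alpha(k) * cos((2i + 1) k pi / (2n))."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / n)
             * math.cos((2 * i + 1) * k * math.pi / (2 * n))
             for i in range(n)] for k in range(n)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(x):
    """Eq. (12) in matrix form: Y = B_H X B_W^T."""
    return matmul(matmul(basis(len(x)), x), transpose(basis(len(x[0]))))


def idct2(y):
    """Inverse transform: X = B_H^T Y B_W."""
    return matmul(matmul(transpose(basis(len(y))), y), basis(len(y[0])))


def upsample(channel, factor):
    """Nearest-neighbour interpolation (the interpolation mode is an assumption)."""
    return [[channel[i // factor][j // factor]
             for j in range(len(channel[0]) * factor)]
            for i in range(len(channel) * factor)]


def conv1x1(x, weights):
    """Eqs. (10)-(11): per-pixel channel mixing; weights[o][c] maps channel c to o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(row[c] * x[c][i][j] for c in range(len(x))) for j in range(w)]
             for i in range(h)] for row in weights]


def channel_gate(spectra, fc):
    """Eq. (13): sigmoid(FC(Y_mean)), with Y_mean taken as the per-channel mean."""
    y_mean = [sum(map(sum, y)) / (len(y) * len(y[0])) for y in spectra]
    logits = [sum(w * m for w, m in zip(row, y_mean)) for row in fc]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def mfim(x_hr, x_lr, w_hr, w_lr, fc):
    """Compress, gate the high-resolution branch in the DCT domain, then fuse."""
    factor = len(x_hr[0]) // len(x_lr[0])
    hr = conv1x1(x_hr, w_hr)
    lr = conv1x1([upsample(ch, factor) for ch in x_lr], w_lr)
    spectra = [dct2(ch) for ch in hr]
    gate = channel_gate(spectra, fc)
    hr_enhanced = [idct2([[g * v for v in row] for row in y])
                   for g, y in zip(gate, spectra)]
    return hr_enhanced + lr  # Eq. (15): concatenation along the channel axis


identity = [[1.0, 0.0], [0.0, 1.0]]
x_hr = [[[float(c + i + j) for j in range(4)] for i in range(4)] for c in range(2)]
x_lr = [[[float(c + 2 * i + j) for j in range(2)] for i in range(2)] for c in range(2)]
fused = mfim(x_hr, x_lr, identity, identity, fc=[[0.0, 0.0], [0.0, 0.0]])
```

With the all-zero `fc` used here, every gate equals 0.5, so the high-resolution channels come back scaled by one half while the upsampled low-resolution channels pass through unchanged. Trained weights would instead produce channel-specific gates that depend on the spectral content.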

4. Experiment

4.1. Experimental Platform and Parameter Settings

The experimental hardware and software configurations are summarized in Table 3, while the detailed training hyperparameter settings are provided in Table 4. The model training was conducted over 300 epochs using a cosine annealing schedule to adaptively adjust the learning rate.
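For reference, a single-cycle cosine annealing schedule over the 300 epochs can be written in a few lines. The sketch below assumes no warm-up and no restarts, and the learning-rate bounds `LR_MAX` and `LR_MIN` are hypothetical placeholders rather than the values in Table 4.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min=0.0):
    """Learning rate at `epoch` under a single-cycle cosine annealing schedule."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


TOTAL_EPOCHS = 300           # from the text
LR_MAX, LR_MIN = 1e-2, 1e-4  # hypothetical values, not taken from Table 4

schedule = [cosine_annealing_lr(e, TOTAL_EPOCHS, LR_MAX, LR_MIN)
            for e in range(TOTAL_EPOCHS + 1)]
```

The rate starts at `LR_MAX`, passes through the midpoint of the two bounds at epoch 150, and reaches `LR_MIN` at epoch 300, decaying slowly at both ends and fastest in the middle.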

4.2. Evaluation Metrics

To quantitatively evaluate the detection performance of the proposed model, we employ Average Precision (AP), mean Average Precision (mAP), mAP at a 50% Intersection-over-Union (IoU) threshold (mAP50), mAP averaged over IoU thresholds ranging from 50% to 95% (mAP50-95), and the F1 score. Furthermore, to assess the model’s size and efficiency, we report floating-point operations (FLOPs), model size, number of parameters (Params), and frames per second (FPS). The computational formulas for precision, recall, AP, and mAP are presented in Equations (16)–(19):

P = \frac{TP}{TP + FP}

(16)

R = \frac{TP}{TP + FN}

(17)

AP = \int_{0}^{1} P (R) \, dR

(18)

mAP = \frac{1}{N} \sum_{i = 1}^{N} AP (i)

(19)

Here, $TP$ stands for true positive examples, referring to the number of correctly detected targets; $FP$ stands for false positive examples, referring to the number of incorrectly detected targets; and $FN$ stands for false negative examples, referring to the number of missed targets. In Equation (16), $P$ represents precision. In Equation (17), $R$ represents recall. In Equation (19), $N$ represents the total number of categories.
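Equations (16)–(19) can be sketched in Python as follows. The F1 Score is computed with its standard definition, the harmonic mean of precision and recall. The integral in Equation (18) is approximated with all-point interpolation over a monotone precision envelope; this is one common convention, and the text does not state which one the evaluation used.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Equations (16) and (17) from raw detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def average_precision(recalls, precisions):
    """Equation (18): area under P(R), using all-point interpolation (assumed)."""
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float), [1.0]))
    p = np.concatenate(([0.0], np.asarray(precisions, dtype=float), [0.0]))
    for i in range(len(p) - 2, -1, -1):   # make precision non-increasing in recall
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]    # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

def mean_average_precision(ap_per_class):
    """Equation (19): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```

As a consistency check, `f1_score(0.898, 0.880)` evaluates to about 0.889, which agrees with the precision, recall, and F1 values reported for CDPA-Net in Table 5.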

4.3. Comparative Experiments of Different Models

To thoroughly validate the effectiveness of the proposed CDPA-Net model, we compared it against various state-of-the-art (SOTA) models on the ICSSD dataset: ATSS, Cascade R-CNN, DDQ, D-FINE-m, DINO, Faster R-CNN, FCOS, GFL, Mamba-YOLO-B, RetinaNet, RTDETRV2-R18, YOLOX, YOLOV8m, YOLOV9, YOLOV10m, YOLO11m, and YOLO12s. Table 5 shows the results of the different models. CDPA-Net achieved the best precision, recall, mAP50, mAP50-95, and F1-score among all the tested object detection models. Notably, its precision was 0.020 higher than that of the second-best model, YOLO11m, and its recall was 0.010 higher than that of the second-best model, DINO. In terms of mAP50, CDPA-Net surpassed the second-best D-FINE-m by 0.033. It achieved a mAP50-95 score 0.020 higher than YOLOV9 and an F1-score 0.022 higher than YOLO11m. By leveraging multi-scale pose-aware features, CDPA-Net effectively isolates smoking behaviors from noisy and dynamic backgrounds. The results indicate that CDPA-Net is particularly effective at extracting small targets in indoor work site scenarios, outperforming the other methods in this challenging application.

CDPA-Net also combines a significantly reduced memory footprint with high throughput. Complexity analysis shows that the model requires only 7.0 GFLOPs and has 2.396 M parameters, giving it the smallest parameter count and file size among the compared models. Compared to YOLOX, the model with the next-smallest parameter count, CDPA-Net reduces the parameter count by 2.634 M and the file size by 35.3 MB. Regarding inference cost, the framework achieves a processing rate of 112 FPS, which translates to a latency of approximately 8.9 ms per frame [4]. This balance between accuracy and computational efficiency makes CDPA-Net highly suitable for indoor work sites, where real-time performance and resource constraints are critical. Table 6 lists the parameters of different models.
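The arithmetic behind these figures can be reproduced with a short script. The numbers are copied from Table 6; the dictionary layout and variable names are purely illustrative.

```python
def latency_ms(fps):
    """Average per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps

# Values taken from Table 6.
cdpa = {"params_m": 2.396, "size_mb": 4.89, "fps": 112}
yolox = {"params_m": 5.03, "size_mb": 40.2, "fps": 79}

param_reduction = yolox["params_m"] - cdpa["params_m"]  # in millions of parameters
size_reduction = yolox["size_mb"] - cdpa["size_mb"]     # in MB

print(f"latency: {latency_ms(cdpa['fps']):.1f} ms")     # latency: 8.9 ms
print(f"parameter reduction: {param_reduction:.3f} M")  # parameter reduction: 2.634 M
print(f"size reduction: {size_reduction:.1f} MB")       # size reduction: 35.3 MB
```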

Figure 11 presents the inference results of the different models on a set of sample images. These images cover several common hard-to-detect scenarios in indoor work sites: various smoking postures, dim light conditions, complex environments, and multiple smokers. Many existing models fail to detect smoking targets in these scenes. In contrast, CDPA-Net accomplishes the task while being more lightweight than the other models capable of it, which makes it well suited to object detection in challenging indoor work site environments.

Images 1 and 2 present various smoking postures. In image 1, the smoking targets were not detected by Cascade R-CNN, DDQ, FCOS, RetinaNet, RTDETRV2-R18, YOLOV8m, YOLOV9, and YOLOV10m. In image 2, ATSS, Cascade R-CNN, DDQ, Faster R-CNN, FCOS, GFL, RetinaNet, RTDETRV2-R18, YOLOV8m, YOLOV9, YOLOX, and YOLO11m did not detect the smoking targets. Images 3 and 4 present dim light conditions. In image 3, the smoking targets were missed by Cascade R-CNN, Faster R-CNN, Mamba-YOLO-B, RTDETRV2-R18, YOLOV10m, and YOLO12s. In image 4, the smoking targets were not detected by GFL and RetinaNet. These models exhibit deficiencies in extracting the small-object features of smoking in indoor work sites under various conditions.

Under the complex environment shown in image 5, ATSS, DINO, Faster R-CNN, FCOS, GFL, Mamba-YOLO-B, YOLOV8m, YOLOV9, YOLOV10m, YOLOX, and YOLO11m were unable to detect the smokers. Image 6 contains not only multiple people but also heavy interference of different types; in particular, the smoker’s clothing significantly interfered with detection. Cascade R-CNN, DINO, Faster R-CNN, Mamba-YOLO-B, RetinaNet, RTDETRV2-R18, YOLOV8m, YOLOV9, YOLOV10m, YOLO11m, YOLO12s, and CDPA-Net successfully completed detection on this image. For images 7 and 8, which feature multiple smokers, only CDPA-Net and DINO identified all smokers, and CDPA-Net achieved higher prediction scores than DINO.

Overall, CDPA-Net detected all smoking targets across the images, including the challenging targets in images 3, 5, 7, and 8. Because the samples include varying smoking postures, dim light conditions, complex environments, and multiple smokers, they reflect the working characteristics of indoor work sites.

4.4. Ablation Study on Network Improvements

To verify the effectiveness of the proposed improvements (ODCE, FSAB, and MFIM), ablation experiments were conducted on these modules. The results show that the full CDPA-Net outperforms the second-best variant, which applies only ODCE and MFIM, on every metric: its precision is higher by 0.007, its recall by 0.004, its mAP50 by 0.017, its mAP50-95 by 0.035, and its F1 Score by 0.006. Additionally, every variant improves on the original baseline. Table 7 shows the ablation study results.

To further assess the contribution of each component in CDPA-Net, we conducted a visual analysis using the Grad-CAM technique to generate attention heatmaps [47]. These maps highlight the regions that the model focused on when making predictions. As shown in Figure 12, the baseline model and single-module variants exhibit limited focus and often misidentify background areas as targets. Specifically, without ODCE the model fails to capture cigarette direction and smoker contour features. The absence of FSAB leads to inadequate attention to high-frequency features. Similarly, removing MFIM results in a lack of contextual integration across scales. In contrast, the complete CDPA-Net shows enhanced focus on the key smoking regions, such as the fingers holding the cigarette and the mouth contours, even under challenging background conditions.

These results confirm that the individual modules alone are insufficient to extract reliable smoking features in complex indoor work site scenes. Only through the combined design of ODCE, FSAB, and MFIM can the model effectively suppress background interference and accurately localize subtle smoking cues. The visual evidence supports the effectiveness of the full CDPA-Net architecture and validates the design rationale behind the module integration.
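For readers unfamiliar with Grad-CAM [47], its core computation can be sketched as below: the gradients of a target score with respect to a layer's activations are averaged over space to give one weight per channel, and the weighted sum of the activations is passed through a ReLU. The sketch assumes the activations and gradients have already been extracted from the network; the final scaling to [0, 1] is a common visualization step added here as an assumption, not a detail taken from the paper.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from (K, H, W) activations and matching gradients."""
    weights = gradients.mean(axis=(1, 2))             # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum, shape (H, W)
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam            # assumed [0, 1] scaling

# Tiny hand-made example with two channels on a 2 x 2 map.
acts = np.array([[[1.0, 1.0], [1.0, 1.0]],
                 [[1.0, 0.0], [0.0, 0.0]]])
grads = np.array([[[-1.0, -1.0], [-1.0, -1.0]],
                  [[2.0, 2.0], [2.0, 2.0]]])
heatmap = grad_cam(acts, grads)
```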

4.5. Comparative Experiments on Different Datasets

To evaluate the cross-dataset generalization of CDPA-Net, we conducted experiments on the public SCAU-SD dataset. This dataset primarily consists of daily-life images with clean backgrounds and salient visual cues, representing a significantly different domain from cluttered industrial work sites. To ensure a rigorous assessment of transferability, the training process excluded all data from the ICSSD dataset. We compared our model against HOLT-Net [8], which adopts a human–object interaction (HOI) paradigm. While HOLT-Net relies on close-up views and clear interactions, CDPA-Net maintains robustness across varying scales. The dataset and code are publicly available at https://github.com/JackKoLing/HOLT-Net (accessed on 15 November 2023).

Unlike HOLT-Net, CDPA-Net is designed to handle the challenges posed by more complex and dynamic environments. Even when trained exclusively on the simpler SCAU-SD dataset, CDPA-Net outperforms HOLT-Net by 0.0301 in recall and by 0.102 in mAP50. CDPA-Net has 4.424 M fewer parameters than HOLT-Net, while their FLOPs are almost the same. These results reflect the robustness and transferability of our model’s spectral–spatial design, particularly its ability to capture fine-grained features and contextual cues under various conditions. Table 8 presents a comparison of the performance metrics. Parameters that are unavailable in the original HOLT-Net implementation are denoted with “-”.

5. Discussion

In this study, we proposed CDPA-Net, a contour-feature-driven detection framework for real-time monitoring in indoor work sites. The architecture addresses key challenges such as background interference, dim light, and multi-scale features, and it uses pose-aware features to achieve robust detection performance for safety production.

The ODCE module enhances contour extraction by capturing the directional information of dangerous behaviors such as smoking. Because standard convolutions have limitations in capturing non-grid-aligned edges, ODCE uses NSCT to emphasize structural edges and boost directional saliency. The design remains robust under occlusion and effectively distinguishes behavior contours from background noise.

We introduced FSAB to capture fine-grained features. FSAB uses DCT to convert features to the frequency domain, integrates high-frequency textures under dim light, and highlights the edges of minute targets. Its frequency-domain attention is selective: it enhances spectral components while suppressing lighting interference.

MFIM integrates features at different resolutions, merging high-resolution spectral cues with low-resolution context through adaptive attention. This fusion strengthens the detection of local details while preserving global information, which allows CDPA-Net to maintain high accuracy on multi-scale objects in dynamic environments.

CDPA-Net is designed with real-time applicability and lightweight deployment in mind. The model achieves an inference speed of 112 FPS while maintaining high detection accuracy (mAP50 of 89.2%). This makes it a promising candidate for deployment on edge devices such as the NVIDIA Jetson series. A future engineering implementation is envisioned with a cloud–edge collaborative architecture, in which edge devices perform fast local inference while cloud servers handle model updates and large-scale analysis. This setup reduces latency and bandwidth requirements while enabling scalable and responsive industrial safety monitoring. The proposed deployment framework is illustrated in Figure 13.

Finally, extensive experiments conducted on the newly constructed ICSSD dataset demonstrate the superior performance of CDPA-Net compared to state-of-the-art detection models. The ablation study confirms the effectiveness of each proposed module. CDPA-Net outperforms the compared models in precision, recall, mAP, and F1 score while running at a competitive inference speed, particularly in small-object detection under complex lighting and complex backgrounds. This confirms its suitability for real-world industrial applications that require reliable and efficient behavior monitoring.

6. Conclusions

CDPA-Net addresses safety challenges in indoor work sites, focusing on small-object detection for safety production. We proposed three task-specific modules, ODCE, FSAB, and MFIM, which leverage both spatial- and frequency-domain features to cope with challenges such as poor lighting and cluttered backgrounds. We validated the framework on indoor smoking detection, where the results demonstrate high accuracy and real-time performance. Future work will focus on handling motion-degraded frames by incorporating temporal cues and synthetic motion augmentation. This research provides a strong foundation for safety monitoring and offers a new technical framework for similar vision tasks.

Author Contributions

M.W.: Writing—review and editing, Writing—original draft, Visualization, Validation, Software, Methodology, Investigation, Conceptualization. M.L.: Writing—review and editing, Resources, Project administration, Methodology, Funding acquisition, Conceptualization. C.H.: Writing—review and editing, Project administration, Methodology, and Conceptualization. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science Research Project of Hebei Education Department, “Research on Image Denoising Methods Based on Diffusion Model”, under Grant ZC2025081.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author Meng Wang was employed by the company Beijing Qishan Chuangzhi Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CDPA-Net	Contour-Driven Pose-Aware Network.
ODCE	Orientation-Driven Contour Extractor.
FSAB	Frequency-Sensitive Attention Block.
MFIM	Multi-Scale Frequency Integration Module.
NSCT	Nonsubsampled Contourlet Transform.
DCT	Discrete Cosine Transform.
mAP	mean Average Precision.

References

1. Wang, Z.; Liu, Y.; Lei, L.; Shi, P. Smoking-YOLOv8: A novel smoking detection algorithm for chemical plant personnel. Pattern Anal. Appl. 2024, 27, 72.
2. Shi, D.; Gan, S.; Zurada, J.; Guan, J.; Wang, F.; Weichbroth, P. A multi-model approach to construction site safety: Fault trees, Bayesian networks, and ontology reasoning. Expert Syst. Appl. 2025, 288, 127817.
3. Fu, Y.; Ran, T.; Xiao, W.; Yuan, L.; Zhao, J.; He, L.; Mei, J. GD-YOLO: An improved convolutional neural network architecture for real-time detection of smoking and phone use behaviors. Digit. Signal Process. 2024, 151, 104554.
4. Pandey, N.N.; Pati, A.; Maurya, R. DriSm_YNet: A breakthrough in real-time recognition of driver smoking behavior using YOLO-NAS. Neural Comput. Appl. 2024, 36, 18413–18432.
5. Wang, M.; Li, M.; He, C. CCDM: Causality-guided contourlet diffusion models for contour-preserving image restoration in indoor work sites. Expert Syst. Appl. 2026, 306, 130912.
6. Malagoli, E.; Di Persio, L. 2D Object Detection: A Survey. Mathematics 2025, 13, 893.
7. Khan, A.; Somaiya Khan, B.H.; Ahmed, R.; Zheng, Z. SmokerViT: A Transformer-Based Method for Smoker Recognition. Comput. Mater. Contin. 2023, 77, 403–424.
8. Ling, H.B.; Huang, D.; Cui, J.; Wang, C.D. HOLT-Net: Detecting smokers via human–object interaction with lite transformer network. Eng. Appl. Artif. Intell. 2023, 126, 106919.
9. Wang, M.; Li, M. CASIA-Net: An indoor work site smoking detection framework. Comput. Ind. 2025, 173, 104383.
10. Yang, J.; Sanjuan, M.A. Aperiodic Resonances-Theory and Applications; Springer: Berlin/Heidelberg, Germany, 2025.
11. Wang, Z.; Lei, L.; Shi, P. Smoking behavior detection algorithm based on YOLOv8-MNC. Front. Comput. Neurosci. 2023, 17, 1243779.
12. Yang, J.; Rajasekar, S.; Sanjuán, M.A. Vibrational resonance: A review. Phys. Rep. 2024, 1067, 1–62.
13. Shao, Y.; Sun, L.; Jiao, L.; Liu, X.; Liu, F.; Li, L.; Yang, S. CoT: Contourlet transformer for hierarchical semantic segmentation. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 132–146.
14. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2025; pp. 363–380.
15. Khan, A.; Elhassan, M.A.; Khan, S.; Deng, H. Deep learning-based smoker classification and detection: An overview and evaluation. Expert Syst. Appl. 2025, 267, 126208.
16. Rekavandi, A.M.; Xu, L.; Boussaid, F.; Seghouane, A.K.; Hoefs, S.; Bennamoun, M. A guide to image- and video-based small object detection using deep learning: Case study of maritime surveillance. IEEE Trans. Intell. Transp. Syst. 2025, 26, 2851–2879.
17. Cheng, Q.; Cai, Z.; Lin, Y.; Li, J.; Lan, T. CE-FPN-YOLO: A Contrast-Enhanced Feature Pyramid for Detecting Concealed Small Objects in X-Ray Baggage Images. Mathematics 2025, 13, 4012.
18. Sohan, M.; Sai Ram, T.; Rami Reddy, C.V. A review on YOLOv8 and its advancements. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics; Springer: Berlin/Heidelberg, Germany, 2024; pp. 529–545.
19. Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024; Springer: Berlin/Heidelberg, Germany, 2024.
20. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458.
21. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725.
22. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524.
23. Wang, Z.; Li, C.; Xu, H.; Zhu, X. Mamba YOLO: SSMs-Based YOLO For Object Detection. arXiv 2024, arXiv:2406.05835.
24. Chen, L.; Fu, Y.; Gu, L.; Yan, C.; Harada, T.; Huang, G. Frequency-Aware Feature Fusion for Dense Image Prediction. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 10763–10780.
25. He, C.; Shi, H.; Liu, X.; Li, J. Interpretable physics-informed domain adaptation paradigm for cross-machine transfer diagnosis. Knowl.-Based Syst. 2024, 288, 111499.
26. Ahmad, M.; Ghous, U.; Usama, M.; Mazzara, M. WaveFormer: Spectral–spatial wavelet transformer for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. 2024, 21, 5502405.
27. Liu, M.; Jiao, L.; Liu, X.; Li, L.; Liu, F.; Yang, S.; Zhang, X. Bio-Inspired Multi-Scale Contourlet Attention Networks. IEEE Trans. Multimed. 2024, 26, 2824–2837.
28. Li, X.; Lv, C.; Wang, W.; Li, G.; Yang, L.; Yang, J. Generalized Focal Loss: Towards Efficient Representation Learning for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3139–3153.
29. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A Simple and Strong Anchor-Free Object Detector. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1922–1933.
30. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
31. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768.
32. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 13619–13627.
33. Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140.
34. Vinh, A.T.; Vinh, N.M.; Si, N.L.; Quoc, D.D.; Khanh, D.T.; Hoang, N.N.; Do, T.; Ngo, T.D.; Le, D.D.; Satoh, S. A lightweight and data-centric framework for real-time object detection in fisheye camera. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Honolulu, HI, USA, 19–20 October 2025; pp. 5441–5448.
35. He, C.; Shi, H.; Liao, J.X.; Liu, B.; Liu, Q.; Li, J.; Yu, Z. Prior knowledge-embedded first-layer interpretable paradigm for rail transit vehicle human–computer collaboration fault monitoring. J. Ind. Inf. Integr. 2026, 51, 101068.
36. Li, Z.; Zhang, J.; Zhang, Y.; Yan, D.; Zhang, X.; Woźniak, M.; Dong, W. FSDN-DETR: Enhancing fuzzy systems adapter with denoising anchor boxes for transfer learning in small object detection. Mathematics 2025, 13, 287.
37. Xu, G.; Liao, W.; Zhang, X.; Li, C.; He, X.; Wu, X. Haar wavelet downsampling: A simple but effective downsampling module for semantic segmentation. Pattern Recognit. 2023, 143, 109819.
38. Liao, Y.; Liu, S.; Gao, Y.; Zhang, A.; Li, Z.; Wang, F.; Li, B. PPDM++: Parallel Point Detection and Matching for Fast and Accurate HOI Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6826–6841.
39. Ren, Z. A Novel Feature Fusion-Based and Complex Contextual Model for Smoking Detection. In Proceedings of the 2024 6th International Conference on Communications, Information System and Computer Engineering (CISCE), Guangzhou, China, 10–12 May 2024; pp. 1181–1185.
40. Li, Y.; Zhou, H.; Feng, J.; Li, X.; Xu, X.; Hou, P.; Hu, X. An improved smoking behavior detection algorithm via incorporating an interference information filtering network. Eng. Appl. Artif. Intell. 2024, 136, 109050.
41. Zeng, W.; Mao, G.; Li, M.; Yin, S. Deep learning-based object detection: A comprehensive review of YOLO, RCNN, and SSD series. Electron. Res. Arch. 2026, 34, 2674–2731.
42. Qian, Q.; Zhang, J.; Luo, J.; Qin, Y. Integrated-Dispersion Manifold Distance: A New Distribution Discrepancy Metric for Machine Fault Transfer Diagnosis Under Time-Varying Conditions. IEEE Trans. Cybern. 2026, 56, 1687–1699.
43. Cai, Z.; Vasconcelos, N. Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1483–1498.
44. Zhang, S.; Wang, X.; Wang, J.; Pang, J.; Lyu, C.; Zhang, W.; Luo, P.; Chen, K. Dense distinct query for end-to-end object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7329–7338.
45. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
46. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
47. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.

Figure 1. Examples of indoor work sites that pose challenges for smoking detection. The four sub-figures respectively correspond to (a) various smoking postures, (b) dim light conditions, (c) complex working environments, and (d) multiple smokers.

Figure 2. Frequency domain features. (a) Original image; (b) directional filtering map; (c) frequency gradient map; (d) directional maps at 0°, 45°, 90°, and 135°.

Figure 3. Workflow of CDPA-Net. It includes (1) dataset construction, (2) addressing direction and contour recognition issue, (3) enhancement of the attention to critical features, and (4) validation of the constructed dataset to achieve excellent performance.

Figure 4. Comparison of two datasets. The top row presents samples from the ICSSD dataset, collected from indoor work sites. These images exhibit small smoking target regions relative to the entire frame and are characterized by cluttered, dynamic backgrounds. The bottom row shows samples from the SCAU-SD dataset, which were captured in daily life scenarios. Compared to ICSSD, the smoking targets occupy a larger area and appear against clean and relatively simple backgrounds.

Figure 5. The size distribution statistics of the two datasets. (a) The SCAU-SD dataset exhibits a wide size distribution, ranging from 0 to 600 pixels, with a predominance of large smoking targets. (b) The ICSSD dataset shows a narrower distribution, primarily between 0 and 300 pixels, consisting mainly of small targets that reflect the characteristics of indoor work site scenarios.

Figure 6. The structure of CDPA-Net.

Figure 7. The workflow of ODCE.

Figure 8. The structure of ODCE.

Figure 9. The structure of FSAB.

Figure 10. The structure of MFIM.

Figure 11. Comparison results of different detection models. The green circle represents that the smoking target has been detected. The red cross represents that the smoking target was not detected.

Figure 12. Visualization of the activation maps of CDPA-Net. (a) The full CDPA-Net, with all the proposed modules applied in the framework. (b–g) Variants in which only the listed module(s) are applied in the framework. (h) The output of the baseline.

Figure 13. Engineering application of indoor smoking detection based on CDPA-Net.

Table 1. Comparison of absolute and relative sizes of smoker objects between two datasets.

Item	ICSSD (Ours)	SCAU-SD
absolute width	40.308	179.786
absolute height	47.828	215.114
absolute area	2248.007	58,790.813
relative width	0.02702	0.0849
relative height	0.05675	0.3235
relative area	0.00182	0.0849

Table 2. Changes in the feature maps.

Level	Layer	Shape	Level	Layer	Shape
0	Conv	320 × 320 × 16	14	Conv	20 × 20 × 64
1	ODCE	320 × 320 × 16	15	MFIM	40 × 40 × 64
2	Conv	160 × 160 × 32	16	C3k2	40 × 40 × 64
3	FSAB	160 × 160 × 32	17	MFIM	80 × 80 × 64
4	Conv	80 × 80 × 64	18	C3k2	80 × 80 × 64
5	C3k2	80 × 80 × 128	19	Conv	80 × 80 × 64
6	Conv	40 × 40 × 128	20	BiFPN	80 × 80 × 64
7	C3k2	40 × 40 × 128	21	C3k2	80 × 80 × 64
8	Conv	20 × 20 × 256	22	Conv	40 × 40 × 64
9	C3k2	20 × 20 × 256	23	BiFPN	40 × 40 × 64
10	SPPF	20 × 20 × 256	24	C3k2	40 × 40 × 128
11	C2PSA	20 × 20 × 256	25	Conv	20 × 20 × 64
12	Conv	80 × 80 × 64	26	BiFPN	20 × 20 × 64
13	Conv	40 × 40 × 64	27	C3k2	20 × 20 × 256

Table 3. Software and hardware configuration table.

Configuration	Name	Specific Information
Hardware Environment	GPU	NVIDIA GeForce RTX 4090
Hardware Environment	GPU Memory	24 GB
Software Environment	Operating System	Ubuntu 18.04
Software Environment	Programming Language	Python 3.9
Software Environment	Deep Learning Framework	PyTorch 2.0.1

Table 4. Training hyperparameter settings.

Parameter Name	Parameter Value
Image size	640 × 640 pixels
Optimizer	SGD
Epochs	300
Batch size	8
Initial Learning Rate	0.001
Weight Decay	$1 \times 10^{- 4}$
Momentum Factor	0.937
Confidence Threshold	0.45

Table 5. Comparison of detection accuracy among different models. Values in bold represent the best performance results.

Model	P	R	mAP50	mAP50-95	F1-Score
ATSS [31]	0.836	0.822	0.818	0.302	0.829
Cascade R-CNN [43]	0.779	0.840	0.799	0.287	0.808
DDQ [44]	0.765	0.858	0.818	0.298	0.809
D-FINE-m [34]	0.859	0.849	0.859	0.282	0.854
DINO [32]	0.814	0.870	0.801	0.286	0.841
Faster R-CNN [45]	0.779	0.839	0.796	0.279	0.808
FCOS [29]	0.829	0.794	0.772	0.269	0.811
GFL [28]	0.685	0.834	0.794	0.283	0.752
Mamba-YOLO-B [23]	0.877	0.847	0.851	0.296	0.862
RetinaNet [30]	0.665	0.825	0.786	0.278	0.736
RTDETRV2-R18 [33]	0.849	0.832	0.838	0.278	0.841
YOLOX [46]	0.557	0.678	0.630	0.199	0.611
YOLOV8m [18]	0.804	0.789	0.805	0.307	0.796
YOLOV9 [19]	0.828	0.804	0.832	0.323	0.816
YOLOV10m [20]	0.782	0.733	0.778	0.299	0.757
YOLO11m [21]	0.878	0.857	0.855	0.288	0.867
YOLO12s [22]	0.874	0.851	0.845	0.296	0.862
CDPA-Net (Ours)	0.898	0.880	0.892	0.343	0.889

Table 6. Comparison of parameters among different models. Values in bold represent the best performance results.

Model	FLOPs (G)	Size (MB)	Params (M)	FPS
ATSS [31]	5.35	151	28.27	65
Cascade R-CNN [43]	51.19	269	69.18	28
DDQ [44]	7.29	281	32.06	70
D-FINE-m [34]	24.819	157	10.18	67
DINO [32]	7.23	705	31.99	45
Faster R-CNN [45]	23.37	161	41.36	25
FCOS [29]	9.67	126	32.11	34
GFL [28]	10.05	125	32.26	31
Mamba-YOLO-B [23]	49.7	39.4	20.501	136
RetinaNet [30]	10.1	141	36.37	29
RTDETRV2-R18 [33]	60	307.47	20.083	57
YOLOX [46]	0.93	40.2	5.03	79
YOLOV8m [18]	79.3	49.6	25.903	54
YOLOV9 [19]	263.9	465	60.501	24
YOLOV10m [20]	63.4	31.9	16.454	81
YOLO11m [21]	67.6	38.6	20.031	104
YOLO12s [22]	21.2	18	9.231	74
CDPA-Net (Ours)	7	4.89	2.396	112

Table 7. Comparison of ablation models. The symbol ✓ indicates that the module was used in the ablation experiment. The row without any ✓ symbol is the baseline model, and the row with all three ✓ symbols is CDPA-Net with all the improved modules. Values in bold represent the best performance results.

Baseline	ODCE	FSAB	MFIM	P	R	mAP50	mAP50-95	F1 Score
YOLO11n				0.853	0.816	0.835	0.287	0.834
	✓			0.872	0.853	0.861	0.295	0.862
		✓		0.869	0.849	0.855	0.291	0.859
			✓	0.865	0.843	0.842	0.289	0.854
	✓		✓	0.891	0.876	0.875	0.308	0.883
	✓	✓		0.871	0.834	0.862	0.304	0.8516
		✓	✓	0.869	0.849	0.855	0.291	0.859
	✓	✓	✓	0.898	0.880	0.892	0.343	0.889

Table 8. Validation on the SCAU-SD dataset. Comparison of CDPA-Net with HOLT-Net with (w) and without (w/o) the lite transformer module (LTM), and with (w) and without (w/o) the interaction detector branch (IDB). Values in bold represent the best performance results.

Model	P	R	mAP50	mAP50-95	F1 Score	Params (M)	FLOPs (G)	FPS
HOLT-Net w/o LTM	-	0.7861	0.7496	-	-	-	-	-
HOLT-Net w LTM	-	0.8139	0.788	-	-	-	-	-
HOLT-Net w/o IDB	-	0.8083	0.7704	-	-	-	-	-
HOLT-Net w IDB	-	0.8139	0.788	-	-	6.82	6.93	56
CDPA-Net	0.833	0.844	0.890	0.531	0.838	2.396	7	112

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, M.; Li, M.; He, C. CDPA-Net: An Indoor Work Sites Smoking Detection Framework Based on Contour-Driven Pose-Aware Feature Learning. Mathematics 2026, 14, 1462. https://doi.org/10.3390/math14091462

