Article

Early Wildfire Smoke Detection with a Multi-Resolution Framework and Two-Stage Classification Pipeline

1 Department of Computer Science, Saint Louis University, Saint Louis, MO 63103, USA
2 Department of Artificial Intelligence and Data Science, Sejong University, Seoul 05006, Republic of Korea
* Author to whom correspondence should be addressed.
Submission received: 10 January 2026 / Revised: 8 February 2026 / Accepted: 17 February 2026 / Published: 19 February 2026

Abstract

Early wildfire smoke detection is critical for preventing small ignitions from escalating into large-scale fires, yet early-stage smoke plumes are often faint, low-contrast, and spatially small. When full-resolution frames are resized to satisfy fixed-input detector architectures and enable efficient batched GPU inference, these subtle cues are further diminished, leading to missed detections and unreliable scores near deployment thresholds. Existing remedies such as multi-scale inference, slicing/tiling, or super-resolution can improve sensitivity, but typically incur substantial overhead from multiple forward passes or added network components, limiting real-time use on resource-constrained platforms. To mitigate these challenges, we propose a composite multi-resolution detection framework that improves sensitivity to small smoke regions while maintaining single-pass inference. Because most operational wildfire monitoring systems rely on Unmanned Aerial Vehicle (UAV) platforms and mountain-top Closed-Circuit Television (CCTV) surveillance, their wide-field imagery typically contains a large sky region above the horizon where early smoke is most likely to first become visible. Accordingly, crop placement is guided by a skyline prior that prioritizes this high-probability sky band while retaining the remaining scene for global context. A dynamic compositing stage stacks a global view with a high-resolution, sky-aligned band into a standard square detector input, preserving context with minimal added cost. Detections from the two views are reconciled via coordinate restoration and non-maximum suppression. For deployment, a lightweight second-stage classifier selectively re-evaluates low-confidence detections to stabilize decisions near a fixed operating threshold without retraining the detector. Compared to the baseline detector, our approach improves detection performance on the Early Smoke dataset, achieving gains of +4.6 percentage points in AP @0.5:0.95, +3.4 percentage points in AP @0.5, +2.9 percentage points in precision, +5.3 percentage points in recall, and +4.3 percentage points in F1-score.

1. Introduction

Early fire detection plays a crucial role in preventing small ignition events from escalating into large-scale wildfires that cause extensive ecological damage, economic loss, and threats to human life and infrastructure [1]. Once a fire grows beyond its incipient stage, suppression becomes significantly more difficult and costly, particularly in remote or mountainous regions where access is limited and response times are long [2]. As climate change and prolonged drought conditions increase both the frequency and intensity of wildfires, the ability to identify fire-related signals at the earliest possible moment has become a central requirement for modern wildfire management systems [3]. In operational settings, many detection systems rely on vision-based monitoring using unmanned aerial vehicles (UAVs) equipped with high-resolution red, green, and blue (RGB) cameras or fixed mountain-top closed-circuit television (CCTV) surveillance systems [4]. These platforms enable wide-area coverage and continuous observation without the need for dense ground sensor deployment, making single-image smoke detection a common and scalable paradigm [5]. However, early-stage smoke presents a uniquely challenging visual target: it often occupies only a very small portion of the image, a small fraction of the total area, while exhibiting low contrast, diffuse boundaries, and strong visual similarity to clouds, haze, or terrain [6]. Despite these difficulties, smoke is typically the first visible indicator of ignition, appearing well before flames are detectable, which makes robust early smoke detection a critical component of effective wildfire prevention and response [7].
With the rise of deep learning, vision-based object detectors became a central component of wildfire monitoring pipelines in both fixed-camera and aerial surveillance settings [8]. Convolutional neural network (CNN)-based architectures, including Region-based Convolutional Neural Networks (R-CNN) and You Only Look Once (YOLO) families, were widely deployed due to their favorable accuracy-efficiency characteristics, while more recent transformer-based detectors and hybrid CNN-transformer models further advanced robustness and contextual modeling across diverse vision tasks [9,10,11]. Building on these developments, contemporary wildfire detection systems achieved substantial progress under a range of operational conditions [12]. However, detecting early-stage smoke remains a persistent challenge, as nascent plumes are often faint, low-contrast, and spatially small, with diffuse boundaries that can closely resemble confounding visual phenomena such as clouds, haze, or terrain textures [13]. A practical consideration in many deployments is that high-resolution frames are resized to fixed input dimensions to satisfy architectural constraints and enable efficient batched inference, which can attenuate subtle smoke cues and reduce their salience in downstream feature representations [14]. To mitigate these effects, a broad set of resolution-aware and small-object detection strategies has been investigated, including multi-scale inference, high-resolution processing, and image slicing/tiling, and these approaches demonstrated measurable gains for small targets in related remote-sensing and surveillance contexts [15]. Nonetheless, because real-world wildfire monitoring systems must often operate under strict latency and resource constraints, further improvements remain desirable, particularly in the early-smoke regime, toward enhancing recall and stabilizing confidence behavior near deployment thresholds while maintaining computational practicality [12].
To address the limitations of existing wildfire detection systems in the early-smoke regime, we propose a cost-efficient composite detection framework that improves sensitivity to small, faint smoke while preserving single-pass inference. The approach reallocates spatial resolution by embedding a targeted high-resolution region within a standard detector input, guided by a skyline-based spatial prior to emphasize likely smoke regions while retaining global scene context. For deployment, we incorporate a lightweight confidence refinement step that operates only in a low-confidence range to improve prediction stability without changing training or standard evaluation. The proposed pipeline avoids expensive multi-stage slicing or multi-scale inference while remaining compatible with real-time, resource-constrained settings.
Our main contributions are:
  • A composite multi-resolution representation that embeds high-resolution detail within a standard detector input using a single forward pass.
  • A skyline-guided dynamic cropping strategy with reversible coordinate mapping for accurate localization in the original image space.
  • A deployment-time confidence refinement mechanism to stabilize low-confidence early-smoke predictions.
  • A comprehensive evaluation on early-smoke benchmarks (Early Smoke and Pyro-SDIS) demonstrating consistent gains in small-object and low-confidence regimes.
The remainder of this paper is organized as follows. Section 2 reviews prior research in wildfire monitoring, deep learning for wildfire detection, and small-object detection with resolution manipulation. Section 3 describes the proposed multi-resolution detection framework and the two-stage confidence refinement pipeline. Section 4 details the datasets, training settings, and evaluation metrics. Section 5 reports experimental results, including deployment-only evaluation, ablation studies, and generalizability verification. Finally, Section 6 summarizes the main findings and implications for early-smoke detection.

2. Related Work

2.1. Wildfire Monitoring Context

Wildfires arise from a combination of ignition sources and conducive environmental conditions, including low fuel moisture, high winds, and complex topography. Ignitions may be natural (e.g., lightning) or human-caused (e.g., equipment use, power infrastructure, and debris burning), and even small ignitions can rapidly evolve into large incidents under adverse weather and fuel conditions [16]. Once established, wildfires can spread quickly and produce severe societal and ecological impacts, including loss of life and property, disruption to critical infrastructure, long-term ecosystem damage, and widespread degradation of air quality due to smoke transport [17]. These risks motivate an operational emphasis on early warning, where detecting nascent smoke plumes can enable faster dispatch and containment before initial ignitions escalate. Operational wildfire management relies on a combination of prevention, preparedness, and rapid response, supported by monitoring and situational awareness across multiple sensing modalities. Existing monitoring approaches are broadly categorized into human-based and autonomous systems. Human-based monitoring includes lookout towers, patrol flights, and public or field personnel reporting, which remain valuable but are labor-intensive, costly to scale, and inherently limited in temporal continuity and spatial coverage [18]. These limitations motivate increasing interest in automated alternatives that can operate persistently.
Autonomous monitoring systems span a range of sensing platforms operating at different spatial and temporal resolutions. Satellite-based systems provide wide-area coverage and routine global observation, making them effective for regional situational awareness but often limited in spatial resolution and revisit frequency for early-stage detection [19]. Aerial platforms, including crewed aircraft and unmanned aerial vehicles (UAVs), provide flexible, high-resolution sensing and rapid deployment capabilities, while their operational characteristics naturally introduce considerations related to cost, endurance, and weather dependence [20]. Ground-based vision systems, such as fixed CCTV camera networks deployed on towers or mountain ridges, enable continuous local surveillance and real-time visual confirmation, making them well suited for persistent monitoring of high-risk regions, while their effective coverage is influenced by terrain and camera placement [21]. Across these autonomous platforms, a wide range of analytical tools has been explored. Early rule-based approaches relied on handcrafted color and motion heuristics to detect fire or smoke in video sequences [22]. Subsequent statistical learning approaches replaced fixed rules with engineered features and trained classifiers; for example, wavelet-based smoke characterization combined with support vector machines demonstrated reliable early detection with low false-alarm rates across surveillance scenarios [23]. More recently, artificial intelligence (AI)-driven surveillance systems based on machine learning and deep learning gained growing attention, enabling end-to-end automated detection, reducing human workload and improving the timeliness and robustness of wildfire monitoring across heterogeneous sensing modalities [24].

2.2. Deep Learning for Wildfire Monitoring

Recent deep learning-based wildfire monitoring research can be organized by the primary vision task and system structure employed [25]. A first line of work formulated wildfire monitoring as an image-level classification problem, predicting whether a frame contains fire or smoke to enable rapid triage under limited computation. For example, Akagic et al. [26] proposed LW-FIRE, a lightweight CNN designed for real-time wildfire image classification with dataset transformation to improve generalization. Seydi et al. [27] introduced Fire-Net, a deep learning–based remote sensing framework that integrates optical and thermal information to enable accurate and robust detection of active forest fires across diverse environments, including small-scale fire events.
A second line focused on object detection, aiming to localize fire or smoke regions with bounding boxes for spatially explicit situational awareness and downstream tracking [28]. Modern vision systems for wildfire monitoring are commonly built on established object-detection families such as R-CNN and YOLO [9,10,11], which offer a practical balance between accuracy and real-time speed, alongside newer transformer-based and hybrid CNN-transformer designs that can better leverage broader scene context. Li et al. [29] proposed LEF-YOLO, a lightweight detector tailored for real-time identification of extreme wildfire cases and shown to perform effectively on a specialized dataset while maintaining fast inference. Bhargav and Singh [30] proposed a UAV-based wildfire detection framework that combines a CNN-based image classification model with YOLOv8 for object localization, demonstrating improved accuracy and computational efficiency over traditional methods while enabling timely fire detection from UAV imagery under challenging visual conditions.
A third line adopted semantic segmentation, producing pixel-level masks that could better represent irregular boundaries and diffuse phenomena, particularly for smoke plumes or elongated flame fronts. Jonnalagadda and Hashim [31] developed a segmentation-based UAV pipeline optimized for real-time flame detection. Khryashchev and Larionov [32] developed a satellite-image wildfire segmentation model based on U-Net with a ResNet34 encoder, trained with augmentation on high-resolution RGB imagery and evaluated using Dice and intersection over union (IoU) metrics to support early wildland fire detection.
Beyond single-task formulations, many practical systems used cascaded or multi-stage networks [33] that compose these tasks, most commonly using a lightweight classifier or coarse detector to propose candidate regions followed by a higher-capacity detector or segmenter to refine localization, thereby improving robustness while controlling inference cost across high-resolution imagery. Ghali et al. [34] proposed a UAV-based wildfire framework that combines a deep ensemble classifier built on EfficientNet-B5 and DenseNet-201 with segmentation architectures including TransUNet, TransFire, and EfficientSeg. The framework enables accurate wildfire classification and precise pixel-level delineation, remaining robust to small fire regions and complex aerial backgrounds.
Collectively, these approaches have substantially improved the reliability of fire and smoke recognition in operational imagery; however, there remains significant scope for further improvement. In particular, a large portion of prior work is evaluated on scenarios in which the target signal is visually salient (e.g., large flames, dense smoke plumes, or fully developed fire fronts), where recognition cues are strong and localization is comparatively less ambiguous. Consequently, reported gains in classification accuracy, detection mean average precision (mAP), or segmentation overlap may not fully characterize performance in settings that are most critical for proactive monitoring, where the evidence is weaker and more readily confounded by background structure and atmospheric phenomena. Moreover, common efficiency-driven preprocessing practices—most notably resizing high-resolution frames to fixed input resolutions—may suppress fine-grained spatial detail that is informative for subtle visual signatures, motivating continued research on methods that generalize beyond prominent fire imagery while remaining computationally practical for real-time deployment.

2.3. Small Object Detection and Resolution Manipulation

Broadly, early wildfire detection shares key similarities with small object detection because the first visible cues, nascent smoke wisps or small flames, often occupy very few pixels and exhibit weak visual evidence. These cues are further degraded by downsampling and compression and are often confused with clouds, fog, dust, sun glare, or terrain textures, making them difficult to detect reliably in cluttered or low-contrast scenes. Remote-sensing surveys such as Cheng et al. [35] highlighted that even strong multi-scale and context-aware designs could struggle when targets became extremely small, which aligns closely with the "tiny smoke" regime encountered in early warning settings.
Consequently, detection systems for small objects tended to intervene on scale in two complementary ways. First, some works strengthened the detector itself to be more responsive to small targets, often through improved multi-scale handling and better localization behavior in the tiny-object regime [36,37,38,39]. Second, many practical pipelines manipulated the input or processing strategy so that early smoke or flame signals remained visible to the model without requiring uniformly higher resolution over the entire frame. Tiling and slicing approaches [40,41] and their refinements [42,43,44,45,46] reflected a common operational insight: allocating more spatial detail to candidate regions could materially improve detection when the event occupied only a small portion of the scene. Related region-adaptive methods similarly prioritized resolution where it mattered most by focusing computation on predicted areas of interest rather than treating all pixels equally [47].
Collectively, these directions improved small-target performance in fire monitoring, but they also exposed trade-offs that matter for early warning. Resolution-based pipelines may introduce additional runtime and engineering complexity due to repeated per-region inference and merging, and their performance may depend on design choices such as tile size and overlap. Meanwhile, detector-only improvements may raise sensitivity broadly across the full frame rather than explicitly preserving fine detail only in likely smoke regions. These issues are especially pronounced for early smoke, which is diffuse and low-contrast and may occupy only a handful of pixels, motivating wildfire-specific strategies that allocate resolution and computation selectively while maintaining real-time practicality.

Our Approach

In summary, prior wildfire detection systems have attempted to improve early-smoke performance mainly through (i) detector-centric architectural improvements for small objects, (ii) resolution-heavy strategies such as multi-scale inference or tiling/slicing, and (iii) multi-stage pipelines that cascade classification, detection, or segmentation. While these approaches can raise sensitivity, they often increase runtime and system complexity, and their effectiveness can be limited in the early-smoke regime where cues are faint, diffuse, and easily confounded—especially after high-resolution frames are resized to fixed detector inputs.
In contrast, the proposed framework targets this gap by selectively reallocating spatial resolution rather than uniformly increasing computation. It embeds a targeted high-resolution region within a standard detector input, guided by a skyline-based spatial prior, preserving fine detail in likely smoke regions while maintaining global context and single-pass inference (avoiding multi-tile or multi-scale passes). In addition, instead of adding training-time stages, it introduces a lightweight deployment-time confidence refinement applied only in a low-confidence range to stabilize borderline early-smoke predictions without changing training or standard evaluation, improving practicality for real-time, resource-constrained monitoring.

3. Methods

3.1. Overall Architecture

Figure 1 provides an overview of the proposed multi-resolution detection framework. The overall system is composed of three main components: (1) composite image generation with a sky-adaptive logic module for skyline-based cropping, (2) object detection on the composite input, and (3) a fusion stage that performs coordinate remapping, non-maximum suppression (NMS), and optionally a two-stage classifier for deployment. The following subsections describe each component in detail.

3.2. Composite Image Generation

Given an input image $I$, most modern object detectors operate on a fixed-size square input of $W_{in} \times H_{in}$ to enable efficient batching and to maintain a consistent feature-map geometry throughout the network. When the aspect ratio of $I$ differs from the required square shape, the common practice is to resize $I$ while preserving its aspect ratio and then apply letterboxing, i.e., padding the remaining area (typically with constant-valued gray or black pixels) so that the final tensor matches the $W_{in} \times H_{in}$ input constraint. These padded regions contribute negligible semantic information and primarily serve as a shape-normalization mechanism since they contain no scene content.
In this study, the otherwise uninformative padding area is repurposed as additional representational capacity. Instead of leaving this region as padding, the square canvas is filled with an auxiliary high-detail view of the same scene. This design preserves a global-context image while allocating more pixels to regions expected to contain early smoke signatures. The approach increases the effective spatial resolution available for small and visually ambiguous targets without changing the detector input size, modifying the backbone architecture, or incurring the computational overhead associated with multi-pass tiling or slicing-based inference (see Figure 2).
To construct the composite representation, the original image $I$ is first resized to a width of $W_{in}$ pixels using bilinear interpolation while preserving aspect ratio, producing an image $I_{\text{global}}$ of size $W_{in} \times H_{\text{global}}$. A second resized version $I_{\text{inter}}$ is generated from the same high-resolution source using bilinear interpolation at a larger intermediate width $S_{int}$, allowing the system to retain object-level details at a higher spatial resolution. These two resized images provide the global and high-resolution views used in the composite. The cropping window is set to match the unused vertical space after resizing, i.e., a window of size $W_{in} \times (H_{in} - H_{\text{global}})$, so that stacking it with $I_{\text{global}}$ yields a final $W_{in} \times H_{in}$ square. The vertical placement is determined using the estimated skyline position $S$, ensuring that the crop aligns with the region where smoke typically appears, and the crop is horizontally centered so that its midpoint aligns with the center of $I_{\text{inter}}$. Once the crop $I_{\text{roi}}$ is extracted, it is placed above $I_{\text{global}}$ to form the final composite image $I_{\text{comp}}$. The result is a $W_{in} \times H_{in}$ image composed of a top band containing a high-resolution regional view derived from $I_{\text{inter}}$ and a bottom band containing the global view.
During composite construction, a set of metadata parameters required for mapping detection boxes from composite coordinates back to the original image space is stored. These quantities capture the intermediate resizing scales, cropping boundaries, and any padding introduced during stacking. Let
  • $H_{\text{gap}}$ denote the crop height,
  • $(m_0, n_0)$ the horizontal and vertical crop offsets,
  • $s_{int}$ and $s_{W_{in}}$ the scaling factors used for generating the intermediate-resolution and $W_{in}$-width images,
  • $(S_{int}, H_{\text{inter}})$ the dimensions of the intermediate resized image,
  • $o_{\text{top}}$ and $o_{\text{bot}}$ the vertical offsets introduced by cropping when mapping detections back to the original coordinate space.
These metadata variables are passed to the fusion stage, where they enable precise restoration of bounding boxes to the coordinate system of the original image $I$. The overall procedure is summarized in Algorithm 1.
Algorithm 1 Composite Image Generation Procedure.
Procedure GenerateComposite($I$, $W_{in}$, $H_{in}$, $S_{int}$):
Input: $I$ (RGB image), $W_{in}$, $H_{in}$ (detector input size), $S_{int}$ (intermediate resize width).
Output: $I_{\text{comp}} \in \mathbb{R}^{W_{in} \times H_{in} \times 3}$ (composite for detector inference), $M$ (metadata for coordinate restoration).
1. Global view (base canvas for detector input):
    $I_{\text{global}} \leftarrow \text{ResizeWidth}(I, W_{in})$ (preserve aspect ratio), $I_{\text{global}} \in \mathbb{R}^{W_{in} \times H_{\text{global}} \times 3}$.
2. Intermediate view (used to select a smoke-prior ROI):
    $I_{\text{inter}} \leftarrow \text{ResizeWidth}(I, S_{int})$ (preserve aspect ratio), $I_{\text{inter}} \in \mathbb{R}^{S_{int} \times H_{\text{inter}} \times 3}$.
3. Define the ROI size to occupy the unused letterbox region:
    $H_{\text{gap}} \leftarrow H_{in} - H_{\text{global}}$, $(w_{\text{roi}}, h_{\text{roi}}) \leftarrow (W_{in}, H_{\text{gap}})$.
4. Compute ROI location (skyline-guided, with fallback):
    $S \leftarrow \text{SkylineRow}(I_{\text{inter}})$.
    if $S$ is undefined: $S \leftarrow \eta H_{\text{inter}}$ ($\eta \in (0, 1)$ is a default prior).
    $m_0 \leftarrow \left\lfloor \frac{S_{int} - w_{\text{roi}}}{2} \right\rfloor$, $m_1 \leftarrow m_0 + w_{\text{roi}}$.
    $n_0 \leftarrow \text{clip}\!\left(S - \beta h_{\text{roi}},\, 0,\, H_{\text{inter}} - h_{\text{roi}}\right)$, $n_1 \leftarrow n_0 + h_{\text{roi}}$.
    Define $\text{ROI}_{\text{inter}} \leftarrow [m_0, m_1) \times [n_0, n_1)$.
5. Crop the intermediate view:
    $I_{\text{roi}} \leftarrow \text{Crop}(I_{\text{inter}}, \text{ROI}_{\text{inter}})$, $I_{\text{roi}} \in \mathbb{R}^{W_{in} \times H_{\text{gap}} \times 3}$.
6. Construct the detector input (single $W_{in} \times H_{in}$ tensor):
    $I_{\text{comp}} \leftarrow \text{Concat}_{\text{vertical}}(I_{\text{roi}}, I_{\text{global}})$, $I_{\text{comp}} \in \mathbb{R}^{W_{in} \times H_{in} \times 3}$.
7. Return composite and restoration metadata:
    $M \leftarrow \{ W_{in}, H_{in}, S_{int}, H_{\text{global}}, H_{\text{gap}}, (m_0, n_0), \text{resize scales for } I_{\text{global}} \text{ and } I_{\text{inter}} \}$.
    return $(I_{\text{comp}}, M)$.
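For illustration, a minimal Python sketch of Algorithm 1 using OpenCV and NumPy is shown below. The helper skyline_row corresponds to the estimator in Section 3.3; the fallback prior η = 0.5, the offset factor β = 0.25, and the default sizes are illustrative assumptions rather than the exact deployed configuration.

```python
# Minimal sketch of Algorithm 1 (assumes a skyline_row() helper as in
# Section 3.3; eta, beta, and default sizes are illustrative assumptions).
import cv2
import numpy as np

def generate_composite(img, w_in=640, h_in=640, s_int=1280, eta=0.5, beta=0.25):
    h, w = img.shape[:2]
    # Step 1: global view resized to the detector width.
    h_global = round(h * w_in / w)
    img_global = cv2.resize(img, (w_in, h_global), interpolation=cv2.INTER_LINEAR)
    # Step 2: intermediate view at a larger width for the high-detail band.
    h_inter = round(h * s_int / w)
    img_inter = cv2.resize(img, (s_int, h_inter), interpolation=cv2.INTER_LINEAR)
    # Step 3: the ROI fills the vertical space letterboxing would waste.
    h_gap = h_in - h_global
    # Step 4: skyline-guided placement with a centered fallback prior.
    s = skyline_row(img_inter)            # returns None when undefined
    if s is None:
        s = int(eta * h_inter)
    m0 = (s_int - w_in) // 2              # horizontally centered crop
    n0 = int(np.clip(s - beta * h_gap, 0, h_inter - h_gap))
    # Steps 5-6: crop the intermediate view and stack above the global view.
    img_roi = img_inter[n0:n0 + h_gap, m0:m0 + w_in]
    img_comp = np.vstack([img_roi, img_global])
    # Step 7: metadata for restoring boxes to original coordinates.
    meta = {"h_gap": h_gap, "m0": m0, "n0": n0,
            "s_int": s_int / w, "s_win": w_in / w}
    return img_comp, meta
```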

3.3. Dynamic Cropping

To obtain a stable reference point for placing the high-resolution crop, the skyline is estimated by leveraging color statistics in the YCrCb color space. The purpose of this module is to avoid placing the crop window without guidance and instead use a consistent structural feature of outdoor scenes. The sky generally occupies the upper portion of the image and exhibits distinctive luminance–chrominance values; therefore, the transition between sky and terrain serves as a practical reference for positioning the higher-resolution crop. Figure 3 illustrates the intermediate and final outputs of this procedure.
To identify the skyline region, the high-resolution input image $I_{\text{inter}}$ is first downsampled by a factor $\alpha \in (0, 1)$ to reduce computational cost, producing $I_{\text{inter},\alpha}$ whose width and height are scaled by $\alpha$. The downsampled image $I_{\text{inter},\alpha}$ is then converted from RGB to YCrCb. In practice, we set $\alpha \in [0.125, 0.5]$ based on empirical validation, which substantially reduces computation while preserving accurate skyline estimation.
A binary mask is subsequently constructed by labeling pixels as “sky” according to luminance and chrominance criteria. Specifically, a pixel is marked as sky if its luminance exceeds the global mean of the Y channel and its chrominance values fall within empirically determined thresholds. The RGB-to-YCrCb conversion is calculated as:
$$Y = 0.299R + 0.587G + 0.114B, \qquad C_b = 128 - 0.168736R - 0.331264G + 0.5B, \qquad C_r = 128 + 0.5R - 0.418688G - 0.081312B,$$
where $R, G, B \in [0, 255]$.
To determine the Cr and Cb thresholds, we assume $Y \in [0, 255]$ and partition the luminance range into three intervals using four breakpoints, $Y_1 = 64$, $Y_2 = 128$, $Y_3 = 160$, and $Y_4 = 192$. Sky regions are typically associated with higher luminance; therefore, pixels with $Y < Y_1$ (i.e., $Y < 64$) are discarded and are not considered as sky candidates. At each breakpoint $Y_k$, define empirically tuned chrominance thresholds
$$\Theta^{(k)} = \left( \tau_{Cb}^{\min,(k)}, \ \tau_{Cb}^{\max,(k)}, \ \tau_{Cr}^{\min,(k)}, \ \tau_{Cr}^{\max,(k)} \right), \qquad k = 1, 2, 3, 4.$$
Each breakpoint threshold set $\Theta^{(k)}$ is initialized from dataset statistics by taking the sample mean and standard deviation of the sky-pixel chrominance distributions at luminance level $Y_k$, and setting each bound to lie within two standard deviations of the mean (i.e., $\mu \pm 2\sigma$). The final values are then selected empirically based on validation performance. For a pixel luminance $Y_{m,n}$, let $k \in \{1, 2, 3\}$ satisfy $Y_{m,n} \in [Y_k, Y_{k+1}]$, and define
$$\alpha_k(Y_{m,n}) = \frac{Y_{m,n} - Y_k}{Y_{k+1} - Y_k} \in [0, 1].$$
Then the luminance-adaptive thresholds are given by piecewise linear interpolation:
$$\tau_{Cb}^{\min}(Y_{m,n}) = (1 - \alpha_k)\,\tau_{Cb}^{\min,(k)} + \alpha_k\,\tau_{Cb}^{\min,(k+1)}, \qquad \tau_{Cb}^{\max}(Y_{m,n}) = (1 - \alpha_k)\,\tau_{Cb}^{\max,(k)} + \alpha_k\,\tau_{Cb}^{\max,(k+1)},$$
$$\tau_{Cr}^{\min}(Y_{m,n}) = (1 - \alpha_k)\,\tau_{Cr}^{\min,(k)} + \alpha_k\,\tau_{Cr}^{\min,(k+1)}, \qquad \tau_{Cr}^{\max}(Y_{m,n}) = (1 - \alpha_k)\,\tau_{Cr}^{\max,(k)} + \alpha_k\,\tau_{Cr}^{\max,(k+1)}.$$
The binary sky mask $M$ is defined as:
$$M_{m,n} = \mathbb{1}\!\left[\, Y_{m,n} > \mu_Y \ \wedge\ \tau_{Cr}^{\min}(Y_{m,n}) \le Cr_{m,n} \le \tau_{Cr}^{\max}(Y_{m,n}) \ \wedge\ \tau_{Cb}^{\min}(Y_{m,n}) \le Cb_{m,n} \le \tau_{Cb}^{\max}(Y_{m,n}) \,\right],$$
where $\mu_Y$ denotes the average luminance across the image.
The row-wise profile is formalized by letting $r \in \{0, 1, \dots, H_{\text{inter},\alpha} - 1\}$ denote the image row index (measured from the top). Let $n_{\text{sky}}(r)$ be the number of pixels labeled as sky in row $r$. The row-to-row change in sky-pixel counts is then computed as $\Delta n_{\text{sky}}(r) = n_{\text{sky}}(r+1) - n_{\text{sky}}(r)$. This row-wise sky-pixel profile $n_{\text{sky}}(r)$, together with its discrete difference $\Delta n_{\text{sky}}(r)$, forms the basis for detecting abrupt transitions used to estimate the skyline location.
To verify that this transition is meaningful, we compare the total sky-pixel mass above and below a candidate skyline row $S$ using the row-wise sky counts $n_{\text{sky}}(r)$. Specifically, we compute $S_{\text{above}} = \sum_{r < S} n_{\text{sky}}(r)$ and $S_{\text{below}} = \sum_{r > S} n_{\text{sky}}(r)$. If $S_{\text{above}} / S_{\text{below}} > \mathrm{Th}_{\text{sky}}$ for a threshold $\mathrm{Th}_{\text{sky}}$, the candidate $S$ is accepted as the skyline; otherwise, no skyline is returned. The estimated skyline position is then used to guide the vertical placement of the cropping window in the composite generation process.
Given a detected sky boundary at $S$, the crop window with height $H_{\text{gap}}$ is placed such that $n_{\text{top}} = S - H_{\text{gap}}/4$, ensuring that only a small portion of the crop window contains sky and the majority covers land or forest regions where smoke is more likely to originate. If no valid skyline is detected, the crop window is instead centered vertically within the image to provide a neutral fallback that avoids systematic bias.
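The following Python sketch combines the sky-mask construction, row-wise profile, and acceptance test described above. It reuses the breakpoint thresholds reported in Section 4.2.1; locating the skyline at the sharpest drop in the row profile is a simplifying assumption about how the abrupt transition is selected.

```python
# Illustrative skyline estimator (threshold tables from Section 4.2.1;
# the argmin-of-difference transition rule is a simplifying assumption).
import cv2
import numpy as np

Y_BP  = np.array([64, 128, 160, 192], dtype=np.float32)  # luminance breakpoints
CB_LO = np.array([115, 129, 121, 118], dtype=np.float32)
CB_HI = np.array([162, 186, 183, 168], dtype=np.float32)
CR_LO = np.array([110,  96,  93,  96], dtype=np.float32)
CR_HI = np.array([132, 124, 130, 136], dtype=np.float32)

def skyline_row(img_rgb, alpha=0.25, th_sky=5.0):
    small = cv2.resize(img_rgb, None, fx=alpha, fy=alpha)
    ycrcb = cv2.cvtColor(small, cv2.COLOR_RGB2YCrCb).astype(np.float32)
    y, cr, cb = ycrcb[..., 0], ycrcb[..., 1], ycrcb[..., 2]
    # Luminance-adaptive chrominance bounds via piecewise-linear interpolation.
    cb_lo, cb_hi = np.interp(y, Y_BP, CB_LO), np.interp(y, Y_BP, CB_HI)
    cr_lo, cr_hi = np.interp(y, Y_BP, CR_LO), np.interp(y, Y_BP, CR_HI)
    mask = ((y > y.mean()) & (y >= Y_BP[0]) &        # discard Y < 64
            (cb >= cb_lo) & (cb <= cb_hi) &
            (cr >= cr_lo) & (cr <= cr_hi))
    n_sky = mask.sum(axis=1)                         # row-wise sky-pixel counts
    s = int(np.argmin(np.diff(n_sky))) + 1           # sharpest drop in sky mass
    above = n_sky[:s].sum()
    below = max(n_sky[s:].sum(), 1)                  # avoid division by zero
    if above / below <= th_sky:
        return None                                  # no reliable skyline
    return int(s / alpha)                            # map back to full resolution
```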

3.4. Fusion: Remapping and Non-Maximum Suppression

Each detection needs to be mapped back to the coordinate system of the original image, because inference is performed on the $W_{in} \times H_{in}$ composite image. This remapping reverses the sequence of resizing, cropping, and padding operations applied during composite construction. For every predicted bounding box, the appropriate transformation branch is selected based on whether the detection originates from the top band (high-resolution crop) or the bottom band (global view). The metadata recorded during composite generation provides the offsets, crop coordinates, and scaling factors required for restoration.
Let $(\hat{m}_1, \hat{n}_1, \hat{m}_2, \hat{n}_2)$ denote the bounding-box coordinates in composite-image pixels, with $\hat{m}_1$ and $\hat{m}_2$ corresponding to the left and right column indices and $\hat{n}_1$ and $\hat{n}_2$ corresponding to the top and bottom row indices. The corresponding coordinates in the original image are computed by
$$m_k^{\text{orig}} = \frac{\hat{m}_k - o_m + c_m}{s_m}, \qquad n_k^{\text{orig}} = \frac{\hat{n}_k - o_n + c_n}{s_n}, \qquad k \in \{1, 2\},$$
where $(o_m, o_n)$ represent the padding offsets applied to the composite band, $(c_m, c_n)$ represent the crop offsets in the intermediate-resolution image, and $s_m, s_n$ denote the scaling factors used to produce the resized inputs. For detections originating from the top band (the high-resolution crop), the remapping parameters are $o_m = o_{\text{top}}$, $o_n = 0$, $c_m = m_0$, $c_n = n_0$, and $s_m = s_n = s_{int}$. For detections originating from the bottom band (the globally resized image), the corresponding parameters are $o_m = o_{\text{bot}}$, $o_n = H_{\text{gap}}$, $c_m = 0$, $c_n = 0$, and $s_m = s_n = s_{W_{in}}$. This ensures that all geometric transformations applied during composite construction are correctly reversed, allowing detections from both the high-resolution and global views to be restored to their precise locations in the coordinate frame of the original image $I$.
After remapping detections from both the top and bottom bands to the original image coordinates, duplicate bounding boxes may occur because the composite pipeline effectively simulates two views of the same scene. To consolidate these detections, standard NMS is applied with an IoU threshold $\tau_{\text{NMS}}$. Let $b_m$ denote the $m$-th bounding box, $s_m$ its confidence score, and $\mathrm{IoU}(b_m, b_n)$ the intersection-over-union between boxes $b_m$ and $b_n$. Let $N$ denote the total number of boxes. The IoU is defined as
$$\mathrm{IoU}(b_m, b_n) = \frac{|b_m \cap b_n|}{|b_m \cup b_n|}.$$
Given an IoU threshold $\tau_{\text{NMS}} \in (0, 1)$, the set of selected boxes after NMS is
$$S = \left\{\, m \in \{1, \dots, N\} \ : \ \forall n \in \{1, \dots, N\}, \ s_n > s_m \Rightarrow \mathrm{IoU}(b_m, b_n) < \tau_{\text{NMS}} \,\right\}.$$
The subset S forms the final set of detections for each image, expressed in the original coordinate frame.
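A compact Python sketch of the fusion stage is given below: restore_box inverts the composite transform using the stored metadata, and nms implements the greedy selection rule just defined. The band test via h_gap and the dictionary key names are assumptions carried over from the earlier composite-generation sketch.

```python
# Sketch of Section 3.4: coordinate restoration followed by greedy NMS.
# Metadata keys follow the composite-generation sketch (an assumption).
import numpy as np

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def restore_box(box, meta):
    """Map (x1, y1, x2, y2) from composite pixels to original-image pixels."""
    x1, y1, x2, y2 = box
    if y2 <= meta["h_gap"]:   # top band: high-resolution crop
        ox, oy = 0.0, 0.0     # o_top = 0 here since the band spans the full width
        cx, cy, s = meta["m0"], meta["n0"], meta["s_int"]
    else:                     # bottom band: global view
        ox, oy, cx, cy, s = 0.0, meta["h_gap"], 0.0, 0.0, meta["s_win"]
    f = lambda v, o, c: (v - o + c) / s
    return (f(x1, ox, cx), f(y1, oy, cy), f(x2, ox, cx), f(y2, oy, cy))

def nms(boxes, scores, tau_nms=0.5):
    """Greedy IoU-based suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = int(order[0])
        keep.append(i)
        rest = order[1:]
        ious = np.array([iou(boxes[i], boxes[j]) for j in rest])
        order = rest[ious < tau_nms]
    return keep
```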

3.5. Two-Stage Classifier

In deployment settings, a fixed confidence threshold is typically required to suppress false alarms. However, early-stage smoke plumes are typically faint, weakly textured, and visually diffuse, often occupying only a small portion of the image, which leads the detector to produce scores that cluster near the selected confidence threshold. Relying on a single cutoff therefore creates an inherent trade-off: raising the threshold improves precision but suppresses many true smoke detections, while lowering it increases recall at the cost of additional false positives. To alleviate this limitation, a lightweight two-stage classifier is introduced to refine only those detections whose confidence lies in an “uncertainty band” where YOLO predictions tend to be unreliable.
Detections with confidence below $Th_{\text{low}}$ are treated as noise and discarded, whereas detections above $Th_{\text{high}}$ are accepted without further processing. Only detections falling within the intermediate interval $[Th_{\text{low}}, Th_{\text{high}}]$ are passed to the classifier. This selective refinement follows the intuition of cascade-style detection frameworks, where ambiguous proposals receive additional evaluation, but differs in that our classifier is applied solely during deployment and does not modify detector training. For each detection requiring refinement, a square crop of size $r_{\text{cls}} \times r_{\text{cls}}$ is extracted from the original image such that the bounding-box center $(m_c, n_c)$ lies at the center of the crop. Centering ensures that the classifier receives the local smoke structure along with the immediate surrounding context, while using a fixed resolution provides stable appearance statistics regardless of the underlying smoke size. The classifier then predicts whether the crop corresponds to foreground (smoke) or background.
The final decision rule combines the detector confidence with the classifier output. Detections with $\text{conf}_{\text{det}} < Th_{\text{low}}$ are discarded outright, and detections with $\text{conf}_{\text{det}} > Th_{\text{high}}$ are accepted immediately. For detections in the intermediate band, the classifier determines whether they are retained or removed. Computational overhead is minimized by limiting this refinement to a small subset of detections, so the additional computation is incurred only sporadically rather than uniformly across frames. Consequently, the pipeline maintains real-time throughput while improving robustness in cases where the detector alone yields insufficiently reliable predictions.
$$\text{Decide} = \begin{cases} \text{DISCARD}, & \text{conf}_{\text{det}} < Th_{\text{low}}, \\ \text{KEEP}, & \text{conf}_{\text{det}} > Th_{\text{high}}, \\ \text{KEEP}, & Th_{\text{low}} \le \text{conf}_{\text{det}} \le Th_{\text{high}} \ \text{and} \ \text{cls}(d) = \text{fg}, \\ \text{DISCARD}, & \text{otherwise}. \end{cases}$$
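The decision rule and the centered crop extraction can be sketched as follows. The threshold values and the classify_fg callable (standing in for the trained crop classifier of Section 4.2.2) are illustrative placeholders, not the tuned deployment settings.

```python
# Deployment-time refinement sketch; thresholds are placeholders, and
# classify_fg stands in for the trained crop classifier (Section 4.2.2).
def centered_crop(img, box, r_cls=224):
    """Extract an r_cls x r_cls crop centered on the detection box."""
    h, w = img.shape[:2]
    xc, yc = int((box[0] + box[2]) / 2), int((box[1] + box[3]) / 2)
    x0 = min(max(xc - r_cls // 2, 0), max(w - r_cls, 0))   # border handling
    y0 = min(max(yc - r_cls // 2, 0), max(h - r_cls, 0))
    return img[y0:y0 + r_cls, x0:x0 + r_cls]

def decide(conf_det, crop, classify_fg, th_low=0.25, th_high=0.60):
    if conf_det < th_low:
        return "DISCARD"              # treated as noise
    if conf_det > th_high:
        return "KEEP"                 # accepted without refinement
    # Uncertainty band: keep only if the classifier confirms smoke.
    return "KEEP" if classify_fg(crop) else "DISCARD"
```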

4. Experimental Setup

4.1. Datasets

4.1.1. Dataset Composition

Publicly available wildfire datasets contain relatively few instances of early-stage smoke, and the majority of images depict large, fully developed fires that are not representative of early-detection conditions. Early-smoke scenes are inherently rare in real-world collections because ignition events are seldom captured precisely at onset, and smoke plumes remain visually faint for only a brief temporal window. In addition, many datasets include large numbers of near-duplicate frames, which can artificially inflate evaluation metrics and obscure true detector performance, particularly for small, low-contrast objects. Even when early-smoke images are present, the number of such samples is typically insufficient to support effective training. To address these limitations, a curated Early Smoke dataset is constructed from the FASDD-UAV and D-Fire collections. The Pyro-SDIS dataset is also employed, after deduplication, as a secondary benchmark to assess cross-dataset reliability.

4.1.2. Datasets Used

FASDD–UAV. FASDD is a large-scale flame and smoke detection dataset composed of three subsets (CV, UAV, RS) collected from terrestrial and airborne sensors such as surveillance cameras, watchtowers, and UAVs [48]. The UAV subset, FASDD–UAV, is used, which contains 25,097 UAV-captured images. Each image is annotated with bounding boxes in four formats (YOLO, PASCAL Visual Object Classes (VOC), Microsoft Common Objects in Context (COCO), and Training Data Markup Language (TDML)) for two object classes, fire and smoke, totaling 36,308 fire instances and 17,222 smoke instances. Image aspect ratios are primarily between 4:3 and 16:9. The dataset offers high-resolution aerial coverage but includes many large plumes and near-duplicate video frames; as a result, early-smoke examples are relatively scarce.
D-Fire. D-Fire is an image dataset for fire and smoke detection, comprising 21,527 images with 26,557 annotated bounding boxes, of which 11,865 are labeled as smoke and 14,692 as fire [49]. The classes are intentionally broad: smoke can appear concentrated or highly diffused, with color and intensity varying according to wind conditions and the materials being burned. Images were collected from the Internet, controlled fire simulations in the Technological Park of Belo Horizonte, surveillance cameras at the Universidade Federal de Minas Gerais (UFMG), and cameras at the Serra Verde State Park in Belo Horizonte. Additional synthetic images were created by compositing smoke patches onto green landscape backgrounds, a form of augmentation previously shown to be effective for training on synthetic smoke images. Annotations follow the YOLO format with normalized coordinates in $[0, 1]$, and the dataset spans diverse fire and smoke patterns, scenarios, camera viewpoints, and lighting conditions. As with FASDD–UAV, many scenes contain large flames; while some high-resolution early-smoke examples exist, most images depict well-developed fires, so additional filtering is necessary for early-smoke-focused experiments.
Pyro-SDIS. Pyro-SDIS is a wildfire smoke detection dataset developed with the French Fire and Rescue Services (SDIS) and volunteers from the Pyronear association [50]. Images are captured from fixed detection towers equipped with 4–5 high-resolution cameras and an on-site microcomputer, monitoring mostly forested environments in wildfire-prone areas. Smoke is typically far from the camera, small in the image, and often low-contrast under diverse lighting (from moderately dark to bright) and weather conditions (sunny, cloudy, hazy), making detection particularly challenging. The current release contains 33,636 images in total, including 28,103 images with smoke and 31,975 smoke instances, annotated in YOLO format by Pyronear volunteers. Unlike FASDD–UAV and D-Fire, Pyro-SDIS primarily captures early-stage wildfire smoke, but the raw data include many near-duplicate frames and some annotation inconsistencies, so additional preprocessing is required before use in this study.

4.1.3. Dataset Construction

Two derived datasets are constructed for the experiments. The Early Smoke dataset merges FASDD–UAV and D-Fire, then applies near-duplicate removal and smoke-focused filtering, while a separately deduplicated Pyro-SDIS split is built from the raw Pyro-SDIS release without additional smoke filtering.
Near-duplicate frames are removed to reduce metric inflation using a perceptual hashing (pHash) procedure based on low-frequency discrete cosine transform (DCT) features. Each image $I$ is converted to grayscale and resized to $32 \times 32$ pixels using bilinear interpolation, producing a normalized representation $I'$ that suppresses high-frequency variation. A 2D DCT is applied to $I'$, and only the top-left $8 \times 8$ coefficients are retained because they capture the dominant low-frequency structure. Let $D = \{ C(u, v) \mid 0 \le u, v < 8 \}$ denote this block. Then, the median value $m$ of all coefficients in $D$ is computed, and a 64-bit binary hash is generated by setting
$$h(u, v) = \begin{cases} 1, & C(u, v) > m, \\ 0, & C(u, v) \le m, \end{cases}$$
followed by flattening the result in row-major order. Similarity between two images is evaluated via the Hamming distance between their hashes:
$$d(H_1, H_2) = \sum_{k=1}^{64} \mathbb{1}\!\left[ H_1^{(k)} \ne H_2^{(k)} \right].$$
Images with $d(H_1, H_2) \le 3$ are considered duplicates in the Early Smoke dataset, and $d(H_1, H_2) \le 10$ flags duplicates within Pyro-SDIS. These empirically chosen thresholds, validated through extensive visual inspection, reliably remove near-identical frames, particularly those from consecutive UAV captures or repeated monitoring scenes, without discarding distinct examples.
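A short Python sketch of this pHash procedure is shown below, using OpenCV's DCT. It follows the steps above, with the median taken over the full 8 × 8 block as described; BGR input ordering is an assumption about the loading pipeline.

```python
# Sketch of the pHash deduplication described above (OpenCV DCT).
import cv2
import numpy as np

def phash(img_bgr):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32), interpolation=cv2.INTER_LINEAR)
    dct = cv2.dct(np.float32(small))
    block = dct[:8, :8]                           # dominant low-frequency structure
    return (block > np.median(block)).flatten()   # 64-bit binary hash

def hamming(h1, h2):
    return int(np.count_nonzero(h1 != h2))

# Example: flag near-duplicates with the Early Smoke threshold (d <= 3).
# is_duplicate = hamming(phash(img_a), phash(img_b)) <= 3
```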
After deduplication, additional filtering is applied to the Early Smoke dataset to isolate early-smoke instances. This filtering procedure is intended to remove large-fire scenes while retaining images in which smoke is faint, spatially limited, and indicative of early ignition. Pyro-SDIS is already curated for early-smoke detection and does not include the large-fire imagery commonly observed in FASDD–UAV and D-Fire; therefore, no additional smoke-specific filtering is applied to that split.
First, the annotated smoke region is required to occupy less than 2% of the total image area. Visual inspection of FASDD–UAV and D-Fire suggests that genuine early-smoke plumes typically occupy well under 1%; however, a 2% threshold is adopted to preserve coverage while excluding larger fire events. Second, each image is required to contain exactly one bounding box, consistent with the single ignition source expected in early-stage wildfires; images with multiple boxes are frequently associated with larger or fragmented fire scenes. Only images annotated with the smoke bounding-box label are retained, as visible flame is uncommon during the earliest ignition stages. The dataset is further restricted to high-resolution images with a longer side exceeding 1080 px, ensuring that multi-resolution composite generation can be performed without sacrificing spatial detail. These filtering steps yield a tightly curated Early Smoke set that reflects genuine early-smoke conditions and forms the foundation for evaluating fine-grained smoke detection.
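These rules amount to a simple per-image predicate; a sketch is shown below, assuming YOLO-format labels (class, x-center, y-center, width, height, all normalized) and a hypothetical smoke class index.

```python
# Illustrative Early Smoke filtering predicate; the smoke class index
# and label tuple layout are assumptions, not the curated pipeline itself.
def keep_for_early_smoke(labels, img_w, img_h, smoke_cls=1):
    if len(labels) != 1:              # exactly one ignition source expected
        return False
    if max(img_w, img_h) <= 1080:     # require high-resolution imagery
        return False
    cls, _, _, w, h = labels[0]
    # Normalized YOLO width/height, so w * h is the image-area fraction.
    return cls == smoke_cls and (w * h) < 0.02
```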

4.1.4. Dataset Description and Statistics

The resulting datasets after the construction steps are summarized in Section 4.1. The filtered Early Smoke subset drawn from FASDD–UAV and D-Fire contains 936 images in which smoke is small (<2% of image area), faint, and set against varied backgrounds. The deduplicated Pyro-SDIS subset contains 3523 unique images captured by tower-mounted multi-camera systems in forested settings, where smoke typically appears distant and low-contrast under diverse lighting and weather conditions; all instances are annotated in YOLO format by Pyronear volunteers. Accordingly, two datasets are constructed:
1. Early Smoke Dataset: 936 carefully filtered early-smoke images drawn from FASDD–UAV and D-Fire, emphasizing small, faint, low-contrast plumes with <2% image area under varying illumination and high background complexity.
2. Pyro-SDIS Dataset: 3523 deduplicated images from tower-mounted camera systems in predominantly forested scenes, containing far-distant, low-contrast smoke under diverse lighting and weather conditions, with manually curated YOLO annotations by Pyronear volunteers.
The two datasets serve complementary roles in our evaluation. The Early Smoke Dataset provides clean, controlled early-smoke instances tailored for training and detailed analysis of model behavior on canonical early-stage plumes. The Pyro-SDIS subset introduces a meaningful distribution shift in terms of hardware, geography, appearance, and atmospheric conditions, making it well suited for testing robustness and generalization under realistic deployment scenarios. Qualitative examples for each dataset are illustrated in Figure 4 and Figure 5, respectively.

4.2. Training for Early Smoke Detection

All experiments are conducted on a single workstation; Table 1 summarizes the hardware and software configuration used consistently across detector and classifier training to ensure reproducibility.
The training pipeline proceeds in two stages. First, the composite-based detector is trained on both datasets using identical train/validation/test splits and the same composite-generation procedure used at inference to align distributions. Second, a post-hoc binary classifier is trained on detector proposals cropped from the original-resolution images to reduce false positives while preserving recall on faint, early-stage smoke.

4.2.1. Detector Training

The approach is evaluated using two curated datasets that focus on small, early-smoke instances. The first is the Early Smoke Dataset, consisting of 936 images obtained by filtering FASDD–UAV and D-Fire as described in Section 4.1. The second is the deduplicated Pyro-SDIS dataset, containing 3523 unique images.
All datasets are split at the image level into training, validation, and test sets using a 70/10/20 ratio. Test images are held out strictly for final evaluation and are never used for hyperparameter tuning; all model selection is performed using validation performance. Ground-truth annotations follow the standard YOLO format, consisting of class ID, box center, and box dimensions, all normalized to $[0, 1]$. During training and validation, each image is converted into a $W_{in} \times H_{in}$ composite using the procedure described in Section 3. The composite dimension is set to $W_{in} = H_{in} = 640$, selected empirically and chosen to match the widely used 640 × 640 detector input size in related object-detection pipelines.
Training is performed directly on composite images because inference in the deployed system is also conducted on composites. This alignment reduces distribution mismatch between training and deployment and avoids additional covariate shift introduced by composite generation. Test images, on the other hand, remain at their original resolution and are converted to composite form only at inference time. When creating a composite image for training and validation, the high-resolution crop is placed based on the ground-truth bounding box of the smoke instance, so that the smoke region is fully contained within the crop window. Each annotated plume therefore appears in both the global band and the high-resolution band of the composite, allowing the model to learn from as many instances as possible. Under a fixed skyline-based inference policy, this strategy yields superior performance compared to training with skyline-based crop placement, which may not always capture the smoke region effectively during training. Table 2 presents an ablation study comparing these two training crop placement strategies under the same skyline-based inference setting.
To improve robustness to positional variation, three composite variants are generated for every training and validation image by shifting the crop horizontally so that the smoke region appears near the left (≈25% of width), center (≈50%), or right (≈75%). This augmentation reflects deployment conditions, where the sky-adaptive logic may place the crop at different horizontal locations depending on the skyline estimate. In addition to positional augmentation, the default Ultralytics YOLO augmentations are applied to the training split only.
YOLOv8 was selected as the baseline detector because its performance is effectively on par with YOLOv9 in this evaluation (only marginal differences in AP and best-F1), while offering greater operational maturity for real-world deployment. In safety-critical wildfire detection, the cost of unexpected behavior in the field, such as instability across environments, brittle exports, inconsistent runtime characteristics, or unanticipated integration issues, can outweigh small metric gains. Compared to newer variants, YOLOv8 benefits from broader ecosystem adoption and longer exposure in production-like pipelines, which typically translates to better-tested tooling, more deployment references, and lower integration risk [51]. Given the negligible accuracy gap observed here, prioritizing deployment maturity and robustness is the more conservative choice for reliable end-to-end operation. Table 3 presents an ablation study comparing performance across YOLO variants under a consistent evaluation protocol.
For all experiments, the detector is trained for 100 epochs with a batch size of 16 and an initial learning rate of $10^{-2}$, initialized from pretrained weights. The Early Smoke Dataset is optimized using AdamW, whereas the Pyro-SDIS dataset is optimized using stochastic gradient descent (SGD), reflecting the configurations that yielded the most stable convergence in each setting. These hyperparameters follow standard YOLO training practice and support reliable optimization across both small, low-contrast early-smoke instances and far-distance smoke scenes present in Pyro-SDIS.
During inference, the YCrCb thresholds for skyline estimation (Section 3.3) are fixed to empirically tuned values that stabilize skyline detection on the training images. Inputs are assumed to be RGB images. The sky mask retains pixels whose Cb and Cr values fall within luminance-adaptive ranges, where the lower and upper thresholds are obtained by linear interpolation across four luminance breakpoints, $Y \in \{64, 128, 160, 192\}$. These breakpoint ranges were determined empirically by inspecting the Cb/Cr distributions of pixels corresponding to sky regions in the images used in our datasets. Specifically, the breakpoint values are set to $Cb \in [115, 162]$ and $Cr \in [110, 132]$ at $Y = 64$, $Cb \in [129, 186]$ and $Cr \in [96, 124]$ at $Y = 128$, $Cb \in [121, 183]$ and $Cr \in [93, 130]$ at $Y = 160$, and $Cb \in [118, 168]$ and $Cr \in [96, 136]$ at $Y = 192$. For a pixel with luminance $Y$ between adjacent breakpoints, each of $\tau_{Cb}^{\min}(Y)$, $\tau_{Cb}^{\max}(Y)$, $\tau_{Cr}^{\min}(Y)$, and $\tau_{Cr}^{\max}(Y)$ is computed via piecewise linear interpolation between the corresponding endpoint values, and the pixel is labeled as sky if $\tau_{Cb}^{\min}(Y) \le Cb \le \tau_{Cb}^{\max}(Y)$ and $\tau_{Cr}^{\min}(Y) \le Cr \le \tau_{Cr}^{\max}(Y)$. A visualization of these thresholds is shown in Figure 6. The acceptance-ratio threshold is set to $\mathrm{Th}_{\text{sky}} = 5.0$, and the skyline-estimation downsampling factor is set to $\alpha = 0.25$. These settings remain constant across all reported experiments.
When applied to the Early Smoke dataset, the skyline estimation procedure accurately identifies the presence of sky in 75% of the images. In cases where the skyline is correctly detected, the resulting crop placement effectively captures the smoke region within the high-resolution band of the composite, leading to improved detection performance. When the skyline is not detected, the fallback strategy of centering the crop vertically still allows for reasonable coverage of the smoke region, albeit with slightly reduced performance compared to successful skyline detection.

4.2.2. Training for Early Smoke Classification

To refine detector outputs and suppress low-confidence false positives, a binary classifier is trained on localized image crops derived from detector predictions. The classifier operates on 224 × 224 patches extracted from the original images and predicts whether each patch contains wildfire smoke.
The classifier is trained on the training splits of both the Early Smoke and Pyro-SDIS datasets. For each image, the same composite-generation pipeline used during detector inference is applied to maintain consistency between detector and classifier inputs. After the composite-based detector is trained, it is run on the training portions of both datasets and the resulting predicted bounding boxes are collected. These detector-generated proposals are then converted into 224 × 224 crops extracted from the original-resolution images and centered on each predicted region. By using detector proposals rather than ground-truth boxes, the classifier is trained on the same type of inputs encountered during deployment. This process yields a diverse set of localized patches containing both true smoke plumes and detector-induced false positives.
Each crop is assigned a label based on its IoU with the ground-truth smoke annotations. Crops with IoU > 0.5 are labeled as foreground (true smoke), whereas crops with IoU < $10^{-6}$ are labeled as background (false positives). Crops with intermediate IoU values are discarded to mitigate ambiguity in supervision. After labeling, all crops are randomly partitioned into a 70/10/20 train/validation/test split, with splitting performed after crop generation to prevent leakage across splits. All classifier crops are extracted directly from the original-resolution images rather than from the composite representation. Cropping windows are centered on the detector's predicted bounding boxes, with appropriate boundary handling applied when detections occur near image borders. The resulting crop dataset is class-imbalanced, as detector-generated proposals contain substantially more background patches than true smoke instances. To mitigate this imbalance, the foreground samples are augmented until the numbers of positive and negative crops are approximately matched. The augmentation pipeline applies horizontal flips with probability 0.5 and introduces mild geometric perturbations, including small translations, slight scale variations, and rotations up to 15°. These transformations increase sample diversity while preserving the visual characteristics of early-stage smoke. Augmentation is applied only to the training split to keep the validation and test distributions unchanged.
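The labeling rule above can be summarized in a few lines; the sketch below assumes an iou helper such as the one sketched in Section 3.4.

```python
# Sketch of proposal labeling for classifier training (IoU rule above).
def label_crop(pred_box, gt_boxes, iou_fn):
    best = max((iou_fn(pred_box, g) for g in gt_boxes), default=0.0)
    if best > 0.5:
        return 1        # foreground: true smoke
    if best < 1e-6:
        return 0        # background: detector false positive
    return None         # ambiguous IoU: discarded from supervision
```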
For all classifier experiments, a YOLOv8-s classification model is trained, initialized from the pretrained yolov8s-cls.pt weights for 100 epochs with a batch size of 32 and an initial learning rate of $10^{-2}$. The model is optimized using the AdamW optimizer, which provided stable convergence on the balanced foreground–background crop dataset derived from detector proposals. These hyperparameters follow standard YOLO classification practice and are sufficient to train a robust classifier capable of discriminating true smoke plumes from detector false positives.

4.3. Evaluation Metrics

Detection performance is evaluated using standard object detection metrics. Precision measures the fraction of predicted detections that are correct, while recall measures the fraction of ground-truth objects that are successfully detected. Both are defined below, where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively. The F1 score summarizes precision and recall into a single harmonic-mean metric.
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Average Precision (AP) quantifies detection performance by integrating the precision–recall curve at a specified IoU threshold. Experiments report AP at fixed thresholds (e.g., AP @0.5) as well as the COCO-style AP averaged over IoU thresholds from 0.50 to 0.95 in increments of 0.05. mAP is computed as the mean of AP values across the specified IoU thresholds and is used as the primary detection metric.
In addition, size-specific COCO metrics are also reported, including APsmall, APmedium, and APlarge. These categories correspond to objects with areas smaller than $32^2$, between $32^2$ and $96^2$, and larger than $96^2$ pixels, respectively. These metrics provide insight into detection performance across different object scales.
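As a quick sanity check of these definitions, the following snippet computes precision, recall, and F1 from raw counts; the example numbers are hypothetical.

```python
# Precision/recall/F1 from raw counts, matching the definitions above.
def prf1(tp, fp, fn):
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# Hypothetical example: 85 correct detections, 15 false alarms, 20 misses.
print(prf1(85, 15, 20))   # -> (0.85, 0.8095..., 0.8293...)
```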

5. Results and Discussion

For completeness, performance is evaluated on the Early Smoke and Pyro-SDIS datasets using (i) standard scientific metrics and (ii) a deployment setting with a fixed confidence threshold and uncertainty-band refinement. The baseline is a YOLOv8s detector trained on 640 × 640 letterboxed inputs without composite processing.

5.1. Overall Detection Performance

As shown in Table 4, our pipeline improves performance across all AP metrics, with the most notable gains in AP @0.5 and AP_small on the Early Smoke dataset. The proposed method increases AP @0.5 from 0.818 to 0.852, a gain of 3.4 percentage points. The enhancement is most pronounced for small objects: AP_small increases from 0.054 to 0.212, nearly a fourfold improvement in the smallest-object regime. These improvements are reflected even more clearly in the curves: as shown in Figure 7, the recall curves indicate relative gains of up to 60% in the high-confidence region (0.7–0.9), yielding an overall increase of approximately 20% in average recall. These recall gains are accompanied by an expected reduction in precision of approximately 10% over the same confidence range. This trade-off is acceptable given the substantially larger recall improvement and the central importance of recall in early-smoke detection. Additional evidence is provided by the ΔF1 curve, which indicates a net F1 increase of approximately 2%.
Beyond the baseline comparison, evaluations against other mainstream detector architectures indicate that the proposed pipeline consistently outperforms competing alternatives. As summarized in Table 5, the proposed method achieves the highest AP @0.5:0.95, AP @0.5, and AP @0.75. It also attains the best performance across all size-specific AP metrics. YOLOv5su exhibits strong AP @0.5 performance (0.839) but yields a comparatively lower AP_small. By contrast, YOLOv10s attains AP_small comparable to the proposed method (0.210) but underperforms on AP @0.5. The proposed method delivers balanced and superior results across metrics, supporting the effectiveness of composite-based training for early-stage wildfire smoke detection.
To complement the quantitative results, Figure 8 and Figure 9 present qualitative examples from the Early Smoke dataset. These examples illustrate the characteristic difficulty of the task. Smoke plumes are extremely small, visually faint, and frequently embedded in complex backgrounds. In many cases, baseline detectors either miss the plume entirely or produce low-confidence predictions. In contrast, the composite-trained model yields more consistent detections with higher confidence. This effect is most evident in scenes where the smoke occupies only a few dozen pixels. These examples provide visual evidence for the improvements reported in Table 4.
Collectively, these results demonstrate that the proposed pipeline effectively enhances detection performance on small smoke objects that are critical for early wildfire detection. By leveraging multi-resolution compositing during training, the detector learns to better capture fine-grained smoke features while retaining global context, leading to substantial improvements in both recall and AP metrics, particularly for the smallest smoke plumes that are most challenging to detect. This indicates that the pipeline is especially effective under the difficult visual conditions of early-smoke detection, where standard single-resolution models struggle to preserve sufficient fine-scale structure.

5.2. Deployment-Only Evaluation (Two-Stage Classifier)

In deployment settings, the detector operates under a fixed confidence threshold that balances false alarms and missed detections. Early-smoke predictions often lie near the decision boundary, where confidence values fluctuate and standard thresholding becomes unstable. To address this issue, a two-stage classifier is introduced to refine only detections whose confidence falls within a predefined uncertainty band. This refinement is applied exclusively at deployment time and does not affect the detector’s scientific evaluation reported in the previous section.

Confidence Threshold Selection

Figure 10 shows the confidence–recall and confidence–precision curves computed for the Early Smoke dataset. The detector exhibits a sharp drop in recall only below a confidence of approximately 0.2, making Th_conf = 0.2 a suitable operating threshold for recall-oriented deployment. The region to which the classifier is applied must be kept small, since the refinement stage introduces nontrivial computational overhead. Moreover, precision is relatively unstable in the low-confidence range because smoke plumes are often extremely small and visually faint. Accordingly, the uncertainty band for classifier refinement is selected by analyzing the confidence-score distribution of true positive detections near 0.2. Empirically, a large proportion of true detections fall within the 0.1–0.3 interval, reinforcing this range as the appropriate target for refinement. Hence, we set Th_low = 0.1 and Th_high = 0.3 as the uncertainty-band thresholds. These values balance recall preservation with minimal disruption to high-confidence detector outputs, while still enabling the classifier to correct uncertain cases.
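The resulting deployment rule can be summarized in a few lines. This is a sketch under our reading of the pipeline: `classifier` and `crop_around` stand in for the second-stage model and the original-resolution cropping step, and the 0.5 verification cutoff is an assumed value, not one reported by the paper.

```python
TH_CONF, TH_LOW, TH_HIGH = 0.2, 0.1, 0.3

def refine_detections(detections, image, classifier, crop_around):
    """Keep confident detections; arbitrate borderline ones with the classifier.

    detections  : list of (box, conf) pairs from the detector
    classifier  : callable returning P(smoke) for an image crop
    crop_around : callable extracting an original-resolution crop for a box
    """
    kept = []
    for box, conf in detections:
        if conf >= TH_HIGH:
            kept.append((box, conf))            # trusted: keep as-is
        elif conf < TH_LOW:
            continue                            # too weak: discard outright
        else:
            # uncertainty band: re-evaluate the crop at full resolution
            if classifier(crop_around(image, box)) >= 0.5:
                kept.append((box, max(conf, TH_CONF)))  # promote past Th_conf
    return [(b, c) for b, c in kept if c >= TH_CONF]
```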
Table 6 reports deployment-setting performance on the Early Smoke dataset under a fixed confidence threshold of Th_conf = 0.2. The multiresolution detector substantially outperforms the baseline YOLOv8 across all metrics, including AP @0.5:0.95, AP @0.75, and AP_small, establishing a stronger operating point under fixed-threshold filtering. When the two-stage classifier is applied to detections within the 0.1–0.3 uncertainty band, additional gains are observed. These gains are driven by more reliable handling of borderline predictions, which are most sensitive to hard-threshold decisions.
Notably, this selective refinement recovers true positives that would otherwise be discarded when thresholding on a single detector score. In operational settings, detections near Th_conf often include visually ambiguous but correct early-smoke instances; the classifier provides an additional verification signal that can promote these candidates above the decision threshold while suppressing low-confidence false positives. As a result, the two-stage system improves overall AP and especially strengthens small-object performance without requiring a lower global threshold, aligning it with deployment constraints where fixed-threshold decisions are necessary. Figure 11 and Figure 12 present qualitative examples comparing baseline detections against the full pipeline with multiresolution detection and two-stage classification.

5.3. Ablation Study

To quantify the contribution of each module and to characterize the accuracy–efficiency trade-off of the complete system, we conduct an ablation study in which the multiresolution detector and the two-stage classifier are enabled individually and in combination. This analysis attributes performance gains to specific design choices and identifies the primary sources of additional latency under full deployment. In particular, because the second-stage classifier introduces extra computation, the ablation results provide empirical justification for a selective, event-driven invocation strategy in which the classifier is applied only to ambiguous detector outputs rather than uniformly across all frames, limiting overhead during background-dominated surveillance. Table 7 reports the individual and combined effects of the multiresolution detector and the two-stage classifier on the Early Smoke dataset. The baseline YOLOv8 model performs very poorly on small objects (AP_small = 0.037), reflecting the extreme difficulty of detecting early-stage smoke plumes that occupy only a few pixels and are barely visible.
Introducing the two-stage classifier alone yields only marginal changes (slight gains in AP @0.5 and AP_medium) and does not lift AP_small, confirming that confidence refinement cannot recover missed tiny plumes without better spatial evidence. In contrast, enabling the multiresolution representation produces the dominant improvement: AP_small increases from 0.037 to 0.188, and higher-IoU scores (AP @0.75) also rise, indicating sharper localization as well as recall gains. Combining multiresolution detection with the two-stage classifier yields the best overall performance (AP @0.5:0.95 = 0.410), with the classifier acting as a lightweight stabilizer in the 0.1–0.3 band rather than as a primary source of recall. These results underline that the high-resolution composite is the key driver of faint-smoke sensitivity, while the selective second stage mainly curbs borderline false alarms without suppressing true positives.
In addition to accuracy, Table 8 reports end-to-end inference time for each configuration. While the full pipeline has a higher worst-case latency due to the additional multiresolution processing and the conditional second-stage classifier, this cost is not incurred uniformly across frames in practical surveillance settings. The classifier is applied only to detections within a narrow confidence band (e.g., 0.1–0.3), which occur infrequently given the rarity of early smoke events and the predominance of background-only frames. For example, over a stream of 10,000 frames, if the classifier is triggered on only 10% of frames, the amortized per-frame runtime is approximately 9.5 ms, compared to 16.3 ms in the worst case. As a result, the most expensive component of the pipeline is triggered sparsely and in an event-driven manner, yielding an expected (amortized) runtime that remains close to that of the single-stage detector over long video streams. This design allows the system to prioritize sensitivity and robustness when ambiguous signals arise, while preserving efficient operation during normal background monitoring.
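The amortized figure quoted above follows from a simple expectation over the trigger rate. In the sketch below, the 8.7 ms single-stage cost is our back-solved assumption (it is implied by the 9.5 ms and 16.3 ms figures quoted above but is not reported directly), so the numbers should be read as placeholders rather than measurements.

```python
def amortized_latency_ms(t_single_ms, t_full_ms, trigger_rate):
    """Expected per-frame latency when the second stage fires on a
    fraction `trigger_rate` of frames (event-driven invocation)."""
    return (1.0 - trigger_rate) * t_single_ms + trigger_rate * t_full_ms

# 0.9 * 8.7 + 0.1 * 16.3 = 9.46 ms, matching the ~9.5 ms estimate above.
print(amortized_latency_ms(t_single_ms=8.7, t_full_ms=16.3, trigger_rate=0.1))
```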

5.4. Generalizability Verification

To assess generalizability, the proposed pipeline is evaluated on Pyro-SDIS under an identical operating threshold and refinement band. Table 9 shows that the proposed model achieves the highest AP @0.5:0.95, AP @0.5, and AP @0.75 among mainstream detectors, indicating that the multiresolution composite and skyline-guided placement transfer beyond the primary dataset. Gains on small and medium objects suggest that the same mechanisms that enhance faint-plume sensitivity on Early Smoke also improve performance on Pyro-SDIS, while the remaining size-specific AP values stay competitive. This pattern indicates robustness to distribution shifts in background, scene layout, and smoke appearance without dataset-specific tuning.
Table 10 reports deployment-setting performance on Pyro-SDIS using the same operating threshold (Th_conf = 0.2) and the same 0.1–0.3 refinement band. The multiresolution detector improves AP @0.5:0.95, AP @0.5, and AP_small over the baseline, and the selective two-stage pass adds incremental gains, reaching AP @0.5:0.95 = 0.416, AP @0.5 = 0.703, and AP_small = 0.394. The progression mirrors the primary dataset: multiresolution delivers most of the lift, and the classifier offers a modest stability boost, indicating that the deployment recipe transfers without dataset-specific tuning. Table 11 reports the corresponding runtime performance.

Failure Analysis

While the proposed pipeline improves detection performance, several failure modes persist. The most common failures occur when smoke plumes are extremely faint, very small, or low-contrast, causing them to blend into complex backgrounds and leading to missed detections even with multiresolution input. Errors also arise when skyline estimation becomes unreliable under challenging conditions—such as cloud occlusion, strong lighting variation, or foreground occlusions—resulting in suboptimal crop placement that excludes critical plume details. In addition, the two-stage classifier can occasionally reject true positives when the smoke appearance is highly ambiguous, and false positives may occur when clouds or other scene elements exhibit smoke-like color and texture, particularly under overcast skies. Finally, ambiguous plume boundaries can produce inconsistent annotations, which complicates evaluation and can amplify apparent errors even when predictions are visually plausible. Figure 13 illustrates several representative failure cases, highlighting the ongoing challenges in accurately detecting early-stage wildfire smoke under diverse and difficult visual conditions.

6. Conclusions

The proposed multi-resolution pipeline addresses the central challenge of detecting faint, small-scale wildfire smoke. By embedding a high-resolution crop within a standard square input and aligning it with skyline-guided placement, the detector recovers fine-grained plume structure while retaining global context. Our approach outperforms the baseline on the Early Smoke dataset, improving AP @0.5:0.95 by +4.6 percentage points, AP @0.5 by +3.4 percentage points, precision by +2.9 percentage points, recall by +5.3 percentage points, and F1-score by +4.3 percentage points. Ablation results show that this composite representation is the primary driver of the substantial AP gains on early-smoke instances, especially in AP_small, and that these benefits carry over to a secondary dataset without dataset-specific tuning. The selective two-stage classifier further stabilizes deployment behavior by refining detections only within a narrow uncertainty band, reducing false alarms near the operating threshold without altering scientific evaluation. The pipeline delivers measurable improvements in recall and localization for early smoke while keeping latency within practical bounds for field deployment. The combination of single-pass multiresolution detection, reversible coordinate remapping, and lightweight selective refinement provides a deployable balance between accuracy and efficiency for real-time wildfire monitoring.
However, several limitations remain. First, skyline estimation errors can lead to suboptimal placement of the high-resolution crop, potentially excluding critical plume regions. Because the skyline is derived using a simple edge-based heuristic, it is sensitive to cloud cover, illumination changes, and foreground occlusions. When the estimated skyline is inaccurate, the resulting crop can be misaligned, which may reduce detection performance. Second, although less frequent, smoke plumes may appear near the extreme lateral boundaries of the image and fall outside the high-resolution crop even when the skyline prior behaves as intended. While the global bottom-band view mitigates complete misses in such cases, the proposed pipeline provides limited benefit there because the plume is observed only in the lower-resolution global context.
Future work will focus on improving the robustness and coverage of the multiresolution cropping strategy. A primary direction is to replace the edge-based skyline estimator with a more reliable, learning-based module, such as a lightweight segmentation network that is resilient to extreme conditions and can provide an uncertainty estimate to guide crop placement. In addition, we plan to mitigate boundary cases by adopting adaptive cropping policies, such as multiple high-resolution crops, dynamic crop widths, or lightweight tiling near image extremes, while minimizing computational cost so that salient plume regions are more consistently captured at higher resolution.

Author Contributions

Conceptualization, G.J. and B.M.; methodology, G.J. and B.M.; software, G.J.; validation, G.J.; formal analysis, G.J.; investigation, G.J.; resources, B.M.; data curation, G.J.; writing—original draft preparation, G.J.; writing—review and editing, G.J., T.-H.A. and B.M.; visualization, G.J.; supervision, B.M.; project administration, B.M. and T.-H.A.; funding acquisition, B.M. and T.-H.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Regional Innovation System & Education (RISE) through the Seoul RISE Center, funded by the Ministry of Education (MOE) and the Seoul Metropolitan Government (2025-RISE-01-019-04). In addition, Tae-Hyuk Ahn was supported by the National Science Foundation (NSF) under Grant No. 2430236.

Data Availability Statement

The curated Early Smoke dataset, the deduplicated Pyro-SDIS dataset, and the accompanying code will be released publicly at: https://github.com/vision-ai-lab/fire-detection (accessed on 9 January 2026).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Arteaga, B.; Diaz, M.; Jojoa, M. Deep learning applied to forest fire detection. In Proceedings of the 2020 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT); IEEE: New York, NY, USA, 2020; pp. 1–6.
2. Sun, J.; Qi, W.; Huang, Y.; Xu, C.; Yang, W. Facing the Wildfire Spread Risk Challenge: Where Are We Now and Where Are We Going? Fire 2023, 6, 228.
3. Vasconcelos, R.N.; Franca Rocha, W.J.; Costa, D.P.; Duverger, S.G.; Santana, M.M.d.; Cambui, E.C.; Ferreira-Ferreira, J.; Oliveira, M.; Barbosa, L.d.S.; Cordeiro, C.L. Fire detection with deep learning: A comprehensive review. Land 2024, 13, 1696.
4. Zhao, Y.; Ma, J.; Li, X.; Zhang, J. Saliency detection and deep learning-based wildfire identification in UAV imagery. Sensors 2018, 18, 712.
5. Wu, H.; Li, H.; Shamsoshoara, A.; Razi, A.; Afghah, F. Transfer learning for wildfire identification in UAV imagery. In Proceedings of the 2020 54th Annual Conference on Information Sciences and Systems (CISS); IEEE: New York, NY, USA, 2020; pp. 1–6.
6. Zhao, L.; Zhi, L.; Zhao, C.; Zheng, W. Fire-YOLO: A small target object detection method for fire inspection. Sustainability 2022, 14, 4930.
7. Ahn, Y.; Choi, H.; Kim, B.S. Development of early fire detection model for buildings using computer vision-based CCTV. J. Build. Eng. 2023, 65, 105647.
8. Huang, P.; Chen, M.; Chen, K.; Zhang, H.; Yu, L.; Liu, C. A combined real-time intelligent fire detection and forecasting approach through cameras based on computer vision method. Process Saf. Environ. Prot. 2022, 164, 629–638.
9. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision; IEEE: New York, NY, USA, 2015; pp. 1440–1448.
10. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
11. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
12. Lv, C.; Zhou, H.; Chen, Y.; Fan, D.; Di, F. A lightweight fire detection algorithm for small targets based on YOLOv5s. Sci. Rep. 2024, 14, 14104.
13. Xiao, Z.; Wan, F.; Lei, G.; Xiong, Y.; Xu, L.; Ye, Z.; Liu, W.; Zhou, W.; Xu, C. FL-YOLOv7: A lightweight small object detection algorithm in forest fire detection. Forests 2023, 14, 1812.
14. Zhang, L.; Wang, M.; Ding, Y.; Bu, X. MS-FRCNN: A multi-scale faster RCNN model for small target forest fire detection. Forests 2023, 14, 616.
15. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793.
16. Boroujeni, S.P.H.; Razi, A.; Khoshdel, S.; Afghah, F.; Coen, J.L.; O’Neill, L.; Fule, P.; Watts, A.; Kokolakis, N.M.T.; Vamvoudakis, K.G. A comprehensive survey of research towards AI-enabled unmanned aerial systems in pre-, active-, and post-wildfire management. Inf. Fusion 2024, 108, 102369.
17. Andrianarivony, H.S.; Akhloufi, M.A. Machine learning and deep learning for wildfire spread prediction: A review. Fire 2024, 7, 482.
18. Peng, Y.; Wang, Y. Automatic wildfire monitoring system based on deep learning. Eur. J. Remote Sens. 2022, 55, 551–567.
19. Afghah, F. Autonomous Unmanned Aerial Vehicle Systems in Wildfire Detection and Management-Challenges and Opportunities. In Proceedings of the Dynamic Data Driven Application Systems; Springer Nature: Cham, Switzerland, 2022.
20. Lelis, C.A.S.; Roncal, J.J.; Silveira, L.; De Aquino, R.D.G.; Marcondes, C.A.C.; Marques, J.; Loubach, D.S.; Verri, F.A.N.; Curtis, V.V.; De Souza, D.G. Drone-Based AI System for Wildfire Monitoring and Risk Prediction. IEEE Access 2024, 12, 139865–139882.
21. Bailon-Ruiz, R.; Bit-Monnot, A.; Lacroix, S. Real-time wildfire monitoring with a fleet of UAVs. Robot. Auton. Syst. 2022, 152, 104071.
22. Phillips, W., III; Shah, M.; da Vitoria Lobo, N. Flame recognition in video. Pattern Recognit. Lett. 2002, 23, 319–327.
23. Gubbi, J.; Marusic, S.; Palaniswami, M. Smoke detection in video using wavelets and support vector machines. Fire Saf. J. 2009, 44, 1110–1115.
24. Mambile, C.; Kaijage, S.; Leo, J. Application of Deep Learning in Forest Fire Prediction: A Systematic Review. IEEE Access 2024, 12, 190554–190581.
25. Ghali, R.; Akhloufi, M.A. Deep learning approaches for wildland fires remote sensing: Classification, detection, and segmentation. Remote Sens. 2023, 15, 1821.
26. Akagic, A.; Buza, E. LW-FIRE: A lightweight wildfire image classification with a deep convolutional neural network. Appl. Sci. 2022, 12, 2646.
27. Seydi, S.T.; Saeidi, V.; Kalantar, B.; Ueda, N.; Halin, A.A. Fire-Net: A Deep Learning Framework for Active Forest Fire Detection. J. Sens. 2022, 2022, 8044390.
28. Ramos, L.; Casas, E.; Bendek, E.; Romero, C.; Rivas-Echeverría, F. Computer vision for wildfire detection: A critical brief review. Multimed. Tools Appl. 2024, 83, 83427–83470.
29. Li, J.; Tang, H.; Li, X.; Dou, H.; Li, R. LEF-YOLO: A lightweight method for intelligent detection of four extreme wildfires based on the YOLO framework. Int. J. Wildland Fire 2023, 33, WF23044.
30. Bhargav, R.; Singh, P. Efficient UAV-Based Forest Fire Detection Using CNN and YOLOv8 Integration. In Proceedings of the 2025 6th International Conference on Recent Advances in Information Technology (RAIT); IEEE: New York, NY, USA, 2025; pp. 1–6.
31. Jonnalagadda, A.V.; Hashim, H.A. SegNet: A segmented deep learning based Convolutional Neural Network approach for drones wildfire detection. Remote Sens. Appl. Soc. Environ. 2024, 34, 101181.
32. Khryashchev, V.V.; Larionov, R. Wildfire Segmentation on Satellite Images using Deep Learning. In Proceedings of the 2020 Moscow Workshop on Electronic and Networking Technologies (MWENT); IEEE: New York, NY, USA, 2020; pp. 1–5.
33. Bouguettaya, A.; Zarzour, H.; Taberkit, A.M.; Kechida, A. A review on early wildfire detection from unmanned aerial vehicles using deep learning-based computer vision algorithms. Signal Process. 2022, 190, 108309.
34. Ghali, R.; Akhloufi, M.A.; Mseddi, W.S. Deep learning and transformer approaches for UAV-based wildfire detection and segmentation. Sensors 2022, 22, 1977.
35. Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488.
36. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR); IEEE: New York, NY, USA, 2021; pp. 3791–3798.
37. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale feature fusion small object detection network for UAV aerial images. IEEE Trans. Instrum. Meas. 2024, 73, 3381272.
38. Hu, M.; Li, Z.; Yu, J.; Wan, X.; Tan, H.; Lin, Z. Efficient-lightweight YOLO: Improving small object detection in YOLO for aerial images. Sensors 2023, 23, 6423.
39. Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190.
40. Ozge Unel, F.; Ozkalayci, B.O.; Cigla, C. The Power of Tiling for Small Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops; IEEE: New York, NY, USA, 2019.
41. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP); IEEE: New York, NY, USA, 2022; pp. 966–970.
42. Zhang, H.; Hao, C.; Song, W.; Jiang, B.; Li, B. Adaptive slicing-aided hyper inference for small object detection in high-resolution remote sensing images. Remote Sens. 2023, 15, 1249.
43. Hao, C.; Zhang, H.; Song, W.; Liu, F.; Wu, E. SliNet: Slicing-aided learning for small object detection. IEEE Signal Process. Lett. 2024, 31, 790–794.
44. Telçeken, M.; Akgun, D.; Kacar, S. An evaluation of image slicing and YOLO architectures for object detection in UAV images. Appl. Sci. 2024, 14, 11293.
45. Muzammul, M.; Li, X.; Li, X. Enhancing Tiny Object Detection Using Guided Object Inference Slicing (GOIS): An efficient dynamic adaptive framework for fine-tuned and non-fine-tuned deep learning models. Neurocomputing 2025, 640, 130327.
46. Chen, Z.; Chen, G. STTSBI: A Fast Inference Framework for Small Object Detection in Ultra-High-Resolution Images. In Proceedings of the 2024 4th International Conference on Intelligent Technology and Embedded Systems (ICITES); IEEE: New York, NY, USA, 2024; pp. 129–135.
47. Koyun, O.C.; Keser, R.K.; Akkaya, I.B.; Töreyin, B.U. Focus-and-Detect: A small object detection framework for aerial images. Signal Process. Image Commun. 2022, 104, 116675.
48. Wang, M.; Yue, P.; Jiang, L.; Yu, D.; Tuo, T.; Li, J. An open flame and smoke detection dataset for deep learning in remote sensing based fire detection. Geo-Spat. Inf. Sci. 2025, 28, 511–526.
49. De Venâncio, P.V.A.; Rezende, T.M.; Lisboa, A.C.; Barbosa, A.V. Fire detection based on a two-dimensional convolutional neural network and temporal analysis. In Proceedings of the 2021 IEEE Latin American Conference on Computational Intelligence (LA-CCI); IEEE: New York, NY, USA, 2021; pp. 1–6.
50. Pyronear Team. Pyro-SDIS Dataset. Hugging Face. 2024. Available online: https://huggingface.co/datasets/pyronear/pyro-sdis (accessed on 9 January 2026).
51. Sapkota, R.; Karkee, M. Ultralytics YOLO evolution: An overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 object detectors for computer vision and pattern recognition. arXiv 2025, arXiv:2510.09653.
Figure 1. Overview of the proposed multi-resolution detection framework.
Figure 2. Composite image generation pipeline diagram.
Figure 3. Step-by-step visualization of skyline detection. The module converts the image to YCrCb, identifies sky pixels via chrominance–luminance thresholds, and detects the sky-to-land transition through row-wise analysis. (a) Input image, (b) Sky pixel distribution, (c) Skyline detection, (d) ROI extraction.
Figure 4. Qualitative examples from the Early Smoke dataset.
Figure 5. Qualitative examples from the Pyro-SDIS dataset.
Figure 6. YCbCr planes at the four luminance breakpoints used to define the luminance-adaptive chrominance thresholds.
Figure 7. Comparison of baseline vs. composite detector performance on the Early Smoke dataset.
Figure 8. Qualitative detection examples on the Early Smoke dataset using baseline YOLOv8s detection. Bounding boxes are color-coded by detection outcome: blue denotes true positives and red denotes false positives.
Figure 9. Qualitative detection examples on the Early Smoke dataset using the composite-trained model. Composite training improves confidence stability and plume localization for small smoke plumes.
Figure 10. Confidence–recall and confidence–precision curves for the Early Smoke dataset. The selected deployment threshold of Th_conf = 0.2 preserves recall, while the 0.1–0.3 band defines the refinement region for the two-stage classifier.
Figure 11. Qualitative ablation examples using the baseline detector. Bounding boxes are color-coded by outcome: blue denotes true positives and red denotes false positives.
Figure 12. Qualitative ablation examples using the full pipeline.
Figure 13. Qualitative failure cases for early-smoke detection. Green denotes ground truth, red denotes false positives, and blue denotes true positives.
Table 1. Hardware and software environment for all experiments.
Component | Specification | Manufacturer (City, Country)
GPU | NVIDIA RTX A5000 (24 GB) | NVIDIA Corporation (Santa Clara, CA, USA)
CPU & Memory | Intel Core i5-14400F; 64 GB RAM | Intel Corporation (Santa Clara, CA, USA)
OS | Ubuntu 24.04.3 LTS | Canonical Ltd. (London, UK)
CUDA version | 12.6 | NVIDIA Corporation (Santa Clara, CA, USA)
Python version | Python 3.10.18 | Python Software Foundation (Beaverton, OR, USA)
PyTorch version | 2.7.1 + cu126 | Meta Platforms, Inc. (Menlo Park, CA, USA)
Table 2. Ablation comparing training crop placement strategies under a fixed skyline-based inference policy.
Model | AP @0.5:0.95 | AP @0.5 | AP_small | F1 | Precision | Recall
GT-based trained | 0.420 | 0.852 | 0.212 | 0.832 | 0.881 | 0.787
Skyline-trained | 0.397 | 0.845 | 0.113 | 0.812 | 0.832 | 0.793
Table 3. Detector comparison across YOLO variants using a consistent evaluation protocol.
Model | AP @0.50:0.95 | AP @0.50 | AP_small | F1 | Precision | Recall
YOLOv12 | 0.400 | 0.843 | 0.127 | 0.838 | 0.902 | 0.782
YOLOv11 | 0.389 | 0.795 | 0.222 | 0.821 | 0.839 | 0.803
YOLOv10 | 0.379 | 0.814 | 0.153 | 0.806 | 0.775 | 0.840
YOLOv9 | 0.421 | 0.858 | 0.219 | 0.835 | 0.864 | 0.809
YOLOv8 | 0.420 | 0.852 | 0.212 | 0.832 | 0.881 | 0.787
YOLOv5su | 0.409 | 0.825 | 0.102 | 0.811 | 0.802 | 0.819
Table 4. Detection performance on the Early Smoke dataset.
Model | AP @0.5:0.95 | AP @0.5 | AP_small | F1 | Precision | Recall
Baseline | 0.374 | 0.818 | 0.054 | 0.789 | 0.852 | 0.734
Ours | 0.420 | 0.852 | 0.212 | 0.832 | 0.881 | 0.787
Table 5. Comparison against other mainstream detectors on the Early Smoke dataset.
Model | AP @0.5:0.95 | AP @0.5 | AP_small | F1 | Precision | Recall
YOLOv5su | 0.384 | 0.839 | 0.136 | 0.823 | 0.844 | 0.803
YOLOv8s | 0.374 | 0.818 | 0.054 | 0.789 | 0.852 | 0.734
YOLOv9s | 0.321 | 0.776 | 0.077 | 0.776 | 0.844 | 0.718
YOLOv10s (SGD) | 0.367 | 0.804 | 0.210 | 0.785 | 0.816 | 0.755
YOLOv11s | 0.379 | 0.792 | 0.134 | 0.770 | 0.852 | 0.702
Ours | 0.420 | 0.852 | 0.212 | 0.832 | 0.881 | 0.787
Table 6. Deployment evaluation on the Early Smoke dataset at a fixed threshold of Th_conf = 0.2, compared with the YOLOv8s baseline detector. The two-stage classifier refines detections in the 0.1–0.3 confidence interval.
Experiment | AP @0.50:0.95 | AP @0.50 | AP_small | F1 | Precision | Recall
Baseline | 0.345 | 0.760 | 0.037 | 0.789 | 0.852 | 0.734
Multiresolution | 0.404 | 0.813 | 0.188 | 0.832 | 0.881 | 0.787
Multiresolution + Classifier | 0.410 | 0.820 | 0.188 | 0.832 | 0.881 | 0.787
Table 7. Ablation study on the Early Smoke dataset evaluating the effects of the multiresolution representation and the two-stage classifier (O = enabled, X = disabled). The classifier refines detections in the 0.1–0.3 confidence interval.
Multiresolution | Classifier | AP @0.50:0.95 | AP @0.50 | AP_small | F1 | Precision | Recall
X | X | 0.345 | 0.760 | 0.037 | 0.789 | 0.852 | 0.734
X | O | 0.343 | 0.768 | 0.037 | 0.806 | 0.804 | 0.809
O | X | 0.404 | 0.813 | 0.188 | 0.832 | 0.881 | 0.787
O | O | 0.410 | 0.820 | 0.188 | 0.832 | 0.881 | 0.787
Table 8. Inference-time performance for the ablation settings in Table 7 (O = enabled, X = disabled).
Multiresolution | Classifier | Inference Time (ms) | FPS
X | X | 3.168 | 315.66
X | O | 7.515 | 133.07
O | X | 5.825 | 171.67
O | O | 11.708 | 85.41
Table 9. Comparative performance of the multiresolution pipeline on the Pyro-SDIS dataset.
Model | AP @0.5:0.95 | AP @0.5 | AP @0.75 | AP_small | F1 | Precision | Recall
YOLOv5su | 0.419 | 0.714 | 0.445 | 0.400 | 0.700 | 0.683 | 0.719
YOLOv8s | 0.418 | 0.737 | 0.423 | 0.405 | 0.729 | 0.745 | 0.714
YOLOv9s | 0.406 | 0.703 | 0.405 | 0.361 | 0.687 | 0.698 | 0.676
YOLOv10s | 0.413 | 0.713 | 0.432 | 0.391 | 0.699 | 0.750 | 0.655
YOLOv11s | 0.410 | 0.718 | 0.417 | 0.367 | 0.691 | 0.691 | 0.690
Ours | 0.436 | 0.749 | 0.447 | 0.410 | 0.726 | 0.777 | 0.681
Table 10. Deployment-setting performance on the Pyro-SDIS dataset under a fixed decision threshold of Th_conf = 0.2 (O = enabled, X = disabled). The classifier is applied only to detections within the 0.1–0.3 confidence band.
Multiresolution | Classifier | AP @0.50:0.95 | AP @0.50 | AP @0.75 | AP_small | F1 | Precision | Recall
X | X | 0.386 | 0.664 | 0.404 | 0.365 | 0.724 | 0.758 | 0.692
X | O | 0.401 | 0.667 | 0.424 | 0.375 | 0.734 | 0.708 | 0.762
O | X | 0.410 | 0.691 | 0.426 | 0.386 | 0.726 | 0.777 | 0.681
O | O | 0.416 | 0.703 | 0.431 | 0.394 | 0.726 | 0.777 | 0.681
Table 11. Runtime performance for the deployment settings in Table 10 (O = enabled, X = disabled).
Multiresolution | Classifier | Average Inference Time (ms) | FPS
X | X | 3.238 | 308.83
X | O | 6.741 | 148.35
O | X | 5.630 | 177.62
O | O | 13.818 | 72.37