Firefighting Water Jet Trajectory Detection from Unmanned Aerial Vehicle Imagery Using Learnable Prompt Vectors

This research presents an innovative methodology aimed at monitoring jet trajectory during the jetting process using imagery captured by unmanned aerial vehicles (UAVs). This approach seamlessly integrates UAV imagery with an offline learnable prompt vector module (OPVM) to enhance trajectory monitoring accuracy and stability. By leveraging a high-resolution camera mounted on a UAV, image enhancement is proposed to solve the problem of geometric and photometric distortion in jet trajectory images, and the Faster R-CNN network is deployed to detect objects within the images and precisely identify the jet trajectory within the video stream. Subsequently, the offline learnable prompt vector module is incorporated to further refine trajectory predictions, thereby improving monitoring accuracy and stability. In particular, the offline learnable prompt vector module not only learns the visual characteristics of jet trajectory but also incorporates their textual features, thus adopting a bimodal approach to trajectory analysis. Additionally, OPVM is trained offline, thereby minimizing additional memory and computational resource requirements. Experimental findings underscore the method’s remarkable precision of 95.4% and efficiency in monitoring jet trajectory, thereby laying a solid foundation for advancements in trajectory detection and tracking. This methodology holds significant potential for application in firefighting systems and industrial processes, offering a robust framework to address dynamic trajectory monitoring challenges and augment computer vision capabilities in practical scenarios.


Introduction
Forest fires, or wildfires, have caused significant loss of life and property damage in recent years [1]. They are classified as natural, caused by phenomena like lightning and dry weather, or human-caused, resulting from activities such as cooking and negligence. Approximately 90% of wildfires are human-caused [2], and the emission of toxic gases from these fires affects both humans and wildlife [3]. The emphasis on public safety has driven the transformation of fire safety equipment towards automation and intelligence [4][5][6]. Urban areas increasingly mandate automatic fire suppression systems, and firefighting vehicles are becoming more automated [7][8][9]. Hence, there is a growing emphasis on swiftly and precisely delivering firefighting water to the fire point, making it a focal point of research. Identifying and locating the fire source, as well as automatically controlling the water flow landing point, are pivotal components of an automatic fire suppression system during firefighting operations [10][11][12]. Delivering firefighting water swiftly and accurately is crucial, yet research on water flow trajectory remains limited. Real-time monitoring of the water flow trajectory is essential for effective firefighting and for reducing response times. Motivated by recent advancements in large-scale language-vision models and prompt learning, we address the challenges posed by harsh fire environments and potential data sampling damage by proposing a prompt computer vision (PCV) method that integrates UAV image characteristics with an offline module that learns prompt vectors (OMPV) for jet trajectory detection. The PCV method enables the rapid acquisition of high-quality jet trajectory images through the flexibility, safety, and efficiency of UAVs. The remainder of this paper is organized as follows: Section 2 details the PCV method, Section 3 presents the experiments, Section 4 presents results and discussion, and Section 5 concludes our findings.
The major contributions of this paper can be summarized as follows: (1) Image enhancement techniques are used to substantially reduce the radiometric and geometric distortion of UAV-based jet trajectory images. Data augmentation, such as flipping and adjusting image brightness, hue, contrast, and saturation, enhances image features to counteract both types of distortion. (2) An offline module of learnable prompt vectors is designed to improve classification scores for detected jet trajectories. The visual simplicity of a jet trajectory yields only basic extracted features, hindering accurate classification. Prompt vectors, as textual features, complement the network inputs, thereby improving classification accuracy and enhancing detection reliability. (3) A method for detecting jet trajectory from UAV imagery using learnable prompt vectors is proposed. The approach has a significant advantage in locating jet trajectories in adverse environments and classifying them in feature-scarce scenarios.

The Prompt Computer Vision Methods
The proposed method involves several preparatory steps, followed by the core components of the PCV method. The preparatory steps include UAV image capture and image preprocessing, while the core components are learnable prompt vector generation and the jet trajectory model. First, aerial images of jet streams are collected using UAVs, and data augmentation is applied to these images to enhance sample diversity and model robustness. Subsequently, prompt vectors for learning jet trajectory are generated to improve model generalization and acquire textual features of the jet trajectory. Finally, the visually augmented features and learned textual features are input into a computer vision model for jet trajectory detection. All parts will be described in more detail in the following sections.

UAV Image Capture
Capturing high-quality images is crucial for the visual detection of jet trajectory. The UAV sensor device is specifically designed to capture jet trajectory images, as it is challenging to fully capture trajectory images during jetting. The UAV sensor device consists of a fuselage, propellers, and a visual sensor, with the visual sensor fixed on the UAV fuselage. During firefighting, the water jet landing point is far from the fire monitor, making it difficult to capture complete trajectory images at the fire monitor's location. Meanwhile, controlling the UAV sensor device to fly above the fire while maintaining a certain distance allows for the safe and efficient collection of jet trajectory images.
As depicted in Figure 1, the UAV is equipped with a visual sensor to capture jet trajectory images from distances of 30 m, 40 m, 50 m, and 60 m away from the fire scene. The frame rate, indicating the number of images captured per second, is set to 30 frames per second. Consequently, the visual sensor captures 30 images per second, and each resulting image is 1280 pixels wide and 720 pixels high.


Image Preprocessing

Jet Trajectory Datasets
To comprehensively simulate real fire incidents and collect data, the UAV recorded videos of the jet trajectory from distances of 30 m, 40 m, 50 m, and 60 m away from the fire scene. Equipped with a visual sensor operating at 30 frames per second, each video had a duration of 2 min, ensuring sufficient coverage and detailed capture of the trajectory dynamics. Adhering to the principles of random sampling, 50 frames of jet trajectory images were extracted from each video, ensuring that the data distribution follows an independent distribution pattern. This approach enhances the diversity and representativeness of the dataset, enabling machine learning models to generalize better across different scenarios and conditions. Moreover, by capturing trajectory images from varying heights, the dataset encompasses a wide range of perspectives, allowing for the analysis of trajectory behavior under different altitude conditions. Additionally, the use of UAVs provides flexibility and mobility, enabling the collection of data from different locations and angles, which is crucial for understanding and predicting fire behavior accurately. Furthermore, the high-resolution images captured by the visual sensor ensure detailed observation and analysis of the trajectory morphology, facilitating the development of robust detection and prediction algorithms. Overall, this method of UAV-based data collection combines precision, efficiency, and versatility, making it a valuable tool for fire research and emergency response planning. The image conversion algorithm extracts frames from a video file and saves them as individual images. It first opens the input video, determines the total number of frames, and calculates the interval for frame extraction. It then iterates through the video, saving frames at the specified interval as separate images until the end of the video is reached. Finally, it releases the video capture resource. In this way, we obtained a training dataset consisting of 200 jet trajectory images, with the data distribution being independent. Similarly, we performed the same operation to acquire 200 images for the test dataset.
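The sampling step of the frame-extraction procedure above can be sketched as follows. This is a minimal plain-Python illustration of the interval logic only; the actual implementation reads and writes frames with a video library such as OpenCV, which is omitted here:

```python
def sample_frame_indices(total_frames, num_samples):
    """Evenly spaced frame indices: compute the extraction interval from the
    total frame count, then take every interval-th frame until num_samples
    frames are selected (or the video ends)."""
    interval = max(1, total_frames // num_samples)
    return [i * interval for i in range(num_samples) if i * interval < total_frames]

# A 2-minute video at 30 fps has 3600 frames; sampling 50 frames
# gives one frame every 72 frames (i.e., every 2.4 s).
indices = sample_frame_indices(3600, 50)
```

With the recording parameters reported above (2 min at 30 fps), this yields exactly the 50 frames per video used to build the dataset.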
After obtaining the images, the next step involves annotating them, a process vital for training machine learning models. We used labelimg, a popular annotation tool in the computer vision community, to label the images. This tool enables us to draw bounding boxes around the regions of interest, such as the trajectory of the water jet in our case, and assign corresponding labels. This annotated dataset serves as the foundation for training and evaluating our jet trajectory detection model. Figure 2 illustrates the image processing pipeline applied to the UAV-captured images. In Figure 2a, we present the original image obtained from the UAV sensor, showcasing the raw data captured during flight. Subsequently, in Figure 2b, we demonstrate the image after undergoing label enhancement techniques. These enhancements serve to refine the image quality and emphasize pertinent features, facilitating more accurate analysis and interpretation.


Image Enhancement
Jet trajectory images may suffer from radiometric distortion [31] and geometric distortion [32]. Radiometric distortion leads to unrealistic brightness and color in the images, while geometric distortion may cause shifts or deformations in the positions and shapes of the jet trajectory within the images. To address radiometric distortion, radiometric correction methods adjust the exposure and color balance of the images to restore their true radiometric information, while geometric correction methods rotate, scale, and translate the images to correct the positions and shapes of the jet trajectory, making them more accurate and reliable.
As shown in Figure 3, upon receiving the jet trajectory images, the initial step involves resizing each image to 1000 × 600 pixels to address potential geometric distortions and ensure uniformity and compatibility across the dataset. Following this, augmentation techniques are applied: images are randomly flipped horizontally with a probability of 0.5. This flipping process not only enhances the dataset's diversity but also aids in mitigating geometric distortions, particularly when dealing with varied orientations of the jet trajectory in real-world scenarios. Additionally, resizing facilitates efficient processing and analysis of the images, ensuring consistency in feature extraction and model performance across different platforms and computational environments. Figure 4 provides a visual representation of the image-flipping process. The top row displays the original images, while the bottom row exhibits the images after undergoing the flipping transformation.
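The random horizontal flip described above reduces to a simple left-to-right row reversal applied with probability 0.5. A minimal plain-Python sketch of this augmentation (illustrative only, not the pipeline's actual implementation, which operates on image tensors):

```python
import random

def flip_horizontal(image):
    """Mirror an image (given as rows of pixel values) left-to-right."""
    return [row[::-1] for row in image]

def random_flip(image, prob=0.5, rng=random):
    """Apply a horizontal flip with the given probability, as in the
    augmentation pipeline (prob=0.5 matches the paper's setting)."""
    return flip_horizontal(image) if rng.random() < prob else image

img = [[1, 2, 3],
       [4, 5, 6]]
mirrored = flip_horizontal(img)
```

Applying the flip twice recovers the original image, which is why flipping never discards information while still doubling the effective pose diversity of the dataset.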
Sensors 2024, 24, x FOR PEER REVIEW 6 of 18
Using the specified data augmentation parameters addresses the issue of photometric distortion in jet trajectory images. By adjusting brightness (by up to 32 units), contrast (varied within the range of 0.5 to 1.5), saturation (varied within the range of 0.5 to 1.5), and hue (by up to 18 units), the photometric characteristics of the images are adjusted to enhance their authenticity and quality. This parameterized data augmentation is essential, as jet trajectory images are susceptible to photometric distortions, leading to inconsistencies in brightness, contrast, saturation, and hue. Fine-tuning these parameters yields sharper and more realistic images, thereby enhancing data quality and facilitating the training and performance improvement of subsequent fire detection models. Figure 5 illustrates the visual effects of photometric enhancement applied to the images. The top row depicts the original images, while the bottom row showcases the images after undergoing photometric enhancement. This technique aims to improve image quality by adjusting brightness, contrast, and color balance, thereby enhancing the interpretability and analysis of the images.
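The brightness and contrast components of this photometric augmentation can be sketched in plain Python. This is an illustrative simplification on grayscale values in [0, 255], assuming the reported ranges are sampled uniformly at random; the actual pipeline operates on color images and additionally jitters saturation and hue:

```python
import random

def photometric_jitter(pixels, rng):
    """Randomly jitter brightness and contrast of grayscale pixel values,
    mirroring the ranges reported in the paper (illustrative sketch only)."""
    delta = rng.uniform(-32, 32)   # brightness shift, up to 32 units
    scale = rng.uniform(0.5, 1.5)  # contrast factor within [0.5, 1.5]
    # shift, scale, then clamp back into the valid intensity range
    return [min(255.0, max(0.0, (p + delta) * scale)) for p in pixels]

rng = random.Random(0)  # seeded for reproducibility
jittered = photometric_jitter([0, 128, 255], rng)
```

Clamping after the jitter keeps the augmented values within the valid intensity range, so the augmentation never produces out-of-gamut pixels.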


Learnable Prompt Vector Generation

Definition of Prompt Learning
Prompt learning is a machine learning paradigm aimed at improving the performance and generalization ability of models by incorporating external prompt information to guide their learning process. These prompts can be specially designed vectors or relevant information automatically extracted from data that assists models in better understanding and processing input data. Prompt learning is widely applied in fields such as computer vision, natural language processing, and reinforcement learning, providing additional context and semantic information to models to tackle complex real-world problems.
The pioneering study to introduce prompt learning to the field of computer vision was CLIP [15]. During training, CLIP uses a contrastive loss to learn a shared embedding space for images and text. In a minibatch of image-text pairs, CLIP maximizes the cosine similarity between each image and its corresponding text while minimizing its similarity with all other, unmatched texts; the same is done for each text. After training, CLIP enables zero-shot image recognition. Let $x$ represent the image features from the image encoder, and let $\{w_i\}_{i=1}^{N}$ denote a set of weight vectors from the text encoder, each corresponding to a category (assuming there are $N$ categories in total). These weight vectors are derived from prompts, such as "a photo of a {class}", where the "{class}" token is replaced with the $i$-th class name. The prediction probability is then calculated as

$$p(y = i \mid x) = \frac{\exp(\mathrm{sim}(x, w_i)/\lambda)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(x, w_j)/\lambda)}$$
Here, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\lambda$ is a learned temperature parameter. The aim of CoOp [28] is to address inefficiencies in prompt engineering when adapting pre-trained vision-language models to downstream applications. Its core concept involves modeling each context token with a continuous vector that can be learned end-to-end from data. Specifically, instead of employing "a photo of a" as the context, CoOp introduces $M$ learnable context vectors, $\{u_1, u_2, u_3, \ldots, u_M\}$, each with the same dimensionality as the word embeddings. The prompt for the $i$-th class, denoted by $C_i$, is then represented as $C_i = \{u_1, u_2, u_3, \ldots, u_M, c_i\}$, where $c_i$ stands for the word embedding(s) of the class name. These context vectors are shared across all classes. To fine-tune CLIP for a downstream image recognition dataset, a cross-entropy loss is typically employed as the learning objective. Given that the text encoder $E(\cdot)$ is differentiable, gradients can be backpropagated to update the context vectors accordingly. The prediction probability is then as follows:

$$p(y = i \mid x) = \frac{\exp(\mathrm{sim}(x, E(C_i))/\lambda)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(x, E(C_j))/\lambda)}$$
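The prompt-based prediction probability above reduces to a softmax over temperature-scaled cosine similarities. A minimal plain-Python sketch (the feature vectors and the temperature value are illustrative, not taken from the paper):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def class_probs(image_feat, text_feats, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities between the image
    feature and each class's text (prompt) feature."""
    logits = [cosine_sim(image_feat, w) / temperature for w in text_feats]
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the image feature aligns best with the first class prompt.
x = [1.0, 0.0]
ws = [[0.9, 0.1], [0.0, 1.0]]
probs = class_probs(x, ws)
```

The small temperature sharpens the distribution: even a modest similarity gap between prompts translates into a near-one-hot probability vector.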

Offline Module of Learnable Prompt Vectors
To address the jet trajectory image detection problem, we employ the offline module to learn prompt vectors (OMPV), which enhances the context vectors with instance-conditional context for improved generalization. The offline module can be trained before detection and does not take up memory or runtime during detection. Its structure is shown in Figure 6. Each context token is conditioned on the input image: a lightweight network (the Meta-Net) maps the image features $x$ to a conditional token $\pi(x)$ that is added to every context vector, $u_m(x) = u_m + \pi(x)$ for $m \in \{1, \ldots, M\}$. The prompt for the $i$-th class is thus conditioned on the input, that is, $C_i(x) = \{u_1(x), u_2(x), \ldots, u_M(x), c_i\}$. The prediction probability is computed as follows:

$$p(y = i \mid x) = \frac{\exp(\mathrm{sim}(x, E(C_i(x)))/\lambda)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(x, E(C_j(x)))/\lambda)}$$

Figure 7 illustrates the architecture of the Meta-Net employed in this study, characterized by a two-layer bottleneck structure comprising linear, ReLU, and linear layers. The hidden layer reduces the input dimensionality by a factor of 16×. The Meta-Net receives as input the output features generated by the image encoder, facilitating further processing and refinement of the encoded information. In the Meta-Net, "Linear" refers to a layer that performs a linear transformation, typically matrix multiplication plus a bias, and "ReLU" (Rectified Linear Unit) is a popular non-linear activation function that outputs the input value if it is positive and zero otherwise.

Our methodology builds upon the integration of prompt vectors $v$ and classification branches $f$ within the CLIP framework, leveraging the cross-entropy loss function $L_{CE}$ for effective classification. The core objective is to enhance the model's ability to classify images by incorporating semantically meaningful prompts alongside traditional classification mechanisms.
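The Meta-Net forward pass described above (Linear, ReLU, Linear, with a 16× bottleneck) can be sketched in plain Python. The weights, dimensions, and output size below are purely illustrative toy values, not the trained parameters:

```python
def linear(x, weight, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def relu(x):
    """Element-wise ReLU: pass positives through, zero out negatives."""
    return [max(0.0, v) for v in x]

def meta_net(feat, w1, b1, w2, b2):
    """Two-layer Linear-ReLU-Linear bottleneck: the hidden layer is 16x
    smaller than the input, and the output is the conditional token pi(x)."""
    return linear(relu(linear(feat, w1, b1)), w2, b2)

# Toy dimensions: a 16-d image feature squeezed to a 1-d hidden layer
# (the 16x reduction), then projected to a 4-d conditional token.
feat = [1.0] * 16
pi_x = meta_net(feat, [[0.1] * 16], [0.0], [[0.5]] * 4, [0.0] * 4)
```

In the actual module, the same output token $\pi(x)$ is added to every context vector, so one small network conditions all $M$ prompt tokens at once.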

Object Detection Model
First, we integrate the prompt vectors derived from the learnable prompt vector (LPV) module with the classification branches. This integration is crucial, as it allows the model to leverage additional semantic information provided by the prompts during classification. The prompt vectors serve as supplemental cues that guide the classification process towards more accurate and contextually relevant predictions. To facilitate the learning process, we used the cross-entropy loss function $L_{CE}$ as the optimization objective during training.
Here, $v_i^{cls}$ is the logit vector of the classification (cls) head for class $i$, and $\alpha$ is a hyperparameter. This loss function quantifies the disparity between the predicted and actual class labels, driving the model to minimize classification errors and improve overall performance. By aligning the model's predictions with the ground truth labels, the cross-entropy loss fosters the development of robust classification capabilities.
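The equation for this objective did not survive extraction. A plausible reconstruction, assuming the prompt-based score is added to the detector's classification logits with weight $\alpha$ before a standard cross-entropy — this exact fusion is an assumption, not confirmed by the source:

```latex
% hypothetical reconstruction: fused logits z_i, then standard cross-entropy
z_i = v_i^{cls} + \alpha \,\mathrm{sim}\!\left(x, E(C_i(x))\right)/\lambda,
\qquad
L_{CE} = -\sum_{i=1}^{N} y_i \log \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}
```

Setting $\alpha = 0$ recovers the plain detector loss, which is consistent with the ablation over $\alpha$ reported in the experiments.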

Experiment
To verify the proposed PCV method, an experimental facility was constructed at the Ling Tian Co., Ltd. Comprehensive Test Site in Xuzhou City, China, and the experimental results were analyzed to verify the reliability of the proposed characterization methods. The experimental system and the analysis of the results are introduced in the next sections.

Experiment Setup
The experiment was conducted using a comprehensive experimental platform consisting primarily of a UAV system and a firefighting robot, as shown in Figure 9. Auxiliary equipment included control boxes, water cannons, and other supporting devices. The firefighting robot communicated with the UAV through the control boxes to enable remote control and manipulation. During the experiment, the UAV was positioned at different distances from the water cannon's target point to capture variations in the jet trajectory from different aerial perspectives. Figure 9 illustrates the experimental platform used in this study, showcasing the integration of these devices, which provided reliable technical support for jet trajectory detection. We conducted two comparative experiments, comparing detection results on the jet trajectory dataset against Faster R-CNN and ATSS, respectively. We then conducted five ablation experiments to determine the value of the hyperparameter α.
First, we integrate prompt vectors derived from the LPV with the classification branches.This integration is crucial as it allows the model to leverage additional semantic information provided by the prompts during classification.The prompt vectors serve as supplemental cues that guide the classification process towards more accurate and contextually relevant predictions.To facilitate the learning process, we used the crossentropy loss function LCE as the optimization objective during training.

log( )
is the vector of the cls head, and α is a hyperparameter.This loss function quantifies the disparity between the predicted and actual class labels, driving the model to minimize classification errors and improve overall performance.By aligning the model's predictions with ground truth labels, the cross-entropy loss fosters the development of robust classification capabilities.

Experiment
In order to verify the proposed PCV method, an experimental facility was constructed at the Ling Tian Co., Ltd. in Xuzhou City, China.Comprehensive Test Site, and experiment results were analyzed to prove the reliability of the proposed characterization methods.The experimental system and experiment result analysis are introduced in the next sections.

Experiment Setup
The experiment was conducted using a comprehensive experimental platform consisting primarily of a UAV system and a firefighting robot, as shown in Figure 9. Auxiliary equipment included control boxes, water cannons, and other supporting devices. The firefighting robot communicated with the UAV through control boxes to enable remote control and manipulation. During the experiment, the UAV was positioned so that the water cannon was at different distances from the target point, capturing variations in the jet trajectory from different aerial perspectives. Figure 9 illustrates the experimental platform used in this study, showcasing the integration of these devices, which provided reliable technical support for jet trajectory detection. We conducted two comparative experiments, comparing detection results on the jet trajectory dataset against Faster R-CNN and ATSS, respectively. We then conducted five ablation experiments to determine the value of the hyperparameter α.

Parameterization Setup
We collected 50 images at each of four distances (30 m, 40 m, 50 m, and 60 m) from the target point of the firecracker, forming a dataset of 200 jet trajectory images. Each image was 1280 × 720 pixels.
We used the mmdetection framework, a convenient and efficient open-source framework widely used for object detection tasks. The object detection network is based on Faster R-CNN with ResNet-50 as the backbone. We employed stochastic gradient descent (SGD) as the optimizer, with a batch size of 2 and a learning rate of 0.02. All experiments were conducted on a single RTX 3060 GPU, with training epochs set to 12. The hyperparameter α was set to 1.
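For concreteness, these settings could be written as an mmdetection-style config fragment (field names follow mmdetection 2.x conventions; momentum and weight_decay are assumed framework defaults, not values stated in the text):

```python
# Optimizer/schedule settings from the text in mmdetection-style config form.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 12 training epochs
data = dict(samples_per_gpu=2)                         # batch size 2 (one RTX 3060)
alpha = 1.0                                            # prompt classification-loss weight
```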
Image preprocessing: a series of measures was implemented to enhance image quality and accuracy. First, random flipping was applied to increase sample diversity. We set the probability of random flipping during augmentation to 0.5, so that each image has an equal chance of being flipped or not. This introduces variability while maintaining balance, improving the robustness and generalization of the model; flipping is a common data augmentation technique in computer vision that simulates different viewpoints and orientations, enhancing the model's ability to learn invariant features. Second, geometric corrections, including scaling and padding, were performed to rectify geometric distortions in the images. We resized the images with the Resize operation to a width of 1000 pixels and a height of 600 pixels, a choice driven by both experimental requirements and the input size constraints of our model. Setting "keep_ratio = True" kept the aspect ratio unchanged, preventing distortion and preserving image accuracy and integrity. We then applied the Pad operation so that the image dimensions were divisible by 32; certain convolutional neural network (CNN) architectures require input dimensions that are multiples of a specific value, so padding ensured compatibility with the model architecture and improved training efficiency. Lastly, photometric adjustments were conducted to correct exposure and color balance, restoring the true luminance information of the images and improving the visualization of the jet trajectory. The chosen parameter values dictate the extent of the changes applied to image attributes: setting brightness_delta to 32 allows a considerable range of brightness adjustments, ensuring a wide spectrum of luminance variations in the augmented dataset; the contrast_range of (0.5, 1.5) permits both diminishing and amplifying contrast, enriching the dataset with diverse contrast variations; saturation_range (0.5, 1.5) controls saturation adjustments, augmenting images with varying levels of color intensity; and hue_delta set to 18 governs the range of hue shifts, introducing subtle variations in color tone across the dataset. These processing steps rendered the images more accurate and reliable, laying a solid foundation for subsequent data analysis and algorithmic applications.
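Collecting the values above, the augmentation pipeline could be expressed as an mmdetection-style transform list (transform names follow mmdetection 2.x conventions; this is an illustrative reconstruction, not the authors' exact config):

```python
# Illustrative training pipeline matching the preprocessing described above.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations', with_bbox=True),
    dict(type='Resize', img_scale=(1000, 600), keep_ratio=True),  # geometric correction
    dict(type='RandomFlip', flip_ratio=0.5),                      # flip with p = 0.5
    dict(type='PhotoMetricDistortion',                            # photometric adjustment
         brightness_delta=32,
         contrast_range=(0.5, 1.5),
         saturation_range=(0.5, 1.5),
         hue_delta=18),
    dict(type='Pad', size_divisor=32),                            # dims divisible by 32
]
```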
Prediction model: to address the prediction of jet trajectory, we adopted a computer vision approach based on prompt learning. The experimental setup features a Faster R-CNN model with a ResNet-50 backbone pre-trained on ImageNet. A Feature Pyramid Network (FPN) with five output stages enhances multi-scale feature extraction, using anchor boxes with scales of [8] and ratios of [0.5, 1.0, 2.0]. The Region Proposal Network (RPN) generates object proposals with IoU thresholds pos_iou_thr = 0.7 and neg_iou_thr = 0.3, while RoI sampling ensures balanced selection (pos_fraction = 0.25). During testing, non-maximum suppression is applied to control the number of detections per image (max_per_img = 1000). These parameter configurations aim to balance computational efficiency and detection accuracy. The method not only considers the inherent features of the images but also leverages textual prompts to overcome the limited diversity of jet trajectory scenes and the scarcity of visual features. By incorporating visual and textual multimodal features, we provide a more comprehensive description of jet trajectory characteristics, thereby enhancing the accuracy and robustness of the prediction model.
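These detector settings could be written as an mmdetection-style model fragment (field names follow mmdetection 2.x conventions; the FPN channel widths, strides, and RoI sampler size are assumed defaults, not values stated in the text):

```python
# Detector settings listed above, as an illustrative mmdetection-style fragment.
model = dict(
    type='FasterRCNN',
    backbone=dict(type='ResNet', depth=50,
                  init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
    neck=dict(type='FPN', in_channels=[256, 512, 1024, 2048],
              out_channels=256, num_outs=5),            # five output stages
    rpn_head=dict(anchor_generator=dict(scales=[8],
                                        ratios=[0.5, 1.0, 2.0],
                                        strides=[4, 8, 16, 32, 64])),
    train_cfg=dict(
        rpn=dict(assigner=dict(pos_iou_thr=0.7, neg_iou_thr=0.3)),
        rcnn=dict(sampler=dict(num=512, pos_fraction=0.25)),
    ),
    test_cfg=dict(rpn=dict(max_per_img=1000)),          # cap on kept proposals after NMS
)
```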

Experiment Results
The jet trajectory detection network based on prompt-driven computer vision includes key steps such as random flipping, geometric and photometric correction, and learnable prompt vectors. In this study, we conducted a comparative analysis between the improved jet trajectory detection network and traditional methods in both fire detection and jet trajectory detection. To clearly demonstrate the improvements, we extensively compared the prediction results of the new model with Faster R-CNN. During the experiments, we collected 50 jet trajectory images at each of four distances of the firecracker from the target point (30 m, 40 m, 50 m, and 60 m) using a UAV, resulting in a dataset of 200 annotated images. The experimental results are summarized in Table 1, and the visual detection outcomes are presented in Figure 10.

We evaluated the performance of the model using two metrics: recall and mean average precision (mAP). Recall measures the proportion of correctly identified positive samples out of the total positive samples, serving as an indicator of the model's comprehensiveness in detecting targets. It is calculated as Recall = TP / (TP + FN), where TP (true positives) is the number of correctly identified positive samples and FN (false negatives) is the number of positive samples missed by the model. Mean average precision (mAP) is a widely used performance metric for object detection that considers both precision and recall across classes. It is calculated as mAP = (1/N) Σᵢ APᵢ, the mean of the per-class average precisions, where N is the number of classes and APᵢ is the average precision for class i. These evaluation metrics provide comprehensive insights into the performance of the model at different distances, enabling comparisons and optimizations.
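The two metrics can be computed directly from detection counts; a minimal sketch (the TP/FN counts below are hypothetical, for illustration only):

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN): share of ground-truth targets the model finds."""
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP = (1/N) * sum_i AP_i, the arithmetic mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts: 49 of 50 jet trajectories detected.
r = recall(tp=49, fn=1)
# With a single 'jet trajectory' class, mAP reduces to that class's AP.
m = mean_average_precision([0.951])
```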
From Table 1, comparing the detection results of our model with the other models, the improved model achieved a recall of 0.98, which is 0.25 higher than Faster R-CNN, and an mAP of 95.1%, which is 25.4 percentage points higher than Faster R-CNN. Compared with ATSS, our model attains a more impressive AP in jet trajectory detection, 7.3 percentage points higher than ATSS.
Table 2 presents the ablation experiments for each module. Image Enhancement means using image enhancement during network training; OMPV denotes the offline module of learnable prompt vectors. From Table 2, with the introduction of image enhancement the recall increased from 0.73 to 0.89 (a gain of 0.16) and the mAP increased from 69.7 to 84.0 (a gain of 14.3). The introduction of OMPV further improved the recall to 0.98 and the mean average precision (mAP) to 95.1, an additional gain of 11.1 on top of an already significant improvement. This indicates that image enhancement increased the probability of Faster R-CNN detecting the jet trajectory, while the prompts improved the accuracy of jet trajectory classification. In Table 3, we evaluate five different values of the hyperparameter α. As the results in Table 3 show, detection performance is best when α is set to 1.0.

Discussion
To investigate the specific effects of Image Enhancement, we conducted a statistical analysis of all annotated boxes (gts) and detected boxes (dets) in the dataset. In Figure 11, blue denotes Faster R-CNN, orange denotes training with image enhancement, and yellow denotes training with both image enhancement and OMPV. The number of annotated boxes (gts) remained at 286, while the number of detected boxes increased to 337 after applying Image Enhancement, compared with 315 detected boxes for Faster R-CNN training. The right part of Figure 11 shows the number of predicted anchors. With the introduction of prompt learning, correctly predicted anchors become more probable and less effective anchors less probable; the less effective anchors are removed by threshold filtering, so Enhancement + Prompt predicts fewer anchors, but of higher quality. This demonstrates that Image Enhancement significantly improves the network's detection capability for targets.
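The threshold-filtering step described above can be sketched as follows (the box coordinates and scores are hypothetical, and 0.5 is an illustrative threshold, not the value used in the experiments):

```python
def filter_by_score(dets, score_thr=0.5):
    """Drop low-confidence predictions; each det is (x1, y1, x2, y2, score)."""
    return [d for d in dets if d[4] >= score_thr]

# Hypothetical predicted boxes with class scores.
dets = [(10, 20, 200, 60, 0.98), (15, 25, 190, 55, 0.42), (300, 40, 400, 90, 0.12)]
kept = filter_by_score(dets)  # fewer, but higher-quality, boxes survive
```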

Figure 12 compares the detection results of Faster R-CNN with those of the improved approach. Subfigures (a1,b1,c1) present the detection results obtained from Faster R-CNN, while subfigures (a2,b2,c2) showcase the results achieved after applying Image Enhancement. Faster R-CNN exhibits issues such as missed detections (Figure 12(a1)), inaccurate bounding boxes (Figure 12(b1)), and false detections (Figure 12(c1)); the improved method effectively addresses these issues caused by geometric and photometric distortions (Figure 12(a2,b2,c2)). This comparison visually demonstrates the effectiveness of Image Enhancement in improving the accuracy and reliability of the detection results. To validate the effectiveness of OMPV, we visualized the scores of the detected boxes.
Figure 13 compares the class scores between the Faster R-CNN and PCV detection results. Subfigures (a1,b1) present the detection results obtained from Faster R-CNN, and subfigures (a2,b2) showcase the results achieved by the proposed approach, providing a visual representation of the improvement in class score accuracy. For the same detected jet trajectory, the classification scores after introducing prompts (1.00 in both Figure 13(a2,b2)) are higher than those of Faster R-CNN (0.68 in Figure 13(a1) and 0.98 in Figure 13(b1)). This indicates that, for jet trajectory images with singular features, prompts can learn more effective features by introducing multimodal textual cues, thereby addressing the issue of inaccurate classification.

Conclusions
The presented method for monitoring jet trajectory during the jetting process, which integrates UAV imagery with a learnable prompt vector module, significantly enhances detection capability and improves monitoring accuracy and stability, achieving a recall of 0.98 and a mean average precision (mAP) of 95.1%. Experimental results demonstrate the effectiveness of Image Enhancement in mitigating geometric and photometric distortions, leading to more reliable trajectory monitoring in diverse environmental conditions. The high precision and efficiency achieved hold promise for practical applications in firefighting systems and industrial processes, while future research may explore further optimization and extension of the proposed method to other dynamic monitoring tasks beyond jet trajectory detection.
Future research directions: Data diversity and generalization: despite the effectiveness of methods such as Image Enhancement and prompt vectors in improving jet trajectory detection, the diversity and generalization of the dataset remain open issues. In real-world applications, jet trajectory may be influenced by varied environmental conditions, viewing angles, and lighting; it is therefore essential for the model to generalize well across diverse scenarios.
Model interpretability: the interpretability of jet trajectory detection models is crucial in certain critical application scenarios. Users may require insight into the decision-making process of the model to facilitate further analysis and decision-making.

Figure 1 .
Figure 1. The UAV captures jet trajectory images from a top-down perspective using visual sensors.

Sensors 2024, Figure 2.
Figure 2. The orange and purple lines represent the labeled boxes, and the green dots represent the four corner points of the boxes. (a) is the original image, and (b) is the labeled image.

Figure 4.
Figure 4 provides a visual representation of the image-flipping process. The top row displays the original images, while the bottom row shows the images after the flipping transformation.

Offline Module of Learnable Prompt Vectors
To address the jet trajectory image detection problem, we employ the offline module to learn prompt vectors (OMPV), which enhances context vectors with instance-conditional context for improved generalization. The offline module can be trained before detection and does not take up memory or runtime during detection. The specific structure is shown in Figure 6. The jet trajectory images are encoded to generate visual feature vectors I, which are then input into the Meta-Net. The Meta-Net combines these visual feature vectors with token representations of category text, and the concatenated vectors are then fed into the CLIP text encoder to generate text vectors T. Finally, a contrastive loss is computed between the visual feature vectors I and the text vectors T.
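A minimal PyTorch sketch of this instance-conditional prompting (the feature width, the number of context tokens M, and the bottleneck size are illustrative assumptions; the conditioned tokens would subsequently pass through the CLIP text encoder for the contrastive loss, which is omitted here):

```python
import torch
import torch.nn as nn

class MetaNet(nn.Module):
    """Linear -> ReLU -> Linear bottleneck mapping an image feature I to a
    conditioning vector I* (structure as in Figure 7; sizes are illustrative)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // 16),  # compress
            nn.ReLU(),
            nn.Linear(dim // 16, dim),  # expand back to token width
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        return self.net(image_feat)

M, dim = 4, 512                              # M learnable context tokens u_1..u_M
context = nn.Parameter(torch.randn(M, dim))  # shared learnable prompt vectors
meta_net = MetaNet(dim)

I = torch.randn(1, dim)                      # encoded jet-trajectory image (batch of 1)
I_star = meta_net(I)                         # I* = phi_theta(I)
# Instance-conditional tokens u_m(x) = u_m + I*, later concatenated with the
# class token c_i and fed to the CLIP text encoder.
conditioned = context.unsqueeze(0) + I_star.unsqueeze(1)  # shape (1, M, dim)
```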

Figure 6.
Figure 6. Offline module of learnable prompt vectors. Using φθ(·) to denote the Meta-Net parameterized by θ, let I represent the image after passing through the encoder. I is fed into φθ(·) to obtain I*. Next, I* is added to each text token uₘ, and the context token becomes uₘ(x) = uₘ + I*, where I* = φθ(I) and m ∈ {1, 2, ..., M}. The prompt for the i-th class is thus conditioned on the input, that is, Cᵢ(x) = {u₁(x), u₂(x), ..., u_M(x), cᵢ}. The prediction probability is computed as follows:
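The equation itself is missing from the extracted text; based on the CLIP-style contrastive setup described above, the standard conditional-prompt form would be (with T(·) the CLIP text encoder, cos(·,·) cosine similarity, τ a temperature parameter, and N the number of classes; this reconstruction is an assumption):

```latex
p(y = i \mid x) = \frac{\exp\!\bigl(\cos\bigl(T(C_i(x)),\, I\bigr)/\tau\bigr)}{\sum_{j=1}^{N} \exp\!\bigl(\cos\bigl(T(C_j(x)),\, I\bigr)/\tau\bigr)}
```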

Figure 7 .
Figure 7. Meta-Net. "Linear" refers to a layer that performs a linear transformation, typically matrix multiplication plus bias addition. "ReLU" stands for Rectified Linear Unit, a popular non-linear activation function that outputs the input value if it is positive and zero otherwise.

Figure 8
Figure 8 depicts the architecture of the object detection model used in this study. The diagram illustrates the sequential stages of the model's operation. Initially, jet trajectory images undergo data augmentation to enrich the dataset. Subsequently, feature extraction captures visual characteristics from the augmented images. Proposals are then generated using feature pyramids and RoI (Region of Interest) pooling. The RoI Feature Extractor processes these proposals to extract relevant features. Finally, classification and regression heads are applied, with the classification head incorporating prompts to optimize classification efficiency.

Figure 8 .
Figure 8. Object detection model. The diagram illustrates the sequential stages of the model's operation: data augmentation, feature extraction, proposal generation, RoI Feature Extractor, and classification/regression heads. The RoI Feature Extractor processes proposals to extract relevant features, while the classification head incorporates prompts to optimize classification efficiency. The prompts act as supervision for classification.

Figure 9 .
Figure 9. Structural diagram of the experimental platform.

Figure 10.
Figure 10. The visualization of jet trajectory detection; the red lines represent the predicted boxes. The visualization illustrates the detection outcomes of jet trajectory images collected at various distances (30 m, 40 m, 50 m, and 60 m) from the target point using a UAV during the experiments. Subfigures (a1-a4) depict the results at 30 m, (b1-b4) at 40 m, (c1-c4) at 50 m, and (d1-d4) at 60 m.

Figure 11 .
Figure 11. Ablation experiment of gts and dets. gts denotes annotated boxes, and dets denotes detected boxes. Image Enhancement means using image enhancement during network training. Image Enhancement + Prompt means using both image enhancement and the offline module of learnable prompt vectors during network training.

Figure 12.
Figure 12 illustrates the comparison between the detection results obtained from Faster R-CNN and the improved approach. Subfigures (a1,b1,c1) present the detection results obtained from Faster R-CNN; subfigures (a2,b2,c2) showcase the results achieved after applying Image Enhancement. This comparison visually demonstrates the effectiveness of Image Enhancement in improving the accuracy and reliability of the detection results. Faster R-CNN exhibits issues such as missed detections (Figure 12(a1)), inaccurate bounding boxes (Figure 12(b1)), and false detections (Figure 12(c1)); the improved method effectively addresses these issues caused by geometric and photometric distortions (Figure 12(a2,b2,c2)). This further underscores the robustness of the improved method.

Figure 12 .
Figure 12. Detection results of Faster R-CNN versus the improved method; the red lines represent the predicted boxes. (a1,b1,c1) present the detection outcomes obtained from Faster R-CNN, while (a2,b2,c2) showcase the detection results achieved after applying Image Enhancement.

Figure 13 .
Figure 13. Class scores of Faster R-CNN versus the improved method; the red lines represent the predicted boxes. (a1,b1) display the detection outcomes obtained from Faster R-CNN, while (a2,b2) present the detection results achieved by PCV.

Table 1 .
Comparison between Faster R-CNN and PCV. Faster R-CNN denotes the original method, while PCV denotes the proposed method. Backbone denotes the backbone network used in model training.

Table 2 .
Ablation of Image Enhancement and OMPV. "√" indicates that the corresponding module was adopted during the experiment. Image Enhancement means using image enhancement during network training; OMPV denotes the offline module of learnable prompt vectors.

Table 3 .
Comparison results (%) to evaluate the effectiveness of different αs on the dataset.
