A Ship Tracking and Speed Extraction Framework in Hazy Weather Based on Deep Learning

Abstract: Obtaining ship navigation information from maritime videos can significantly improve maritime supervision efficiency and enable timely safety warnings. Ship detection and tracking are essential technologies for mining video information. However, current research focused on these advanced vision tasks in maritime supervision is not sufficiently comprehensive. Taking into account the application of ship detection and tracking technology, this study proposes a deep learning-based ship speed extraction framework under the haze environment. First, a lightweight convolutional neural network (CNN) is used to remove haze from images. Second, the YOLOv5 algorithm is used to detect ships in dehazed marine images


Introduction
Currently, the Automatic Identification System (AIS) serves as the primary platform for exchanging navigation information, including ship speed, between ships and between ships and the shore [1]. However, the rapid growth of the shipping industry has led to an increased number of ships, resulting in AIS signal interference in busy waters. Meanwhile, the system's weak ability to combat data defects, system instability, and environmental interference often causes data delays or losses [2]. Additionally, some ships either lack AIS equipment or turn it off in monitored waters, thereby preventing the maritime supervision department from obtaining timely navigation information [3]. In this situation, both the supervisory authority and ships in the same waters are unable to obtain accurate and timely speed information of other ships, posing a hidden danger to navigation safety. Maritime videos, which provide rich information at a low cost, are widely used in maritime supervision. Techniques such as image processing, target detection, and target tracking are employed to identify obstacles at sea [4] and extract navigation information, such as ship trajectories, from maritime images [5]. These approaches have positive implications for enhancing maritime supervision efficiency and ensuring ship safety.
While several studies have been conducted on ship detection and tracking in maritime images [5][6][7], the related research has not sufficiently explored the application of ship detection and tracking technology, nor has it fully extracted the navigation information

• To address the issue of the image becoming dark after haze removal, thereby obscuring the ship's target features, we improved AOD-Net [9] at the pixel level. After haze removal, the mean peak signal-to-noise ratio (PSNR) of multiple maritime scenes reached 23.86, and the mean structural similarity index (SSIM) was 0.96, thus improving the quality of maritime images.

• We extract ship speed from the images based on the image mapping relationship. The average accuracy of ship speed extraction using this framework across multiple scenes is approximately 95%. Furthermore, the mean squared error (MSE) of the speed values extracted from the dehazed images is approximately 0.3 kn lower than that extracted from the images before haze removal.

• This study provides ideas for the application of advanced vision tasks, such as haze removal from maritime haze images and ship tracking, in maritime scenarios, improving the efficiency of maritime supervision.

Related Work

Image Haze Removal
Image processing for hazy weather is a significant research direction in the field of computer vision. Haze removal methods based on image enhancement primarily aim to enhance image contrast and highlight image details. These methods include the adaptive histogram equalization method [10], Retinex theory [11], etc. While these methods are simple, easy to implement, and widely applicable, they may lead to loss of details or overenhancement. He et al. proposed a method of combining the dark channel prior with the atmospheric scattering model for haze removal [12]. The principle of this method is simple and it works well on most natural scenes, but it is prone to local color distortion or reduced image brightness after removing haze. In recent years, deep learning methods, such as convolutional neural networks (CNN), have been utilized for haze removal, and numerous deep networks have been developed for this purpose [9,13-15]; these haze removal networks have demonstrated improved results in haze removal experiments. However, most of these methods have been applied to land-based scenes, and there is a need for an improved haze removal network specifically tailored for maritime haze videos, considering the differences in sea surface scattering and other imaging characteristics compared to land-based haze. Therefore, this study enhances AOD-Net for maritime haze scenes to more efficiently remove haze from maritime haze images.

Target Tracking
Current methods of multi-target tracking generally employ the TBD (Tracking-by-Detection) strategy, which involves first detecting the target's position in the image and then establishing associations between frames based on appearance consistency or positional similarity of the same target across frames. In recent years, with the advancement of algorithms such as deep learning, tracking accuracy has been enhanced by utilizing techniques such as neural networks to learn the appearance information of targets across different video frames for precise inter-frame associations [16][17][18]. However, the study by [19] demonstrates that when different tracking targets share similar appearance features, matching errors in target IDs can occur, making reliance solely on appearance features for inter-frame association unreliable.
Location similarity-based target tracking methods can overcome issues arising from the appearance similarity of tracked targets. Simple online and real-time tracking (SORT) [20] performs data association based on positional similarity: it first uses a Kalman filter to predict the position of each track in the next frame and then calculates the Intersection over Union (IoU) between the detected and predicted boxes. ByteTrack [21] performs a second matching round for the low-score detection boxes to improve tracking performance when objects are occluded. Inter-frame matching that combines appearance consistency and location similarity can further improve tracking performance [22][23][24], and Deep SORT [25] uses an independent Re-ID model to extract appearance features from the detected boxes to reduce ID matching errors. It is worth stating that the performance of current multi-target tracking algorithms using the TBD strategy is closely related to the results of the detection model, and the performance of the tracking model can be guaranteed when the detection model reaches high accuracy [20]. In this paper, we adopt Deep SORT [25], a flexible and robust tracking model, after ensuring that YOLOv5 can detect ships with stable and high accuracy in maritime scenes.

Techniques for Obtaining Information on Ship Speed
Commonly used technologies for measuring ship speed include AIS [26], radar [27], lasers [28], and video-based speed measurements [29]. The emergence and advancement of the AIS system have provided robust technical support for acquiring ship navigation information [30]. However, as the number of ships at sea continues to increase, AIS signals are prone to interference. Other ships can only obtain ship navigation information [31] if the ship has AIS installed and turned on. Ship speed measurements using laser and radar technologies require specialized and costly equipment. In contrast, marine videos contain a wealth of ship navigation information, which can be easily visualized and processed in real time. Additionally, visual sensors offer a wide monitoring range and are cost-effective [32], making them ideal for applications in complex marine environments with numerous ships and various influencing factors [33]. With the advancement of visual sensor technology, speed measurement methods based on videos hold promising prospects.

Remove Haze in Marine Haze Images Using CNN
The first part of the framework is a lightweight CNN, which is used to remove haze from hazy marine images. To improve the quality of marine images in complex scenes, marine haze images were used to train AOD-Net [9], which can achieve end-to-end dehazing in marine scenes. To avoid the darkening of the maritime images after haze removal by AOD-Net and to solve the problem of not highlighting the structural features of ships in the images after haze removal, Equation (1) was introduced to further highlight the structural features of ships in the images:
$$G(x) = m\,J(x) + n \tag{1}$$
where $J(x)$ is the image before image enhancement, $G(x)$ is the image after image enhancement, $m$ is the gain parameter, and $n$ is the bias parameter; $m$ and $n$ are used to adjust the contrast and brightness of the marine image to further eliminate the impact of background noise. In this study, the AOD-Net model that is trained on marine images and improves the quality of images after haze removal is called e-AOD-Net. The e-AOD-Net uses a CNN to remove haze based on the atmospheric scattering model. The traditional atmospheric scattering model that generates hazy images is described as follows:
$$I(x) = J(x)\,t(x) + A\,\bigl(1 - t(x)\bigr) \tag{2}$$
$$t(x) = e^{-\beta d(x)} \tag{3}$$
In Equation (2), $I(x)$ is the hazy image, $J(x)$ is the clear image before image enhancement, $A$ is the global atmospheric light, and $t(x)$ is the medium transmission, describing the portion of light that reaches the visual sensor without being scattered. As in Equation (3), $t(x)$ can be expressed by the atmospheric scattering coefficient $\beta$ and the distance $d(x)$ between the scene and the visual sensor.
Equations (2) and (3) can be transformed into:
$$J(x) = K(x)\,I(x) - K(x) + b \tag{4}$$
$$K(x) = \frac{\frac{1}{t(x)}\bigl(I(x) - A\bigr) + (A - b)}{I(x) - 1} \tag{5}$$
where $b$ is the deviation value whose default value is 1. Meanwhile, Equation (4) integrates $t(x)$ and $A$ into $K(x)$. The e-AOD-Net builds an adaptive depth estimation model based on the physical model of atmospheric scattering and trains the network by minimizing the error between the pixel values of clear and hazy images.
As shown in Figure 2, $I(x)$ is input into the network to estimate $K(x)$, and $K(x)$ is then fed into the dehazed image generation module as an adaptive parameter. The function of the $K(x)$ estimation module is to estimate the depth and haze concentration of hazy images. At the same time, Equation (1) is used to reduce the impact of noise on an image. Finally, clear images are synthesized by the multiplication and addition layers, and they can be output directly after haze removal to realize end-to-end haze removal of images.
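The output stage described above can be sketched in a few lines of NumPy. This is only an illustration, assuming the AOD-Net form J(x) = K(x) I(x) − K(x) + b and a linear gain/bias enhancement G(x) = m J(x) + n; the function names and sample values are hypothetical, and the closed-form K(x) is used here only to check the algebra on synthetic data (in e-AOD-Net, K(x) is estimated by the CNN).

```python
import numpy as np

def dehaze_from_k(hazy, k, b=1.0):
    """Dehazed image J(x) = K(x) * I(x) - K(x) + b, applied per pixel."""
    return k * hazy - k + b

def enhance(j, m=1.0, n=0.0):
    """Gain/bias enhancement G(x) = m * J(x) + n, clipped to [0, 1]."""
    return np.clip(m * j + n, 0.0, 1.0)

def k_from_physics(hazy, t, a, b=1.0):
    """Closed-form K(x) for a known transmission t and atmospheric light a."""
    return ((hazy - a) / t + (a - b)) / (hazy - 1.0)

# Synthetic check: build a hazy image with the atmospheric scattering
# model I = J * t + A * (1 - t), then invert it through K(x).
rng = np.random.default_rng(0)
clear = rng.uniform(0.1, 0.9, size=(4, 4, 3))   # illustrative clear image
t, a = 0.6, 0.8                                  # assumed transmission and light
hazy = clear * t + a * (1.0 - t)
recovered = dehaze_from_k(hazy, k_from_physics(hazy, t, a))
brightened = enhance(recovered, m=1.1, n=0.05)   # illustrative m and n
```

With a perfect K(x), the recovered image matches the clear image up to floating-point error; the values of m and n would have to be tuned on real maritime data.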

Marine Shipping Target Detection Using the YOLOv5 Algorithm
The second part of the framework involves ship detection. YOLO algorithms are representative one-stage target detection algorithms, which regard target detection as a regression problem with a simple network flow. Among them, the YOLOv5 network is small, stable, and generalizes well [34], making it an ideal choice for real-time flexible target detection in the offshore environment. Because ships in maritime images are usually small and ship speed extraction has high requirements for the computational speed and flexibility of the detection algorithm, this framework adopts YOLOv5 as the target detection algorithm [35]. Meanwhile, the maritime dataset is used to train the YOLOv5 network to realize the fast and accurate positioning of small and medium-sized ships in maritime images and improve the accuracy of ship speed extraction. The main components of the YOLOv5 network are input, backbone, neck, and prediction, as shown in Figure 3.
Among them, mosaic enhancement is used at the input end of the YOLOv5 network to improve the detection accuracy of small ships. Adaptive anchor box calculation and adaptive scaling for different data are performed to improve the calculation speed of the network. The neck network integrates the information of the upper and lower layers to fully extract the features of the ship. At the same time, the cross stage partial network (CSP-Net) is used to enhance the fusion of the target features of the network and improve the extraction efficiency of the ship's features. In the prediction part of YOLOv5, the anchor boxes of the grid are used for target detection on feature maps of different scales. The complete intersection over union (CIoU) is used as the loss function of the bounding box, which allows the algorithm to converge quickly and makes the predicted box more consistent with the ground-truth box. Non-maximum suppression (NMS) is applied in the prediction stage to enhance the detection accuracy of multiple ship targets and overlapping ship targets.
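For readers unfamiliar with the CIoU term, the following is a minimal plain-Python sketch of its commonly published definition (IoU minus a normalized center-distance penalty and an aspect-ratio penalty). It is not taken from the YOLOv5 code base; the function names, the `(x1, y1, x2, y2)` box format, and the `eps` guard are assumptions made for illustration.

```python
import math

def ciou(box_a, box_b, eps=1e-9):
    """Complete IoU between two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    union = wa * ha + wb * hb - inter
    iou = inter / (union + eps)
    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return iou - rho2 / c2 - alpha * v

def ciou_loss(pred_box, true_box):
    """Bounding-box regression loss: 1 - CIoU."""
    return 1.0 - ciou(pred_box, true_box)
```

Unlike plain IoU, the loss stays informative for non-overlapping boxes because the center-distance term still changes as the predicted box moves toward the ground truth.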

Ship Target Tracking with Deep SORT Algorithm
To obtain the pixel displacement of the ships in the images of continuous frames, the video with the detection box information is taken as the input of the Deep SORT algorithm [8] in the third part of this framework. The algorithm first predicts the trajectory in the next frame using the Kalman filter. IoU matching and cascade matching are then performed between the predicted value and the detection box information to track the trajectory of the target ship between consecutive frames of the video [36].
During the prediction process, $x_{k-1} = (u, v, r, h, \dot{u}, \dot{v}, \dot{r}, \dot{h})$ represents the motion state of the target in frame $k-1$, where $(u, v)$ are the center coordinates of the target box, $r$ is the aspect ratio, and $h$ is the height of the detection box. In $x_{k-1}$, the last four variables are the derivatives of the first four variables, representing their rates of change. When the standard Kalman filter is used to predict the motion state of the target, the last four values are treated as constants. With $(u, v, r, h)$ as the prediction result, the motion state prediction of the Kalman algorithm can be expressed as:
$$x_k = A\,x_{k-1} \tag{6}$$
where $x_{k-1}$ is the motion state vector of frame $k-1$, and $A$ is the state transition matrix used to predict the motion state $x_k$ of frame $k$.
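The prediction step can be illustrated with a small NumPy sketch, assuming a constant-velocity model over the eight-dimensional state (u, v, r, h and their rates of change) and a unit time step between frames. The function names, the process-noise matrix `q`, and the sample state are illustrative assumptions, not values from this study.

```python
import numpy as np

def constant_velocity_transition(dt=1.0, dim=4):
    """State transition matrix A for the state (u, v, r, h, du, dv, dr, dh)."""
    a = np.eye(2 * dim)
    a[:dim, dim:] = dt * np.eye(dim)  # position += velocity * dt
    return a

def predict(x_prev, p_prev, a, q):
    """One Kalman prediction step: x_k = A x_{k-1}, P_k = A P_{k-1} A^T + Q."""
    x_pred = a @ x_prev
    p_pred = a @ p_prev @ a.T + q
    return x_pred, p_pred

# Illustrative track: box centre (100, 50), aspect ratio 2, height 20,
# moving 3 px/frame right and 1 px/frame up, growing 0.5 px/frame.
x_prev = np.array([100.0, 50.0, 2.0, 20.0, 3.0, -1.0, 0.0, 0.5])
p_prev = np.eye(8)        # assumed state covariance
q = 0.01 * np.eye(8)      # assumed process noise
x_pred, p_pred = predict(x_prev, p_prev, constant_velocity_transition(), q)
```

The rates of change are carried over unchanged by the prediction, which is what treating the last four values as constants amounts to; they are only revised in the subsequent update step when a detection is matched.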
In the process of data association between continuous frames, the Mahalanobis distance and the cosine distance are introduced to associate data between adjacent frames, and a threshold between the observation box and the prediction box is set for each distance. When the Mahalanobis distance and the cosine distance are both within their threshold ranges, the data association of two adjacent frames is considered successful. The comprehensive association cost is as follows:
$$c_{i,j} = \lambda\,d^{(1)}(i,j) + (1-\lambda)\,d^{(2)}(i,j) \tag{7}$$
where $d^{(1)}(i,j)$ and $d^{(2)}(i,j)$ are the Mahalanobis distance and the cosine distance between the $i$-th track and the $j$-th detection, and $\lambda$ is a hyperparameter that controls the influence of the two distances on the association results.
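A compact sketch of this gated, weighted cost is given below. It assumes the squared Mahalanobis distance on the four observed box variables and the smallest cosine distance to a track's stored appearance features; the helper names and the default thresholds (9.4877, the 0.95 chi-square quantile for four degrees of freedom, and 0.2) are illustrative assumptions rather than the settings used in this study.

```python
import numpy as np

def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance between a detection z and a predicted box."""
    d = z - mean
    return float(d @ np.linalg.solve(cov, d))

def min_cosine_distance(track_features, det_feature):
    """Smallest cosine distance between a detection and a track's features."""
    f = track_features / np.linalg.norm(track_features, axis=1, keepdims=True)
    g = det_feature / np.linalg.norm(det_feature)
    return float(np.min(1.0 - f @ g))

def association_cost(d_motion, d_appearance, lam,
                     t_motion=9.4877, t_appearance=0.2):
    """Weighted cost; infinite when either distance exceeds its threshold."""
    if d_motion > t_motion or d_appearance > t_appearance:
        return float("inf")
    return lam * d_motion + (1.0 - lam) * d_appearance

# Illustrative pair: detection one unit away from the prediction in u,
# with an appearance feature parallel to one stored track feature.
d1 = mahalanobis_sq(np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(4), np.eye(4))
d2 = min_cosine_distance(np.array([[1.0, 0.0], [0.0, 1.0]]),
                         np.array([2.0, 0.0]))
cost = association_cost(d1, d2, lam=0.5)
```

Returning an infinite cost for gated-out pairs lets a downstream assignment step skip them without a separate mask.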
In order to enable targets that are occluded for a long time to be tracked continuously, Deep SORT introduces cascade matching to give priority to the targets with more occurrences. Then, the intersection over union (IoU) between the detection bounding box and the box predicted by the Kalman filter is calculated for the newly emerged targets and the prediction boxes that failed to match. The detection result is accepted if the IoU is greater than the minimum IoU threshold. The equation for calculating IoU is as follows:
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{8}$$
where $A$ is the detection bounding box, and $B$ is the prediction bounding box of the candidate trajectory.
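The IoU test for a single detection and prediction pair can be sketched as follows. Boxes are assumed to be axis-aligned and given as `(x1, y1, x2, y2)`; the function names and the minimum IoU of 0.3 are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_match(detection_box, predicted_box, min_iou=0.3):
    """Accept the pair when the overlap exceeds the minimum IoU threshold."""
    return iou(detection_box, predicted_box) > min_iou
```

For example, two 2 × 2 boxes offset by one pixel in each direction overlap in a 1 × 1 region and have a union area of 7, so their IoU is 1/7 and the pair would be rejected at a threshold of 0.3.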

Space Mapping from 2D to 3D and Speed Extraction
In the fourth part, the mapping relation matrix between the ship's displacement in the images and the actual displacement is obtained by 2D to 3D space mapping. The actual ship displacement is calculated according to the trajectory of the ship in the images obtained by the target tracking algorithm. The ship's movement over a short time is regarded as uniform motion. The average velocity equation is used to estimate the actual velocity of the ships, given the time difference and the actual displacement of the ship [37].
The process of 2D to 3D space mapping involves solving the mapping relationship between objects in the three-dimensional world and points on the two-dimensional image plane. The process involves four coordinate systems, and the representation methods of coordinate systems and points in coordinate systems are as follows:
(1) World coordinate system. The coordinate system corresponding to the three-dimensional world describes the position of the target in the real world. The unit length of the coordinate axis is m. The points in the world coordinate system are represented by $(X_w, Y_w, Z_w)$.
(2) Camera coordinate system. The origin is located at the optical center of the lens, and its x-axis and y-axis are parallel to the two sides of the image plane. The z-axis is perpendicular to the image plane and is the optical axis of the lens. The unit length of the coordinate axis is m. The points in the camera coordinate system are denoted as $(X_c, Y_c, Z_c)$.
(3) Image coordinate system. The origin is the intersection of the optical axis of the camera and the imaging plane, that is, the midpoint of the imaging plane. The unit length of the coordinate axis is mm. The points in the image coordinate system are represented by $(x, y)$.
(4) Pixel coordinate system. The origin is the top-left corner of the imaging plane, and the unit is the pixel. Points in the pixel coordinate system are represented as $(u, v)$.
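The object is transformed from the world coordinate system to the camera coordinate system through translation and rotation, and the transformation equation between the camera coordinate system and the world coordinate system is:
$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} = \begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \tag{9}$$
where $R$ is a 3 × 3 rotation matrix and $t$ is a 3 × 1 translation vector.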
The transformation of the camera coordinate system into the image coordinate system is based on perspective projection. The line connecting a point P in space and the camera optical center O is OP, and the intersection point p between OP and the image plane is the projection of point P on the image, as shown in Figure 4.
According to perspective projection, the conversion equation of the camera coordinate system and the image coordinate system is:
$$Z_c \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X_c \\ Y_c \\ Z_c \\ 1 \end{bmatrix} \tag{10}$$
where $Z_c$ is the scale factor, and $f$ is the focal length. The pixel coordinate system can coincide with the image coordinate system after translation, as shown in Figure 5.
Then, the conversion equation of the image coordinate system and pixel coordinate system is:
$$u = \frac{x}{d_x} + u_0, \qquad v = \frac{y}{d_y} + v_0 \tag{11}$$
Represented by the matrix:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{12}$$
where $d_x$ and $d_y$ are the scale factors of the two coordinate systems in the directions of the x-axis and y-axis, and $(u_0, v_0)$ is the coordinate of the origin of the image coordinate system in the pixel coordinate system. As can be seen from Equations (9), (10) and (12), the conversion equation between the world coordinate system and the pixel coordinate system in which the image is located can be expressed as:
$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix} 1/d_x & 0 & u_0 \\ 0 & 1/d_y & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} R & t \\ 0^{T} & 1 \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} = K\,M \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix} \tag{13}$$
In Equation (13), the elements of $K$ are the configuration parameters of the camera, so $K$ is called the internal parameter matrix of the camera. The elements of $M$ describe the rotation and translation of the camera relative to the world coordinate system, so $M$ is called the external parameter matrix of the camera.
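The chain of transformations from world coordinates to pixel coordinates can be checked numerically with a short NumPy sketch. It assumes the standard pinhole model, with a 3 × 4 internal parameter matrix that already folds in the focal length and pixel scaling, and a 4 × 4 external parameter matrix built from R and t; the function names and the camera values (8 mm focal length, 0.01 mm pixels, principal point at (960, 540)) are illustrative assumptions.

```python
import numpy as np

def intrinsic_matrix(f, dx, dy, u0, v0):
    """3x4 matrix K combining perspective projection and pixel scaling."""
    return np.array([[f / dx, 0.0, u0, 0.0],
                     [0.0, f / dy, v0, 0.0],
                     [0.0, 0.0, 1.0, 0.0]])

def extrinsic_matrix(r, t):
    """4x4 matrix M built from a 3x3 rotation R and a translation t."""
    m = np.eye(4)
    m[:3, :3] = r
    m[:3, 3] = np.ravel(t)
    return m

def world_to_pixel(p_world, k, m):
    """Project (X_w, Y_w, Z_w) to (u, v): Z_c [u, v, 1]^T = K M [X_w, Y_w, Z_w, 1]^T."""
    p_h = np.append(np.asarray(p_world, dtype=float), 1.0)
    uvw = k @ m @ p_h          # third component equals the scale factor Z_c
    return uvw[:2] / uvw[2]

# Illustrative camera: f and pixel sizes in mm, principal point in pixels.
k = intrinsic_matrix(f=8.0, dx=0.01, dy=0.01, u0=960.0, v0=540.0)
m = extrinsic_matrix(np.eye(3), np.zeros(3))   # camera frame = world frame
uv = world_to_pixel((1.0, 2.0, 10.0), k, m)
```

With these values a point 10 m in front of the camera and offset by (1 m, 2 m) projects to pixel (1040, 700), since f / d_x = 800 pixels and the offsets are divided by the depth.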
In this paper, the plane where the ship is located is set to the $X_w O_w Y_w$ plane of the world coordinate system, the direction perpendicular to $X_w O_w Y_w$ is the positive direction of the $Z_w$ axis, and the camera coordinate system is set to coincide with the world coordinate system. Under this assumption, the conversion equation of the pixel coordinate system and the world coordinate system can be simplified as: In this study, the ship's movement in a very short time is regarded as uniform linear motion. Assuming that the ship displacement in the pixel coordinate system in the period ∆t at the moment T can be expressed as (∆u, ∆v), the transformation relationship between the image displacement and actual displacement is as follows: Then the actual displacement of the ship in the time period ∆t is: According to the average velocity formula, the velocity of the ship at the moment T can be expressed as:
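As a rough illustration of the speed extraction step, the sketch below maps two tracked pixel positions onto the water plane and applies the average velocity formula. It assumes that calibration has already produced a 3 × 3 planar mapping matrix `h` from pixel coordinates to metres on the water plane; this matrix, the function names, and the sample values are hypothetical and are not the mapping matrix derived in this study.

```python
import numpy as np

MS_PER_KNOT = 1852.0 / 3600.0  # one knot is 1852 m per hour

def pixel_to_plane(h, uv):
    """Map a pixel (u, v) to plane coordinates (X_w, Y_w) in metres."""
    p = h @ np.array([uv[0], uv[1], 1.0])
    return p[:2] / p[2]

def speed_knots(h, uv_start, uv_end, dt):
    """Average speed over dt seconds, assuming uniform linear motion."""
    displacement = pixel_to_plane(h, uv_end) - pixel_to_plane(h, uv_start)
    metres_per_second = float(np.linalg.norm(displacement)) / dt
    return metres_per_second / MS_PER_KNOT

# Illustrative mapping: every pixel corresponds to 0.5 m on the water plane.
h = np.diag([0.5, 0.5, 1.0])
v = speed_knots(h, uv_start=(100.0, 200.0), uv_end=(106.0, 208.0), dt=1.0)
```

Here the box center moves by (6, 8) pixels, i.e. 5 m in one second, which corresponds to about 9.7 kn. In practice the positions would be the Deep SORT track centers at two frames and `dt` the elapsed time between them.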

Results
The main contents of Section 4 are the experimental details and results. It should be noted that all experiments were carried out on an Intel i7-11800H @ 4.6 GHz computer with a 6 GB memory processor, and the experiments were completed in the Windows 10 system using the PyTorch software library.

Experimental Data
In this study, comparative experiments were conducted on each module of the framework, and simulation experiments were conducted to verify the robustness of the framework in the process of extracting the ship speed. The experimental data for haze removal included three shore-based surveillance videos, which contained 5658 images, and three maritime videos with 8000 marine images. The resolution of the shore-based surveillance images is 640 × 386, and that of the self-built images is 1920 × 1080.
In the haze removal experiment of marine images, 5000 synthetic hazy marine images were selected as the training dataset to improve the generalization performance of AOD-Net on marine scenes, and 600 images were used as the non-repetitive test set. During the maritime target detection and tracking experiment, 8000 high-resolution images were divided into 6400 training images, 1000 validation images, and 600 test images to train the YOLOv5 algorithm. It should be noted that the training and test datasets do not overlap. At the same time, the Deep SORT algorithm was combined with the YOLOv5 algorithm to conduct the multi-ship tracking experiment and the simulation experiment of ship speed extraction in marine scenarios. Shore-based scenes include scenes on cloudy days (scene 1 in Figure 6), scenes on sunny days [38] (scene 2 in Figure 6), and scenes with wave disturbance [38] (scene 3 in Figure 6). The self-built dataset includes the scene of many small target ships (scene 4 in Figure 6), the scene with normal light (scene 5 in Figure 6), and the scene with low light (scene 6 in Figure 6).

Haze Removal of Marine Images
To verify the haze removal performance of the framework, Retinex [40], Dark Channel Prior [12], Contrast Limited Adaptive Histogram Equalization (CLAHE) [41], AOD-Net [9], and the e-AOD-Net adopted in the framework are compared in a haze removal experiment. The experimental results are shown in Figure 8. It should be noted that, when training the haze removal network, synthetic haze images were added following [39] to improve the generalization performance of the haze removal network in marine scenes. We synthesized image datasets with three different haze concentrations, denoted by T in this study: T = 0.3, T = 0.5, and T = 0.7, among which the images with T = 0.3 have the lowest haze concentration, as shown in Figure 7.
As Figure 8 shows, Retinex and Dark Channel Prior usually lead to image distortion, and the color of the images after haze removal is seriously abnormal. CLAHE usually makes the color of images after haze removal too dark, and maritime ship features are not prominent. After haze removal by the AOD-Net network, there is still noise remaining in the images. These phenomena may occur because none of the above competing methods can fully extract the target structural features from ocean images. By contrast, e-AOD-Net can learn more structural features of images in marine scenes after generalization training and adaptive enhancement of marine images. The evaluation results of the dehazed images in multiple scenes using the PSNR and SSIM are presented in Table 1. As shown in Table 1, the e-AOD-Net adopted in this study has stable and good performance in multiple marine scenes, indicating that e-AOD-Net achieves better image enhancement performance. Images after haze removal can highlight more ship information, which is the basis of ship detection and tracking.
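For reference, PSNR between a haze-free reference image and a dehazed image can be computed as in the sketch below, using the standard definition 10 log10(MAX² / MSE). The function name and the 8-bit peak value of 255 are assumptions; the evaluation code used in this study is not shown in the text.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    ref = np.asarray(reference, dtype=np.float64)
    res = np.asarray(restored, dtype=np.float64)
    mse = np.mean((ref - res) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Illustrative check: a uniform error of 10 grey levels gives MSE = 100.
reference = np.full((8, 8, 3), 100, dtype=np.uint8)
restored = np.full((8, 8, 3), 110, dtype=np.uint8)
score = psnr(reference, restored)
```

Converting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.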

Multi-Ship Detection and Tracking Experiment after Images Enhancement
In this part, to verify the detection performance of the algorithm, SSD [44], Faster RCNN [45], and YOLOv4 [46] are compared with the adopted YOLOv5 algorithm [47]. The training comparison chart (Figure 9a), the validation comparison chart (Figure 9b), the frames per second (FPS) comparison chart (Figure 9c), the training index table (Table 2), the validation index table (Table 3), and the test index table (Table 4) are provided.

As can be seen from the comparison curves, under the same training conditions, the YOLOv4 and YOLOv5 algorithms converge quickly, and the YOLOv5 algorithm has a faster image-processing speed (Figure 9c). The test and evaluation parameters show that the trained YOLOv5 algorithm also achieves good detection accuracy in ocean scenes. To verify the performance of images after haze removal by e-AOD-Net in ship detection tasks, we adopted the stable YOLOv5 algorithm to detect ships in the synthesized hazy images and images after haze removal and compared the detection results. The detection results are presented in Figure 10. Figure 10 shows that YOLOv5 has high-precision detection performance in multiple ocean scenarios. Moreover, the degree of recovery of the images after haze removal is high, and the structural features of the ships are prominent, making it easy for the ship targets in the image to be detected by the YOLOv5 algorithm, such as the small ships in the red boxes in scene 4. Haze noise in images reduces the accuracy of target detection. In the case of high haze concentrations, some ships were not detected, such as those in the red boxes in scenes 5 and 6.
Considering the accuracy, detection speed, and detection stability of the algorithm in synthetic haze scenes, the YOLOv5 algorithm is well suited to ship detection in maritime scenarios. In the evaluation of the tracking algorithms, the YOLOv5 algorithm was used as the detector in scenes 4, 5, and 6. In these three scenes, multi-object tracking evaluation parameters were introduced to evaluate the tracking performance of the SORT and Deep SORT algorithms for ships in ocean scenarios. The evaluation results are presented in Tables 5-7. In the tables, an upward arrow indicates that a larger value of the parameter is better, and a downward arrow indicates that a smaller value is better. The optimal evaluation values at haze concentrations T = 0.3, T = 0.5, and T = 0.7 are highlighted in red, yellow, and green, respectively, in Tables 5-7.
As shown in Tables 5-7, the Deep SORT algorithm using YOLOv5 as a detector has higher MOTA and MOTP values as well as lower IDS and ML values in the above scenes. This indicates that Deep SORT can track ships stably while reducing the number of identity switches. In the same scenario, the evaluation results of tracking algorithms that use images after haze removal are usually the optimal values, indicating that haze removal can effectively improve the robustness of target tracking algorithms in maritime scenarios. The YOLOv5 algorithm combined with the Deep SORT target tracking algorithm adopted in the framework can therefore maintain high detection accuracy and stable tracking performance in multi-ship tracking, which is the basis for accurate ship speed extraction in this study.
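As an illustration of how the tracking scores combine the error counts, the following minimal sketch computes MOTA and IDF1 from their standard definitions. The function names and the example counts are hypothetical and are not taken from Tables 5-7; the evaluation tooling actually used in this study is not specified here.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDS) / GT, where GT is the total number
    of ground-truth objects summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN): correctly identified
    detections over the average number of ground-truth and computed
    detections."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(f"MOTA = {mota(10, 5, 2, 100):.2f}")
    print(f"IDF1 = {idf1(80, 10, 30):.2f}")
```

Because MOTA subtracts all three error types from one, a tracker with fewer identity switches scores higher even when its false positives and missed targets are unchanged.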

Ship Speed Extraction
In this section, AIS data are taken as the ground truth of the ship speed. The AIS data of Baosteel Wharf on 5 April 2021 were downloaded from the website http://www.shipxy.com. To make it easier to compare the ground-truth speed with the ship speed extracted from marine images, the speed values were extracted from the AIS data and linearly interpolated to match the video frame by frame. To highlight the effect of this framework on hazy images and its speed extraction performance, one ship is selected from each of scenes 4, 5, and 6 for the speed extraction experiment, and the ground truth is compared with the speed extracted for each ship under different haze concentrations. The extraction and comparison results for the ship speed are shown in Figures 11-13. The mean speed of each ship and the MSE and MAE values relative to the ground-truth speed are listed in Tables 8-10. In Tables 8-10, values highlighted in yellow are the speed results and speed evaluation results extracted from the haze videos, values highlighted in green are those extracted from the videos after haze removal, and values highlighted in red are the true speed values extracted from the AIS data.
3 MOTP (multiple object tracking precision): The misalignment between the annotated and the predicted bounding boxes; 4 MT: The ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span; 5 ML: The ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span; 6 FP: The total number of false positives; 7 FN: The total number of false negatives (missed targets); 8 IDS: The total number of identity switches.
(The meaning of the evaluation parameters in Tables 6 and 7 is the same as described above.)
In Figures 11-13, the red line represents the ship speed extracted from the AIS data, the yellow line represents the ship speed extracted directly from the haze video, and the blue line represents the ship speed extracted by our framework after removing the haze from the maritime haze images. According to the extraction results and the mean ground truth of the ship speed, the speeds of the three ships were 7.71 Kn, 7.50 Kn, and 7.70 Kn, respectively. For ship No. 1 in Figure 11, the accuracy of the extracted speed is easily affected by noise in the images owing to the small size of the ship. When T = 0.3, the MSE and MAE values of the speed are 0.37 Kn and 0.49 Kn due to the slight noise in the images. After haze removal, the fluctuation of the speed curve decreases: the MSE is 0.12 Kn, the MAE is 0.21 Kn, and the ship's average speed changes from 7.54 Kn to 7.73 Kn, which is closer to the average ground-truth speed. When T = 0.5, the haze noise in the images causes the ship speed curve to fluctuate strongly, especially in the late part of the video, and the MSE and MAE of the ship speed are 1.71 Kn and 1.03 Kn. Although the speed extracted after haze removal still fluctuates, it is significantly improved compared with that before haze removal: the MSE and MAE of the extracted ship speed are 0.33 Kn and 0.41 Kn, and the average ship speed is 7.75 Kn.
When T = 0.7, the speed fluctuation is more prominent. In this case, the MSE and MAE of the ship speed are 3.57 Kn and 1.14 Kn. After haze removal, the fluctuation of the speed curve decreases. Both the MSE and MAE of the ship speed decrease, and the mean ship speed is closer to the ground truth.
The same situation appears for ship No. 2 in Figure 12. The ground-truth curve shows that the speed of the ship is approximately 7.5 Kn. According to the MSE and MAE of ship No. 2, the accuracy of the extracted ship speed can be improved by removing haze.
For ship No. 3 in scene 6, the ship speed curve fluctuates greatly because the image brightness is low, and the accuracy of the extracted ship speed is reduced further once haze noise is superimposed. The MSE of the ship speed at the three haze concentrations is 0.47 Kn, 0.58 Kn, and 0.65 Kn, respectively, and the MAE is 0.60 Kn, 0.60 Kn, and 0.66 Kn. After haze removal, the MSE of the extracted speed is 0.14 Kn, 0.16 Kn, and 0.22 Kn, and the MAE is 0.27 Kn, 0.31 Kn, and 0.38 Kn, respectively. After haze removal, the average ship speed extracted from the images is also closer to the average ground-truth value. This shows that the framework adopted in this paper can effectively enhance the quality of haze images in low-brightness ocean scenes and improve the accuracy of the ship speed extracted from the images.
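The comparison procedure used in this section (linear interpolation of the AIS speed to the video frames, followed by MSE and MAE against the extracted speed) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names, the timestamps in seconds, and the sample values are hypothetical, and the AIS timestamps are assumed to be sorted in increasing order.

```python
from bisect import bisect_right


def interpolate_ais_to_frames(ais_times_s, ais_speeds_kn, frame_times_s):
    """Linearly interpolate AIS speeds (Kn) at each video frame time.

    Assumes ais_times_s is sorted in increasing order. Frame times
    outside the AIS time range are clamped to the nearest AIS sample.
    """
    if not ais_times_s or len(ais_times_s) != len(ais_speeds_kn):
        raise ValueError("AIS times and speeds must be non-empty and equal in length")
    result = []
    for t in frame_times_s:
        if t <= ais_times_s[0]:
            result.append(float(ais_speeds_kn[0]))
        elif t >= ais_times_s[-1]:
            result.append(float(ais_speeds_kn[-1]))
        else:
            i = bisect_right(ais_times_s, t)  # first AIS sample later than t
            t0, t1 = ais_times_s[i - 1], ais_times_s[i]
            v0, v1 = ais_speeds_kn[i - 1], ais_speeds_kn[i]
            result.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return result


def speed_errors(extracted_kn, truth_kn):
    """Return (MSE, MAE) between extracted and ground-truth speeds."""
    if not extracted_kn or len(extracted_kn) != len(truth_kn):
        raise ValueError("speed series must be non-empty and equal in length")
    diffs = [e - g for e, g in zip(extracted_kn, truth_kn)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return mse, mae


if __name__ == "__main__":
    # Hypothetical AIS reports and per-frame extracted speeds.
    truth = interpolate_ais_to_frames([0.0, 10.0], [7.0, 8.0], [0.0, 5.0, 10.0])
    mse, mae = speed_errors([7.0, 8.0, 8.0], truth)
    print(truth, round(mse, 4), round(mae, 4))
```

Since the MSE squares each residual, a few large deviations raise it much more than they raise the MAE, which is consistent with the MSE growing faster than the MAE as the haze concentration increases. Strictly, an MSE computed this way has units of squared knots.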

Conclusions
In this study, a framework for ship detection and ship speed extraction from maritime haze images using deep learning methods is proposed. First, a lightweight CNN is used to remove haze from hazy marine images. Second, YOLOv5 is used to accurately detect ships in the marine images after haze removal. Third, the Deep SORT target tracking algorithm is used to track the ships. Finally, the pixel displacement of each ship is calculated from its trajectory between adjacent image frames, and the ship speed is estimated based on the mapping relationship between the image space and the actual space.
Experimental results demonstrate that the proposed framework effectively enhances the clarity and contrast of marine haze images, as indicated by the mean peak signal-to-noise ratio (PSNR) and mean structural similarity index (SSIM) values of 23.86 and 0.96, respectively. The framework achieves high accuracy in extracting ship speed in multiple marine scenes, with an average accuracy above 95% and strong stability. The proposed speed extraction framework significantly improves the accuracy of ship speed extraction in hazy environments, with the mean squared error (MSE) values of ship speed extracted from the images after haze removal averaging 0.3 Kn lower than those from the images before haze removal.
In future studies, additional marine scenarios will be considered to further verify the practicality of this framework in real-world scenarios.

Figure 1. Frame diagram of the method.

1/t(x) and A into K(x). The e-AOD-Net builds an adaptive depth estimation model based on the physical model of atmospheric scattering and trains the network by minimizing the error between the pixel values of clear and hazy images.
image generation module as an adaptive parameter. The function of the K(x) estimation module is to estimate the depth and haze concentration of hazy images. At the same time, Equation (

Figure 3. The network structure of YOLOv5.

Figure 4. Projection perspective between the camera coordinate system and the image coordinate system.

Figure 5. Translation relationship between the image coordinate system and the pixel coordinate system.

Figure 6. Experimental scenes.
It should be noted that, when training the haze removal network, synthetic haze images were added following [39] to improve the generalization performance of the haze removal network in marine scenes. We synthesized image datasets with three different haze concentrations, and in this study, T represents the haze concentration. The images with the three haze concentrations are denoted as the images with T = 0.3, the images with T = 0.5, and the images with T = 0.7, among which

Figure 10. Comparison of detection effects of YOLOv5 on images before and after haze removal (detection details are highlighted with red boxes).

Author Contributions:
Conceptualization, J.Z., X.C. and Z.Z.; methodology, Z.Z. and Y.C.; software, Z.Z. and Y.C.; validation, Z.Z., Y.C. and J.Z.; investigation, Z.Z. and X.C.; data curation, Z.Z., Y.C. and X.C.; writing-original draft preparation, Z.Z.; writing-review and editing, J.Z. and X.C.; supervision, J.Z. and X.C.; funding acquisition, J.Z. and X.C. All authors have read and agreed to the published version of the manuscript.
Funding: This research was jointly funded by the National Key Research and Development Program of China under Grants 2021YFC2801004 and 2021YFC2801003, the National Natural Science Foundation of China under Grants 52102397, 51709167 and 52071199, the Shanghai Science and Technology Innovation Action Plan under Grant No. 22DZ1204503, and the China Postdoctoral Science Foundation under Grant 2021M700790.

Table 1. Evaluation results of dehazed images (the best results are highlighted in red).

Table 5. Evaluation results of multi-ship tracking in scene 4.
1 IDF1: The ratio of correctly identified detections over the average number of ground-truth and computed detections; 2 MOTA (multi-object tracking accuracy): This measure combines three error sources: false positives, missed targets, and identity switches; 3 MOTP (multiple object tracking precision):

Table 6. Evaluation results of multi-ship tracking in scene 5.

Table 7. Evaluation results of multi-ship tracking in scene 6.

Table 9. MSE values of ship speed.

Table 10. MAE values of ship speed.