Real-Time Wire Rope Inclination Detection Using YOLOv9-Based Camera–LiDAR Fusion for Overhead Cranes

Pham, Anh-Hung; Jung, Ga-Eun; Mai, Xuan-Kien; Go, Byeong-Soo; Lee, Seok-Ju

doi:10.3390/jmse14040393


by Anh-Hung Pham ¹, Ga-Eun Jung ¹, Xuan-Kien Mai ¹, Byeong-Soo Go ² and Seok-Ju Lee ³,*

¹ Department of Electrical Engineering, Changwon National University, Changwon 51140, Republic of Korea

² Research Institute of DNA+, Changwon National University, Changwon 51140, Republic of Korea

³ School of Aerospace Engineering, Glocal Advanced Institute of Science & Technology, Changwon National University, Changwon 51140, Republic of Korea

* Author to whom correspondence should be addressed.

J. Mar. Sci. Eng. 2026, 14(4), 393; https://doi.org/10.3390/jmse14040393

Submission received: 16 January 2026 / Revised: 6 February 2026 / Accepted: 17 February 2026 / Published: 20 February 2026

(This article belongs to the Section Ocean Engineering)


Abstract

Safe and efficient cargo handling is essential in modern port logistics, where overhead cranes are widely used to move containers, bulk materials, and heavy equipment. Accurate real-time measurement of wire rope inclination is critical for preventing collisions, reducing load sway, and enabling autonomous crane operation under challenging maritime conditions. This paper presents a You Only Look Once v9 (YOLOv9)-based camera–LiDAR fusion system for real-time estimation of the trolley–hook rope inclination angle in overhead cranes. A monocular industrial camera and a YOLOv9 detector provide semantic region-of-interest (ROI) masks for the trolley and hook, while a 3D LiDAR sensor, rigidly mounted and extrinsically calibrated to the camera, provides depth information. LiDAR points projected onto the image and filtered by YOLOv9 bounding boxes allow efficient extraction of safety-critical 3D geometry and reconstruction of the rope vector. Experimental results on an overhead crane testbed show that the proposed fusion estimator achieves an angle RMSE below 1° in dynamic swing and low-illumination scenarios, significantly outperforming a camera-only baseline (RMSE ≈ 2.11°). These metrically validated results indicate that the proposed detection pipeline offers a robust foundation for intelligent crane monitoring and automation in maritime logistics and smart port operations.

Keywords:

camera–LiDAR fusion; crane automation; overhead cranes; wire rope inclination; YOLO algorithm

1. Introduction

Overhead cranes are widely used in container yards, shipyards, and port terminals to lift and transport heavy loads in confined and cluttered spaces [1,2,3,4,5]. Safe and efficient operation depends on accurate control and monitoring of trolley motion, hoist motion, and hook behavior. Collision, excessive load sway, and uncontrolled hook motion can cause damage, downtime, and safety incidents, directly affecting terminal throughput and vessel turnaround time [6,7,8,9,10]. For advanced functions such as automatic hook positioning, active sway reduction, and safety zone monitoring, the relationship between the trolley–hook system and the surrounding environment must be captured in a geometrically consistent manner [11,12].

Traditional crane installations often rely on limit switches, encoders, and simple range sensors that provide only limited one-dimensional or two-dimensional information and do not represent the full three-dimensional geometry of the trolley–hook system. Modern approaches therefore increasingly exploit camera-based perception, where deep learning detectors such as YOLO enable real-time recognition of crane-related objects and structures in industrial scenes [13,14,15,16,17]. In addition, recent variants support real-time three-dimensional detection and tracking on LiDAR point clouds while maintaining frame rates suitable for embedded systems [18,19,20].

LiDAR complements cameras by providing metric three-dimensional measurements in the LiDAR coordinate frame. In multi-sensor fusion, LiDAR points can be projected into the camera image plane so that geometry and appearance are processed jointly, and prior studies show that combining LiDAR geometry with camera appearance improves detection and distance estimation compared with using either modality alone [21,22,23,24]. Accurate camera–LiDAR fusion relies on precise extrinsic calibration, where target-based methods estimate the rigid-body transformation between the camera and LiDAR frames and refine it via optimization [25,26,27,28]. More recent frameworks such as You Only Calibrate Once (YOCO) automate much of the process and emphasize repeatable and accurate alignment for long-term deployments [29].

Most existing camera–LiDAR fusion studies target road traffic applications and are not directly tailored to crane operation in industrial or maritime logistics environments. Prior crane vision research often focuses on collision avoidance around ship unloaders or working point extraction inside cargo holds [6,30], while relative pose estimation between the trolley and suspended hook remains less explored. Meanwhile, automated terminals increasingly require real-time perception modules that remain reliable under harsh illumination, vibration, and background clutter [31]. These requirements motivate a dedicated perception method that can estimate rope inclination and trolley–hook geometry with minimal additional instrumentation.

In practice, sway suppression is often implemented using PID controllers that rely on dedicated inclination or angle sensors. However, installing and maintaining such sensors is difficult on large-scale cranes operating in harsh environments. Conventional sensor-based solutions also tend to require multiple devices and supporting infrastructure, increasing deployment and maintenance costs and making measurements more vulnerable to vibration, misalignment, and environmental exposure. In contrast, a camera–LiDAR fusion framework can provide non-contact and infrastructure-light inclination feedback by estimating the trolley–hook geometry directly from three-dimensional measurements, enabling robust operation under partial occlusion and illumination variations.

The present study addresses the lack of camera–LiDAR fusion methods specifically designed to estimate wire rope inclination and trolley–hook pose in overhead cranes operating in maritime logistics. A YOLOv9-based camera–LiDAR fusion model is developed to provide non-contact, geometry-based measurements of rope inclination and trolley–hook distances in real time. A monocular industrial camera, together with a YOLOv9 detector trained on crane-specific images, is used to detect and label the trolley and hook in RGB frames. In the proposed architecture, camera detections primarily provide robust semantic localization of crane components, and an optional vision-based estimate of relative distance and tilt in the image plane is used only as a baseline for comparison. A three-dimensional LiDAR sensor is rigidly mounted and extrinsically calibrated with respect to the camera frame following projection-based calibration concepts reported in the literature [27,29]. LiDAR points are projected into the image plane, and points that lie inside the YOLOv9 bounding boxes are associated with the corresponding objects. To focus computation on safety-critical regions and to keep the processing load compatible with real-time overhead crane operation, the fusion stage restricts point cloud processing to LiDAR points falling inside the image regions corresponding to the detected trolley and hook bounding boxes, consistent with region-based depth extraction strategies proposed in previous camera–LiDAR fusion studies. This fusion step can compute distances from the trolley to the hook and the wire rope inclination angle directly in the LiDAR frame. The proposed model enables real-time wire rope inclination detection and accurate geometric estimation, suitable for integration into automatic overhead crane applications.

Experiments on a laboratory overhead crane show that the proposed camera–LiDAR fusion model achieves an inclination RMSE of about one degree across static, dynamic, and low-illumination scenarios, clearly outperforming a camera-only baseline. This level of accuracy supports safe loading and unloading conditions and facilitates integration into automated crane systems in ports, shipyards, and maritime logistics chains, where robust detection under varying illumination and cluttered backgrounds is essential.

To address these challenges, the primary contributions of this work are threefold.

Geometric formulation: The trolley–hook configuration and inclination angle are formulated directly in the 3D LiDAR frame, providing a physically meaningful metric that overcomes the depth ambiguity of monocular vision.
Crane-specific fusion pipeline: A fine-tuned YOLOv9 detector is utilized to semantically localize the trolley and hook, creating projection-based regions of interest (ROI) that constrain LiDAR point processing. This approach isolates the target 3D structures, enabling the precise derivation of the wire rope inclination in real time.
Experimental validation: The system is validated on an overhead crane testbed under static, dynamic swing, and low-light conditions, demonstrating superior accuracy (RMSE < 1°) compared to camera-only baselines.

2. Materials and Methods

Before presenting the proposed pipeline, this section briefly motivates the sensor choice for overhead crane monitoring. Table 1 summarizes representative approaches and compares their trade-offs in installation effort, environmental robustness, metric three-dimensional accuracy, and real-time suitability. This comparison clarifies the gap addressed by the crane-specific camera and LiDAR fusion framework described in the following subsections.

As shown in Table 1, traditional encoder- or IMU-based sensing offers high precision but increases installation complexity and maintenance burden. Camera-only perception is lightweight but suffers from depth ambiguity and sensitivity to lighting. The proposed camera–LiDAR fusion framework is designed to balance these trade-offs for crane-specific trolley and hook geometry.

2.1. Coordinate Frames of Camera and LiDAR

The objective of this work is to estimate in real time the 3D relative pose between the trolley and the suspended hook of an overhead crane and to derive a physically meaningful measure of wire rope inclination. The proposed system receives as inputs a monocular RGB image and a 3D LiDAR point cloud acquired synchronously from sensors mounted on or near the crane structure. From these data, the perception module must (i) detect the trolley and hook in the image, (ii) associate LiDAR points with the detected objects, and (iii) compute the 3D distance between trolley and hook as well as the inclination angle of the wire rope.

Let $F_L = \{X_L, Y_L, Z_L\}$ denote the LiDAR coordinate frame, in which raw point clouds are expressed as:

$$P_L = [X_L, Y_L, Z_L]^{T} \tag{1}$$

where $P_L$ is the 3D coordinate vector of a generic LiDAR point in the LiDAR frame $F_L$, and $X_L$, $Y_L$, and $Z_L$ are its coordinates along the LiDAR $X$, $Y$, and $Z$ axes, respectively.

The LiDAR sensor is rigidly mounted on the crane superstructure such that its $Z_L$ axis is approximately aligned with the global vertical direction, while the $X_L$ and $Y_L$ axes span the horizontal plane. This mounting configuration makes $F_L$ a natural reference for expressing wire rope geometry and defining an inclination angle relative to the vertical axis.

The monocular camera is described by the camera coordinate frame $F_C = \{X_C, Y_C, Z_C\}$, where $Z_C$ points along the optical axis and $X_C$ and $Y_C$ are aligned with the horizontal and vertical axes of the image. A 3D point in the camera frame can be expressed as:

$$P_C = [X_C, Y_C, Z_C]^{T} \tag{2}$$

where $P_C$ is the 3D coordinate vector of a point expressed in the camera frame $F_C$, and $X_C$, $Y_C$, and $Z_C$ denote its coordinates along the camera $X$, $Y$, and $Z$ axes, respectively.

A 3D point in the camera frame is mapped to pixel coordinates $(u, v)$ by the pinhole camera model in (3):

$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X_C / Z_C \\ Y_C / Z_C \\ 1 \end{bmatrix} \tag{3}$$

where $u$ and $v$ denote the horizontal and vertical pixel coordinates on the image plane, $K$ is the intrinsic calibration matrix containing the focal lengths and principal point of the camera, and $X_C$, $Y_C$, $Z_C$ denote the coordinates of the corresponding 3D point in the camera frame $F_C$. The rigid transformation between the LiDAR and camera frames is described by the extrinsic parameters $(R, t)$, which relate a point $P_L$ in $F_L$ to its coordinates $P_C$ in $F_C$ according to (4):

$$P_C = R \cdot P_L + t \tag{4}$$

where $R$ is the rotation matrix and $t$ is the translation vector from $F_L$ to $F_C$. These parameters are obtained offline through an extrinsic calibration procedure and are assumed fixed during operation.
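
Equations (3) and (4) can be illustrated with a minimal, standard-library Python sketch. The function names, the bounds check, and the numeric values below are illustrative assumptions rather than the authors' implementation; the third row of $K$ is assumed to be $[0, 0, 1]$ and lens distortion is ignored.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def lidar_to_camera(p_l: Vec3, R: Mat3, t: Vec3) -> Tuple[float, ...]:
    """Equation (4): P_C = R * P_L + t."""
    return tuple(sum(R[r][c] * p_l[c] for c in range(3)) + t[r] for r in range(3))


def project_point(p_l: Vec3, R: Mat3, t: Vec3, K: Mat3,
                  width: int, height: int) -> Optional[Tuple[float, float]]:
    """Equations (3)-(4): map a LiDAR point to pixel coordinates (u, v).

    Returns None when the point lies behind the camera or projects
    outside the image (hypothetical convention for this sketch).
    """
    x_c, y_c, z_c = lidar_to_camera(p_l, R, t)
    if z_c <= 0.0:
        return None
    xn, yn = x_c / z_c, y_c / z_c  # normalized image coordinates
    u = K[0][0] * xn + K[0][1] * yn + K[0][2]
    v = K[1][0] * xn + K[1][1] * yn + K[1][2]
    if 0.0 <= u < width and 0.0 <= v < height:
        return (u, v)
    return None


# Illustrative values only: identity extrinsics and a generic intrinsic matrix.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K_DEMO = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
pixel = project_point((0.2, -0.1, 2.0), IDENTITY, (0.0, 0.0, 0.0), K_DEMO, 640, 480)
```

In a real setup the rotation $R$ would not be the identity, since the LiDAR $Z_L$ axis is roughly vertical while the camera $Z_C$ axis points along the optical axis.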

Within this geometric framework, the trolley and the hook are represented by the 3D coordinates of their effective centers in the LiDAR frame, denoted by (5):

$$C_{t,L} = [X_{t,L}, Y_{t,L}, Z_{t,L}]^{T}, \quad C_{h,L} = [X_{h,L}, Y_{h,L}, Z_{h,L}]^{T} \tag{5}$$

where $X_{t,L}$, $Y_{t,L}$, $Z_{t,L}$ are the coordinates of the trolley center along the LiDAR axes, and $X_{h,L}$, $Y_{h,L}$, $Z_{h,L}$ are defined analogously for the hook center.

These centers are not measured directly; instead, they are estimated by projecting LiDAR points into the image, selecting points that fall inside YOLOv9 detection bounding boxes for the trolley and hook, and then aggregating the selected points in $F_L$ using a robust clustering strategy. Once $C_{t,L}$ and $C_{h,L}$ are available, the 3D vector from trolley to hook in the LiDAR frame is defined in Equation (6):

$$v_L = C_{h,L} - C_{t,L} = [d_x, d_y, d_z]^{T} \tag{6}$$

where $v_L$ is the 3D vector from the trolley center to the hook center expressed in the LiDAR frame, and $d_x$, $d_y$, $d_z$ denote its components along $X_L$, $Y_L$, $Z_L$, respectively.

The wire rope is assumed to coincide with the straight-line segment connecting the trolley and hook centers. The horizontal displacement $h$ of the rope in the LiDAR frame is then quantified by Equation (7):

$$h = \sqrt{d_x^{2} + d_y^{2}} \tag{7}$$

where $d_x$ and $d_y$ are the $X_L$ and $Y_L$ components of the vector $v_L = [d_x, d_y, d_z]^{T}$ from the trolley center to the hook center. The vertical displacement $v$ is given by Equation (8):

$$v = | d_z | \tag{8}$$

Under the assumption that the $Z_L$ axis is aligned with the direction of gravity, the wire rope inclination angle $\theta$ with respect to the vertical can be defined as:

$$\theta = \operatorname{arctan2}(h, v) \tag{9}$$

where $\theta = 0°$ corresponds to a perfectly vertical rope and larger values of $\theta$ indicate increasing deviation from the vertical and therefore higher risk of excessive load sway or collision.
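
The rope geometry defined by the trolley-to-hook vector, its horizontal and vertical components, and the inclination angle can be sketched in a few lines of Python. The function name and the example coordinates are hypothetical; the sketch assumes, as in the text, that the LiDAR $Z$ axis is aligned with gravity.

```python
import math
from typing import Sequence, Tuple


def rope_geometry(c_t: Sequence[float], c_h: Sequence[float]) -> Tuple[float, float]:
    """Return (rope length, inclination in degrees) from trolley and hook centers.

    c_t and c_h are 3D centers in the LiDAR frame, whose Z axis is assumed
    to be aligned with gravity.
    """
    d_x, d_y, d_z = (h - t for h, t in zip(c_h, c_t))  # trolley-to-hook vector
    horizontal = math.hypot(d_x, d_y)                   # horizontal displacement
    vertical = abs(d_z)                                 # vertical displacement
    length = math.sqrt(d_x ** 2 + d_y ** 2 + d_z ** 2)  # Euclidean norm
    theta_deg = math.degrees(math.atan2(horizontal, vertical))
    return length, theta_deg


# Illustrative centers in metres (not measured data).
length_m, theta_deg = rope_geometry((0.0, 0.0, 3.0), (0.3, 0.4, 0.5))
```

Because both arguments passed to `atan2` are non-negative, the returned angle always lies between 0° and 90°, so the sign of the sway direction is not encoded in it.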

In summary, the perception problem considered in this study is to estimate $C_{t,L}$, $C_{h,L}$, and the resulting quantities $\| v_L \|$ and $\theta$ for each sensor frame, using only the camera image, LiDAR point cloud, and the known extrinsic transformation between $F_L$ and $F_C$. The subsequent subsections describe the proposed camera–LiDAR fusion algorithm that solves this problem by combining YOLOv9-based object detection, projection-based region of interest selection, and robust clustering in the LiDAR frame.

2.2. Sensor Specifications and Mounting Configuration

The proposed perception system, as shown in Figure 1, is implemented on a laboratory overhead crane and consists of a monocular APC930 QHD RGB camera, a Velodyne VLP-16 LiDAR sensor, and an industrial PC running the fusion software. The LiDAR vertical axis is approximately aligned with gravity so that the LiDAR frame $F_L$ can be used directly as the reference for three-dimensional rope geometry and inclination computation, while the camera is tilted upwards to keep the trolley and hook within the image field of view over their operating range. The main sensor characteristics relevant to the experiments, including resolution, field of view, range, and frame rate, are summarized in Table 2.

At each perception cycle, the RGB image is processed by a YOLOv9 detector trained on crane-specific data to obtain bounding boxes and class labels for the trolley and the hook. In parallel, the latest LiDAR scan is transformed from the LiDAR frame to the camera frame using the extrinsic parameters $(R, t)$ and projected into pixel coordinates through the pinhole model. This associates each LiDAR point with a location in the image plane and allows points whose projections fall inside a YOLOv9 bounding box to be assigned to the corresponding object. Subsequent processing is carried out in the LiDAR frame: associated points are filtered by simple distance-based rules to remove background and structural clutter, then a robust clustering step is used to estimate the object centers $C_{t,L}$ and $C_{h,L}$. From these centers, the three-dimensional rope vector, its Euclidean length, and the inclination angle are computed as defined in Section 2.1. The estimated angles and distances are finally forwarded to the crane control or monitoring system, while restricting LiDAR processing to regions of interest defined by the YOLOv9 detections keeps the computational load compatible with real-time operation on a conventional PC platform.

2.3. Camera–LiDAR Calibration

Reliable estimation of the trolley–hook geometry requires an accurate rigid transformation between the LiDAR frame $F_L$ and the camera frame $F_C$ used in Equations (3) and (4). In this work, a standard camera–LiDAR calibration procedure is performed offline before the experiments to identify the extrinsic parameters $(R, t)$ and to refine the camera intrinsics $K$. Figure 2 schematically illustrates this calibration setup.

A planar checkerboard target is placed in the hoisting area and successively moved to different positions and orientations within the common field of view of the APC930 camera and the Velodyne VLP-16 LiDAR. For each pose of the target, one RGB image and the corresponding LiDAR scan are recorded. Corner points of the checkerboard are detected in the image, while the associated LiDAR returns are obtained by projecting the point cloud onto the image using initially assumed camera–LiDAR extrinsic parameters defined in (3) and (4) and selecting points that lie on the checkerboard region. A plane is then fitted to these selected LiDAR points to recover the 3D coordinates of the checkerboard corners in the LiDAR frame.

Given the 3D corner locations in $F_L$ and their 2D pixel coordinates in the image, the extrinsic parameters $R$ and $t$ are estimated by solving a nonlinear least squares problem that minimizes the reprojection error under the pinhole model (3). The intrinsic matrix $K$ is obtained from a conventional camera calibration at full resolution and, if necessary, is slightly refined during this optimization. The resulting calibration yields subpixel reprojection errors on the training set.
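
As a rough illustration of the quantity being minimized, the sketch below evaluates the RMS reprojection error for a candidate $(R, t)$ given 3D corner positions in the LiDAR frame and their observed pixel locations. It only computes the cost; the nonlinear least squares solver itself is not shown, and the function names and sample values are assumptions made for this example.

```python
import math
from typing import Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def project(p_l: Vec3, R: Mat3, t: Vec3, K: Mat3) -> Tuple[float, float]:
    """Pinhole projection of a LiDAR-frame point (assumes positive depth)."""
    x_c, y_c, z_c = (sum(R[r][c] * p_l[c] for c in range(3)) + t[r] for r in range(3))
    xn, yn = x_c / z_c, y_c / z_c
    u = K[0][0] * xn + K[0][1] * yn + K[0][2]
    v = K[1][0] * xn + K[1][1] * yn + K[1][2]
    return u, v


def rms_reprojection_error(corners_l: Sequence[Vec3],
                           corners_px: Sequence[Tuple[float, float]],
                           R: Mat3, t: Vec3, K: Mat3) -> float:
    """RMS pixel distance between projected and observed checkerboard corners."""
    if not corners_l or len(corners_l) != len(corners_px):
        raise ValueError("need the same non-zero number of 3D and 2D corners")
    squared = 0.0
    for p_l, (u_obs, v_obs) in zip(corners_l, corners_px):
        u, v = project(p_l, R, t, K)
        squared += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(squared / len(corners_l))


# Illustrative data: observations offset by (3, 4) pixels from a perfect fit.
EYE = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K_DEMO = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
corners = [(0.0, 0.0, 2.0), (0.2, 0.1, 2.0), (-0.2, 0.1, 4.0)]
exact = [project(p, EYE, (0.0, 0.0, 0.0), K_DEMO) for p in corners]
shifted = [(u + 3.0, v + 4.0) for u, v in exact]
err = rms_reprojection_error(corners, shifted, EYE, (0.0, 0.0, 0.0), K_DEMO)
```

An optimizer would adjust the six extrinsic degrees of freedom until this value stops decreasing; a subpixel result corresponds to a value below 1.0.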

The quality of the calibration is verified qualitatively by projecting dense LiDAR point clouds into several crane images using the estimated $(R, t)$ and checking that structural edges of the bridge, trolley, and hook align well with the corresponding image features. After the offline step, the matrices $K$ and $R$ and the vector $t$ are kept fixed throughout all experiments.

2.4. Real-Time Data Acquisition and Synchronization

Real-time operation of the proposed perception system is achieved by acquiring LiDAR data and camera images in parallel and fusing the most recent LiDAR frame with each incoming RGB image, as summarized in Figure 3. LiDAR scans from the Velodyne VLP-16 are received via UDP and accumulated in a background thread to form a three-dimensional point cloud in the LiDAR frame $F_L$ at a nominal rate of 10 Hz. Only returns with non-zero range are converted to Cartesian coordinates and stored in a shared buffer together with their timestamp.

Camera images are acquired in the main thread using a standard video capture interface. As listed in Table 3, the camera operates at a reduced processing resolution compared with the native calibration resolution, and the intrinsic matrix used in the projection model is obtained by scaling the full-resolution calibration matrix according to the horizontal and vertical resize factors. This preserves metric consistency while keeping the image size suitable for real-time YOLOv9 inference on the target PC. The specific input resolution of 640 × 640 pixels was selected to align with the YOLOv9 architecture, offering an optimal balance between computational efficiency and detection performance. Given that the trolley and hook are prominent structures, this resolution ensures sufficient accuracy while maintaining the low latency required for real-time control.

At each perception cycle, the main thread reads the latest LiDAR frame from the shared buffer and projects all LiDAR points into the current image using the extrinsic transformation and the intrinsic matrix. Points with positive depth and valid pixel coordinates within the image bounds are retained, establishing the correspondence between LiDAR measurements and image locations. In parallel, the resized RGB frame is passed through the YOLOv9 network to produce bounding boxes and class labels for the trolley and the hook. The associated LiDAR points whose projected pixels fall inside each bounding box are then forwarded to the subsequent clustering and inclination computation steps described in Section 2.5 and Section 2.6.

Since the LiDAR receiver runs continuously and the buffer is flushed at a fixed interval, the temporal misalignment between images and point clouds is bounded by the LiDAR frame period. For the relatively slow trolley and hook motions considered on the experimental overhead crane, this level of synchronization is sufficient and does not introduce perceptible errors in the estimated distances or inclination angles. By decoupling LiDAR reception from the main processing thread and restricting LiDAR computation to object-centered regions of interest defined by YOLOv9 detections, the implementation maintains stable real-time perception rates on a conventional PC platform.
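
The decoupling of LiDAR reception from the main thread can be pictured with a small lock-protected buffer that always holds the newest scan. This is a sketch under assumptions: the class and function names are hypothetical, UDP packet parsing is omitted, and the 0.1 s freshness bound simply reflects one frame period at the nominal 10 Hz rate.

```python
import threading
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


class LatestLidarFrame:
    """Thread-safe holder for the most recent LiDAR frame and its timestamp."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._points: List[Point] = []
        self._stamp: Optional[float] = None

    def publish(self, points: Sequence[Point], stamp: float) -> None:
        """Called by the background receiver once a full scan is accumulated."""
        with self._lock:
            self._points = list(points)
            self._stamp = stamp

    def latest(self) -> Tuple[List[Point], Optional[float]]:
        """Called by the main thread at each perception cycle; returns a copy."""
        with self._lock:
            return list(self._points), self._stamp


def is_fresh(frame_stamp: Optional[float], image_stamp: float,
             max_age_s: float = 0.1) -> bool:
    """True when the LiDAR frame is within one frame period of the image."""
    return frame_stamp is not None and abs(image_stamp - frame_stamp) <= max_age_s


buffer = LatestLidarFrame()
buffer.publish([(1.0, 0.0, 2.0)], stamp=10.00)
points, stamp = buffer.latest()
usable = is_fresh(stamp, image_stamp=10.05)
```

Returning a copy from `latest()` keeps the main thread from iterating over a list that the receiver thread is about to replace.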

The real-time fusion pipeline depicted in Figure 3 proceeds by reading the latest accumulated LiDAR frame and projecting points into the image using the calibrated extrinsic parameters. Simultaneously, YOLOv9 inference is performed on the RGB frame to detect the trolley and hook. The projected LiDAR points falling within these detected bounding boxes are then associated with the corresponding objects and passed to the subsequent clustering and 3D inclination computation stages described in Section 2.5 and Section 2.6.

2.5. YOLOv9-Based Trolley and Hook Detection

In the proposed system, the camera is used to recognize the trolley and the hook and to provide compact regions of interest in the image plane for LiDAR processing. To ensure robust performance and reproducibility, the YOLOv9 detector was trained on a specific dataset of 1000 images capturing the laboratory overhead crane under varying trolley positions, hoisting heights, and illumination conditions. The images were manually annotated using bounding boxes on the Roboflow platform. To achieve high accuracy with this specific dataset, this study adopted a transfer learning approach in which the model was initialized with public pretrained weights and fine-tuned on the crane imagery. The training process was executed on a high-performance workstation and incorporated a comprehensive data augmentation pipeline to enhance generalization. Standard YOLOv9 strategies, including Mosaic and MixUp, were employed to improve geometric invariance, while photometric distortions (HSV) were applied to simulate diverse illumination environments. The detailed dataset distribution, hardware specifications, and specific training hyperparameters are summarized in Table 4.

Assuming the camera is calibrated at resolution $W_{cal} \times H_{cal}$, the coordinate scaling factors for mapping between the resized YOLOv9 input and the original image are defined as:

$$s_u = \frac{W}{W_{cal}}, \quad s_v = \frac{H}{H_{cal}} \tag{10}$$

where $(\tilde{u}, \tilde{v})$ denote pixel coordinates in the resized YOLOv9 input image and $(u, v)$ the corresponding coordinates in the original image of width $W$ and height $H$, $W_{cal} \times H_{cal}$ denotes the image size used for camera calibration, and $s_u$ and $s_v$ are the horizontal and vertical scaling factors between the calibration image and the resized input, respectively.
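
One common way to apply such scaling factors to the intrinsic matrix is to multiply its first row by $s_u$ and its second row by $s_v$. The sketch below follows that convention; the function name, the 2560 × 1440 calibration size, and the focal length values are assumptions for illustration, and half-pixel center offsets are ignored.

```python
from typing import List, Sequence, Tuple


def scale_intrinsics(K_cal: Sequence[Sequence[float]],
                     cal_size: Tuple[int, int],
                     new_size: Tuple[int, int]) -> List[List[float]]:
    """Rescale a full-resolution intrinsic matrix to a resized image.

    cal_size and new_size are (width, height) pairs. Row 0 (fx, skew, cx)
    scales with s_u and row 1 (fy, cy) scales with s_v.
    """
    w_cal, h_cal = cal_size
    w, h = new_size
    s_u, s_v = w / w_cal, h / h_cal
    return [
        [K_cal[0][0] * s_u, K_cal[0][1] * s_u, K_cal[0][2] * s_u],
        [K_cal[1][0] * s_v, K_cal[1][1] * s_v, K_cal[1][2] * s_v],
        [0.0, 0.0, 1.0],
    ]


# Hypothetical full-resolution calibration, resized to the 640 x 640 input.
K_full = [[2000.0, 0.0, 1280.0], [0.0, 2000.0, 720.0], [0.0, 0.0, 1.0]]
K_small = scale_intrinsics(K_full, cal_size=(2560, 1440), new_size=(640, 640))
```

When the resize changes the aspect ratio, as in this example, the horizontal and vertical focal lengths of the scaled matrix differ even if they were equal at full resolution.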

For the ROI selection in the fusion stage, each detection is represented by the center of its bounding box, computed as (11):

$$c_I^{k} = \begin{bmatrix} u_c^{k} \\ v_c^{k} \end{bmatrix} = \frac{1}{2} \begin{bmatrix} u_{\min}^{k} + u_{\max}^{k} \\ v_{\min}^{k} + v_{\max}^{k} \end{bmatrix} \tag{11}$$

where $u_c^{k}$ and $v_c^{k}$ denote the center of the $k$-th detected object in the resized YOLOv9 input image, $u_{\min}^{k}$ and $v_{\min}^{k}$ denote the column and row coordinates of the top-left corner in the resized image, and $u_{\max}^{k}$ and $v_{\max}^{k}$ those of the bottom-right corner.

Only detections whose confidence exceeds a threshold of 0.5 and whose intersection over union (IoU) with other detections is below 0.5 after non-maximum suppression are retained. In typical operating conditions, this results in a single high-confidence detection for the trolley and a single detection for the hook in each frame. These detections then define the ROIs used in the fusion stage: LiDAR points whose projections fall inside the trolley or hook bounding boxes are associated with the corresponding object and passed to the 3D center estimation procedure.
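
The detection filtering and point-to-object association described above might look roughly like the following. The `Detection` tuple, the function names, and the sample boxes are hypothetical, the detector output format is assumed rather than taken from a specific YOLOv9 API, and non-maximum suppression is assumed to have been applied already.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple

Pixel = Tuple[float, float]
Point = Tuple[float, float, float]


class Detection(NamedTuple):
    label: str                               # "trolley" or "hook"
    confidence: float
    box: Tuple[float, float, float, float]   # (u_min, v_min, u_max, v_max)


def best_detections(detections: Sequence[Detection],
                    min_conf: float = 0.5) -> Dict[str, Detection]:
    """Keep, per class, the highest-confidence detection above the threshold."""
    best: Dict[str, Detection] = {}
    for det in detections:
        if det.confidence <= min_conf:
            continue
        if det.label not in best or det.confidence > best[det.label].confidence:
            best[det.label] = det
    return best


def points_in_box(projected: Sequence[Tuple[Pixel, Point]],
                  box: Tuple[float, float, float, float]) -> List[Point]:
    """Return the LiDAR points whose projected pixel lies inside the box."""
    u_min, v_min, u_max, v_max = box
    return [p for (u, v), p in projected
            if u_min <= u <= u_max and v_min <= v <= v_max]


dets = [Detection("hook", 0.90, (300.0, 380.0, 340.0, 420.0)),
        Detection("hook", 0.60, (100.0, 100.0, 140.0, 140.0)),
        Detection("trolley", 0.40, (280.0, 0.0, 360.0, 60.0))]
kept = best_detections(dets)
projected = [((320.0, 400.0), (0.1, 0.0, 2.5)), ((50.0, 50.0), (3.0, 1.0, 6.0))]
hook_points = points_in_box(projected, kept["hook"].box)
```

In this toy input the trolley detection falls below the confidence threshold, so only the hook produces a region of interest for that frame.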

In addition to serving as ROIs for LiDAR fusion, the YOLOv9 detections can also be used to compute an image-based estimate of the wire rope inclination angle by considering the line segment connecting the centers of the trolley and hook bounding boxes. This 2D inclination provides a useful baseline for comparison with the 3D LiDAR-based inclination angle introduced later, highlighting the benefits of incorporating depth information from the LiDAR sensor.
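
A possible form of this image-based baseline is sketched below. The source does not give the exact formula, so the sketch assumes that the image vertical axis approximates the projected gravity direction and measures the angle of the center-to-center segment against it; the function names and box values are illustrative.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (u_min, v_min, u_max, v_max)


def box_center(box: Box) -> Tuple[float, float]:
    """Bounding-box center: the mean of the two opposite corners."""
    u_min, v_min, u_max, v_max = box
    return 0.5 * (u_min + u_max), 0.5 * (v_min + v_max)


def image_plane_inclination_deg(trolley_box: Box, hook_box: Box) -> float:
    """Angle between the trolley-hook segment and the image vertical axis.

    Assumption: the image v axis stands in for the vertical direction,
    which is only approximate for a tilted camera.
    """
    u_t, v_t = box_center(trolley_box)
    u_h, v_h = box_center(hook_box)
    return math.degrees(math.atan2(abs(u_h - u_t), abs(v_h - v_t)))


angle_2d = image_plane_inclination_deg((300.0, 0.0, 340.0, 40.0),
                                       (310.0, 380.0, 350.0, 420.0))
```

Such a 2D angle cannot observe sway along the optical axis, which is one way to see why a depth-aware estimate is expected to be more accurate.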

2.6. LiDAR-Based 3D Center Estimation and Wire Rope Inclination Computation

After camera-based detection has provided image plane regions of interest (ROI) for the trolley and hook, the LiDAR modality is used to recover their three-dimensional positions and to compute the wire rope inclination angle. All geometric quantities in this section are expressed in the LiDAR frame $F_L$, where the $Z_L$-axis is approximately aligned with the vertical direction.

For each object $k \in \{T, H\}$ (T is the trolley and H is the hook), let $S_k$ denote the set of LiDAR points whose projections fall inside the corresponding YOLOv9 bounding box in the image plane. Each point in $S_k$ is represented as:

$$P_{i,L}^{k} = \begin{bmatrix} x_i^{k} \\ y_i^{k} \\ z_i^{k} \end{bmatrix}, \quad i = 1, \dots, N_k \tag{12}$$

where $P_{i,L}^{k}$ is the 3D coordinate of the $i$-th LiDAR point associated with object $k$, expressed in the LiDAR frame $F_L$; $x_i^{k}$, $y_i^{k}$, and $z_i^{k}$ are its Cartesian coordinates along $X_L$, $Y_L$, and $Z_L$, respectively; and $N_k$ is the number of associated LiDAR points for object $k$. The raw ranges of these points with respect to the LiDAR origin are:

$$d_i^{k} = \| P_{i,L}^{k} \|_2 = \sqrt{(x_i^{k})^{2} + (y_i^{k})^{2} + (z_i^{k})^{2}} \tag{13}$$

where $d_i^{k}$ is the Euclidean range from the LiDAR origin to the associated point $P_{i,L}^{k}$ for object $k$.

Because the trolley and hook are located in front of the sensor, background structures such as walls or ceiling beams often appear behind them in the same bounding box. A simple distance-based filtering step is applied to suppress this background clutter. For each object $k$, the minimum range $d_{\min}^{k}$ over all points in $S_k$ is defined in Equation (14):

$$d_{\min}^{k} = \min_{i} d_i^{k} \tag{14}$$

Only points within a small margin of this minimum are retained. The resulting near-field subset is defined in Equation (15):

$$S_k^{near} = \{ P_{i,L}^{k} \in S_k \mid d_i^{k} \leq d_{\min}^{k} + \Delta d \} \tag{15}$$

where $S_k^{near}$ denotes the near-field subset of LiDAR points associated with object $k$, $S_k$ is the full set of associated points, $P_{i,L}^{k} \in S_k$ is the $i$-th associated LiDAR point, $d_i^{k}$ is its range from the LiDAR origin, $d_{\min}^{k}$ is the minimum range, and $\Delta d$ is a small distance margin.

This rule keeps points that are close to the visible surface of the object and discards more distant points that likely belong to the background.

A robust estimate of the 3D center of each object is obtained by taking the component-wise median of the near-field points. The LiDAR frame centers of the trolley and hook are defined as:

$$c_{T,L} = \begin{bmatrix} \operatorname{median}\{x_i^{T}\} \\ \operatorname{median}\{y_i^{T}\} \\ \operatorname{median}\{z_i^{T}\} \end{bmatrix}, \quad c_{H,L} = \begin{bmatrix} \operatorname{median}\{x_i^{H}\} \\ \operatorname{median}\{y_i^{H}\} \\ \operatorname{median}\{z_i^{H}\} \end{bmatrix} \tag{16}$$

where $c_{T,L}$ and $c_{H,L}$ denote the estimated 3D centers of the trolley and the hook in the LiDAR frame $F_L$, respectively. Here, $(x_i^{k}, y_i^{k}, z_i^{k})$ are the Cartesian coordinates of the near-field points for $k \in \{T, H\}$. The use of medians instead of means reduces the influence of residual outliers and small clusters of background points that may survive the distance filter. While advanced methods such as DBSCAN and RANSAC are widely used for outlier handling in point cloud processing, this study adopts a lightweight pipeline that combines distance-based filtering with median-based center estimation. This design choice is motivated by the strict real-time requirements of crane control. Since the YOLOv9 detector already constrains LiDAR processing to the object region of interest, the effective search space is substantially reduced and most background clutter is removed before 3D estimation. Within this restricted region, the coordinate-wise median used in Equation (16) provides center estimates that are robust against random sensor noise and sparse outliers, while avoiding the computational overhead and timing variability of iterative clustering or robust fitting procedures.
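
The near-field filter and the coordinate-wise median can be combined in a short routine such as the one below. It is a sketch only: the function name is hypothetical, and the 0.5 m margin is an assumed value because the source does not state the margin that was used.

```python
import math
from statistics import median
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]


def near_field_center(points: Sequence[Point], margin: float) -> Optional[Point]:
    """Estimate an object center from the LiDAR points inside its ROI.

    Keeps points whose range is within `margin` of the minimum range,
    then takes the median of each coordinate over the retained points.
    """
    if not points:
        return None
    ranges = [math.sqrt(x * x + y * y + z * z) for x, y, z in points]
    d_min = min(ranges)
    near = [p for p, d in zip(points, ranges) if d <= d_min + margin]
    xs, ys, zs = zip(*near)
    return (median(xs), median(ys), median(zs))


# Illustrative ROI: three returns from the object, two from a wall behind it.
roi_points = [(2.0, 0.0, 0.0), (2.1, 0.1, 0.0), (2.2, -0.1, 0.1),
              (6.0, 0.0, 0.0), (6.1, 0.2, 0.0)]
center = near_field_center(roi_points, margin=0.5)
```

Both steps are single passes over the ROI points followed by a sort for the median, which keeps the per-frame cost predictable compared with iterative clustering.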

The 3D trolley–hook vector in the LiDAR frame is then given by:

$$v_L = c_{H,L} - c_{T,L} = \begin{bmatrix} \Delta x_L \\ \Delta y_L \\ \Delta z_L \end{bmatrix} \tag{17}$$

where $c_{T,L}$ and $c_{H,L}$ are the LiDAR frame centers of the trolley and the hook, respectively.

The Euclidean norm of this vector, which represents the 3D distance between the trolley and the hook, is defined as (18):

$$d_{TH} = \| v_L \|_2 \tag{18}$$

where $\| v_L \|_2$ denotes the Euclidean length of the 3D displacement vector between the trolley and the hook.

In addition, the ranges from the LiDAR sensor to the trolley and to the hook are defined as (19):

$$d_T = \| c_{T,L} \|_2, \quad d_H = \| c_{H,L} \|_2 \tag{19}$$

where $\| c_{T,L} \|_2$ and $\| c_{H,L} \|_2$ denote the Euclidean distances from the LiDAR origin to the trolley center and the hook center, respectively. These ranges can be used for visualization and for enforcing simple consistency constraints in the implementation.

To compute the wire rope inclination angle, the 3D vector v_L is decomposed into horizontal and vertical components in the LiDAR frame. The corresponding magnitudes are defined as:

d_{hor} = \sqrt{∆x_L^2 + ∆y_L^2}, \quad d_{ver} = |∆z_L|

(20)

where d_{hor} denotes the horizontal displacement magnitude of v_L in the X_L–Y_L plane, d_{ver} denotes the vertical displacement magnitude along the Z_L axis, and ∆x_L, ∆y_L, ∆z_L are the X_L, Y_L, Z_L components of v_L, respectively.

The inclination angle θ between the wire rope and the vertical direction is then obtained from Equation (21):

θ = \operatorname{arctan2}(d_{hor}, d_{ver})

(21)

where \operatorname{arctan2}(d_{hor}, d_{ver}) denotes the four-quadrant inverse tangent. Because both arguments are non-negative magnitudes, θ lies between 0° and 90°. In this convention, θ = 0° corresponds to a perfectly vertical rope, while larger values of θ indicate increasing lateral deflection. The inclination angle in Equation (21) represents the effective suspension inclination defined by the line segment connecting the trolley and hook centers. This formulation assumes that the rope remains approximately straight within the tested operating conditions, which is reasonable in indoor crane operation where the rope is under high tension. If significant rope bending occurs due to strong wind or severe vibration, the proposed framework can be extended to estimate rope curvature by fitting a curve to multiple rope points.

In the real-time implementation, these computations are carried out at each perception cycle using the current LiDAR frame and the associated YOLOv9 detections. The resulting distances d_{TH}, d_T, and d_H and the inclination angle θ are fed to the monitoring interface and can be used as inputs to higher-level crane control functions such as sway reduction or collision avoidance.
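As a minimal sketch of Equations (17)–(21), the per-cycle geometry can be written in a few lines of Python. The function name `rope_geometry`, the dictionary keys, and the example center coordinates are illustrative assumptions; the sketch also assumes, as Equation (20) does, that the Z_L axis of the LiDAR frame is aligned with the vertical.

```python
import numpy as np

def rope_geometry(c_trolley, c_hook):
    """Distances and inclination from the trolley and hook centers (LiDAR frame, metres)."""
    c_t = np.asarray(c_trolley, dtype=float)
    c_h = np.asarray(c_hook, dtype=float)

    v = c_h - c_t                          # Eq. (17): trolley-to-hook vector
    d_th = float(np.linalg.norm(v))        # Eq. (18): trolley-hook distance
    d_t = float(np.linalg.norm(c_t))       # Eq. (19): sensor-to-trolley range
    d_h = float(np.linalg.norm(c_h))       # Eq. (19): sensor-to-hook range
    d_hor = float(np.hypot(v[0], v[1]))    # Eq. (20): horizontal magnitude
    d_ver = float(abs(v[2]))               # Eq. (20): vertical magnitude
    theta_deg = float(np.degrees(np.arctan2(d_hor, d_ver)))  # Eq. (21)

    return {"d_th": d_th, "d_t": d_t, "d_h": d_h,
            "d_hor": d_hor, "d_ver": d_ver, "theta_deg": theta_deg}

# Illustrative centers (not measured data): hook 3 m below the trolley, 0.3 m off-axis.
trolley_center = [4.0, 0.0, 2.0]
hook_center = [4.0, 0.3, -1.0]
geom = rope_geometry(trolley_center, hook_center)
print(f"theta = {geom['theta_deg']:.2f} deg, d_TH = {geom['d_th']:.3f} m")
```

Because d_{hor} and d_{ver} are magnitudes, the result is an unsigned angle; a signed sway direction would require keeping the signs of ∆x_L and ∆y_L, which is outside what Equation (21) defines.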

3. Experimental Setup and Test Scenarios

3.1. Overhead Crane Testbed and Data Collection Protocol

The proposed camera–LiDAR fusion system was validated on an industrial overhead crane installed in the laboratory. The crane consists of a single-girder bridge travelling along runway beams and a motorized trolley carrying a hoist and hook block. The hoist uses a wire rope whose length can be adjusted by raising or lowering the hook, giving direct control over the hoisting height and the resulting wire rope inclination.

The APC930 QHD camera and the Velodyne VLP-16 LiDAR are rigidly mounted on a common support and oriented towards the hoisting area. The sensor rig is installed underneath the crane runway at an elevated position, providing an oblique view of the trolley–hook region similar to that of a human operator. An overview of the laboratory testbed and sensor mounting configuration is shown in Figure 4. During all experiments, the relative pose of the camera–LiDAR rig remains constant, while the trolley position and hoisting height are varied within the common field of view.

The main quantity of interest in the experiments is the wire rope inclination angle θ with respect to the vertical direction. The trolley–sensor distance and the trolley–hook distance are recorded as auxiliary variables to analyze how the angle estimation behaves at different positions in the workspace, but they are not the primary performance indicators. To define reference values for each configuration, the crane was first adjusted to a nominally vertical rope configuration (θ ≈ 0°), and then the suspended load was manually displaced to approximate target inclination angles. These target values are treated as reference inclination angles θ_{ref} in the subsequent analysis.

Data collection combines static reference configurations and dynamic motion sequences. In the reference sequences, the trolley is kept stationary while the rope is set to a vertical pose and to several representative inclined poses that span the range of angles encountered in operation. In the dynamic sequences, the trolley moves along the bridge while the hook undergoes either natural small-amplitude sway or deliberately excited larger swings. The camera–LiDAR pipeline described in Section 2 runs in real time and produces a time series of inclination estimates θ_{LiDAR}(t_k) together with the corresponding trolley–sensor and trolley–hook distances at each perception cycle t_k. Additional dynamic sequences are captured under reduced illumination by switching off the main ceiling lights and leaving only the fixed safety lighting active. These day and night recordings emulate challenging visibility conditions in real industrial environments and are later used in Section 4.3 to evaluate the robustness of the proposed angle estimator to changes in lighting.

3.2. Dynamic Rope Inclination Evaluation for the Overhead Crane Testbed

This subsection evaluates the behavior of the proposed system when the wire rope is intentionally excited to produce a swinging motion. Instead of analyzing only static poses, the inclination angle θ(t) is tracked while the hook and suspended load oscillate around the vertical. The objective is to verify that the LiDAR-based estimate θ_{LiDAR}(t) follows the observed swing motion in a consistent and physically meaningful way.

In the induced swing experiments, the trolley is placed at a chosen position within the field of view of the camera–LiDAR pair, where the trolley position is defined as the horizontal distance from the sensor unit. The hoisting height, defined as the vertical distance between the trolley center and the hook center, is set to a representative operating level. The operator then excites the suspended load so that the hook swings back and forth with clearly visible rope inclination, while the crane structure itself remains stationary or moves only slowly along the bridge. To cover a range of operating conditions, several swing experiments with different excitation strengths are performed. Based on the observed peak inclination angles, the swing amplitudes are categorized as small (< 10°), medium (10–18°), and large (> 18°). For each experiment, multiple swing cycles are recorded while the perception pipeline runs in real time on synchronized camera and LiDAR streams, as described in Section 2.4. At each perception instant t_k, YOLOv9 detections are used to locate the trolley and the hook in the image plane, the LiDAR points that project inside the corresponding bounding boxes are collected and filtered, and the three-dimensional centers of the trolley and the hook are obtained in the LiDAR frame. The instantaneous inclination angle θ_{LiDAR}(t_k) is then computed from the three-dimensional vector between these centers using Equations (20) and (21) and logged together with the trolley–sensor distance and hoisting height.
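The per-frame selection of LiDAR points by bounding box can be sketched as follows, assuming a standard pinhole camera model. The intrinsic matrix `K`, the extrinsic rotation `R` and translation `t`, the bounding box, and the function names are placeholder assumptions for illustration, not the calibration or code used in this study.

```python
import numpy as np

def project_to_image(points_lidar, K, R, t):
    """Project (N, 3) LiDAR points to pixel coordinates with a pinhole model."""
    pts = np.asarray(points_lidar, dtype=float)
    cam = pts @ R.T + t                      # LiDAR frame -> camera frame
    in_front = cam[:, 2] > 1e-6              # keep only points in front of the camera
    uv = np.full((pts.shape[0], 2), -1.0)    # placeholder pixels, masked by in_front
    proj = cam[in_front] @ K.T
    uv[in_front] = proj[:, :2] / proj[:, 2:3]
    return uv, in_front

def points_in_box(points_lidar, uv, in_front, box):
    """Return the 3D points whose projection falls inside box = (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    inside = (in_front
              & (uv[:, 0] >= x1) & (uv[:, 0] <= x2)
              & (uv[:, 1] >= y1) & (uv[:, 1] <= y2))
    return np.asarray(points_lidar, dtype=float)[inside]

# Placeholder calibration: LiDAR x forward / z up mapped to camera z forward / y down.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.array([[0.0, -1.0, 0.0],
              [0.0, 0.0, -1.0],
              [1.0, 0.0, 0.0]])
t = np.zeros(3)

cloud = np.array([[5.0, 0.0, 0.0],    # straight ahead, projects to the image center
                  [5.0, 1.0, 0.0],    # off to the side, outside the example box
                  [-5.0, 0.0, 0.0]])  # behind the camera, discarded
uv, in_front = project_to_image(cloud, K, R, t)
hook_points = points_in_box(cloud, uv, in_front, box=(600, 300, 700, 400))
print(hook_points)
```

The selected 3D points keep their LiDAR-frame coordinates, so the subsequent center estimation and Equations (17)–(21) can operate directly on them.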

In parallel, a camera-only baseline is computed in the image plane by evaluating, at each frame, the 2D inclination angle θ_{2D}(t_k) of the line segment joining the trolley and hook bounding box centers, as introduced in Section 2.4. Since both θ_{LiDAR}(t_k) and θ_{2D}(t_k) are evaluated on exactly the same frames, their trajectories can be compared directly for each swing experiment. In the results section, representative induced swing experiments are illustrated using fused camera images showing YOLOv9 detections, projected LiDAR points, and overlaid values of θ_{LiDAR} and θ_{2D}; the remaining configurations are summarized in Table 5, which explicitly lists the specific experimental parameters including trolley position and hoisting height for each case. These observations demonstrate that the proposed camera–LiDAR fusion system can track the swinging motion of the hook reliably, with improved robustness to perspective effects compared with the two-dimensional baseline alone.

3.3. Illumination Robustness Across Angle and Distance Variations

In many factories and port terminals, overhead cranes must operate under dim or uneven lighting. To check whether the proposed fusion system still provides meaningful outputs in such conditions, a subset of the induced swing experiments from Section 3.2 was repeated under reduced illumination around the hoisting area. The lighting level near the trolley–hook region was lowered so that the hook block and wire rope appeared significantly darker in the camera images, while the crane geometry and motion patterns were kept unchanged. Camera exposure and gain settings were not re-tuned, emulating a fixed industrial camera, whereas the LiDAR continued to provide dense 3D point clouds independent of illumination.

During these low-light tests, the full perception pipeline (YOLOv9 detection, LiDAR projection and angle computation) ran in real time, and the corresponding camera images, LiDAR point clouds, and fusion outputs were logged. For a small set of representative trolley positions and swing amplitudes, simple reference values for the rope inclination angle and for the distances from the sensor rig to the trolley and to the hook were prepared from independent measurements and the known crane layout. The LiDAR-based inclination and distance estimates were then compared with these references, and the deviations in angle and distance were used as indicators of geometric consistency under reduced illumination.

In the results section, these comparisons are summarized by reporting the absolute deviations between LiDAR-based and reference values for both normal-light and low-light experiments, together with representative fused images showing the corresponding rope inclination and distance readouts. Overall, the data indicate that the proposed camera–LiDAR fusion system maintains a consistent geometric relationship between trolley, hook, and sensor even when the images become significantly darker, supporting its use for overhead crane monitoring and control in low-illumination environments.

4. Results and Discussion

In the following subsections, the accuracy of the inclination estimates is evaluated using both the per-scenario absolute error and the overall mean squared error (MSE). Across the experiments, fifteen representative static configurations were selected for each operating condition, covering the typical range of rope inclinations and trolley sensor distances observed on the laboratory testbed. For these configurations, the MSE defined below is used as a compact indicator that summarizes the overall angle estimation accuracy and enables a direct comparison between the different methods (camera-only baseline and camera–LiDAR fusion under normal and reduced illumination).

For each configuration i = 1, \dots, N, let θ_i^{ref} denote the reference rope inclination angle and \hat{θ}_i the angle estimated by the considered method.

The inclination MSE is defined as:

\mathrm{MSE}_θ = \frac{1}{N} \sum_{i=1}^{N} (θ_i^{ref} - \hat{θ}_i)^2

(22)

In addition to the MSE, the corresponding root mean squared error (RMSE) is defined as:

\mathrm{RMSE}_θ = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (θ_i^{ref} - \hat{θ}_i)^2}

(23)

In the result tables, the columns labelled error (%) report the relative error defined as:

\mathrm{Error}_i\,(\%) = 100 \times \frac{|θ_i^{ref} - \hat{θ}_i|}{|θ_i^{ref}|}

(24)

These definitions provide a unified evaluation framework for the subsequent results and discussion. In each subsection, representative images are first examined qualitatively. This visual assessment helps to clarify the geometric behavior of the system, to verify the correctness of the camera–LiDAR projection model, and to illustrate the inherent limitations of estimating the rope inclination from 2D images alone. After this qualitative evaluation, the corresponding tables and MSE values are presented to quantify the performance gains achieved by the fusion approach under the different experimental conditions.
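The metrics in Equations (22)–(24) can be computed with a short Python sketch. The function name `inclination_metrics` and the three example angle pairs are illustrative assumptions and are not taken from the result tables; returning NaN for a zero reference angle is also an assumption, since Equation (24) is undefined in that case.

```python
import numpy as np

def inclination_metrics(theta_ref_deg, theta_est_deg):
    """MSE (deg^2), RMSE (deg), and per-case relative error (%) for angle estimates."""
    ref = np.asarray(theta_ref_deg, dtype=float)
    est = np.asarray(theta_est_deg, dtype=float)
    err = ref - est

    mse = float(np.mean(err ** 2))   # Eq. (22)
    rmse = float(np.sqrt(mse))       # Eq. (23)

    # Eq. (24); undefined for a zero reference angle, reported here as NaN.
    rel_err = np.full(ref.shape, np.nan)
    nonzero = ref != 0
    rel_err[nonzero] = 100.0 * np.abs(err[nonzero]) / np.abs(ref[nonzero])
    return mse, rmse, rel_err

# Illustrative values only (degrees).
theta_ref = [5.0, 10.0, 20.0]
theta_est = [5.5, 9.0, 21.0]
mse, rmse, rel_err = inclination_metrics(theta_ref, theta_est)
print(f"MSE = {mse:.3f} deg^2, RMSE = {rmse:.3f} deg, relative errors = {rel_err} %")
```

Since the RMSE is the square root of the MSE, the two reported values for each method can be cross-checked against each other directly.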

4.1. Camera Only Baseline for Rope Inclination Estimation

To highlight the limitations of a purely vision-based approach, this subsection considers a camera-only baseline in which the rope inclination is estimated from YOLOv9 detections without depth information. The same trained YOLOv9 model is applied to RGB images to detect the trolley and hook, and the inclination angle is obtained by joining the centers of their bounding boxes and measuring the deviation of this segment from the vertical image axis. Figure 5 shows snapshots from an experiment where the trolley is kept at a fixed distance while the rope is deliberately excited to swing, producing different apparent tilt angles.
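The camera-only baseline described above can be sketched in Python as follows. The box format `(x1, y1, x2, y2)` in pixels, the function names, and the example boxes are assumptions for illustration, as is the unsigned angle convention, which is chosen here to match Equation (21).

```python
import math

def box_center(box):
    """Center (u, v) in pixels of a bounding box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return 0.5 * (x1 + x2), 0.5 * (y1 + y2)

def angle_2d_deg(trolley_box, hook_box):
    """Unsigned angle between the trolley-hook segment and the vertical image axis."""
    u_t, v_t = box_center(trolley_box)
    u_h, v_h = box_center(hook_box)
    return math.degrees(math.atan2(abs(u_h - u_t), abs(v_h - v_t)))

# Illustrative detections (pixels): hook center 50 px to the right and 300 px below.
trolley_box = (600, 100, 680, 160)
hook_box = (670, 400, 710, 460)
print(f"theta_2D = {angle_2d_deg(trolley_box, hook_box):.2f} deg")
```

This estimate depends only on pixel offsets, so a swing component along the camera's viewing direction produces little or no change in the measured angle, which is one source of the depth ambiguity discussed in this subsection.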

As the swing amplitude increases, the reported camera-only angle broadly follows the observed sway of the hook but exhibits noticeable frame-to-frame fluctuations and a bias near turning points or when the rope is close to vertical, mainly due to finite image resolution and perspective distortion. To quantify these effects, several swing configurations are considered while keeping the trolley position fixed. For each configuration, a reference inclination angle is obtained from crane geometry and independent measurements, and the corresponding camera-only estimates are extracted from the images. Table 5 summarizes the absolute errors, indicating that although the 2D baseline can distinguish between nearly vertical and clearly tilted ropes, the errors remain non-negligible and do not show a consistent reduction across configurations.

To summarize the behavior of the camera-only baseline over the 15 test cases in Table 5, the inclination errors are aggregated using the mean squared error (MSE) metric defined in (22), where θ_i^{ref} denotes the manually measured reference angle and \hat{θ}_i is the angle inferred solely from the YOLOv9 bounding box geometry. Substituting the values of Table 5 into (22) and (23) gives MSE_{cam} = 4.472 deg² and RMSE_{cam} = 2.11°, with an average relative error of 8.34%. This relatively large RMSE indicates that the two-dimensional camera-only estimation is quite sensitive to depth ambiguity and perspective distortions in the image plane, so its angle readouts fluctuate noticeably even when the underlying rope configurations are well controlled, motivating the use of camera–LiDAR fusion to obtain more reliable inclination measurements.

4.2. Dynamic Swing Results

The induced swing tests described in Section 3.2 are used to evaluate the dynamic performance of the proposed inclination estimator. For each test, the trolley is positioned at a selected location along the bridge, the hoisting height is kept fixed, and the hook is intentionally excited so that the wire rope oscillates around the vertical. The perception pipeline runs in real time while camera images and LiDAR point clouds are logged, and afterwards the LiDAR-based inclination trajectory θ_{LiDAR}(t) is reconstructed from the recorded data. Figure 6 illustrates a representative swing experiment, where the evolution of the estimated angle reflects the expected back-and-forth motion of the hook and shows no obvious inconsistencies or discontinuities.

For quantitative evaluation, a few characteristic instances are selected in each swing experiment. In these instances, approximate reference angles are obtained from crane geometry and independent measurements, and the corresponding θ_{LiDAR} values are extracted from the logged trajectory. Table 6 summarizes, for all tested configurations, the deviations between the LiDAR-based inclination and the reference angles together with the associated trolley positions, hoisting heights, and swing amplitudes. Although the reference values are only approximate, the deviations reported in Table 6 remain small compared with the overall swing amplitudes, and no systematic trend is observed with respect to trolley–sensor distance or hoisting height. These results indicate that the proposed camera–LiDAR fusion method can track the dynamic swing of the wire rope in a geometrically consistent manner across the different operating conditions. The system’s real-time responsiveness in these operations, including end-to-end latency and time consistency, is analyzed in Section 4.4.

To obtain a single quantitative indicator of estimator performance under dynamic swing conditions, the mean squared error (MSE) is computed from the fifteen test cases in Table 6 using the MSE and RMSE definitions in Equations (22) and (23), where θ_i^{ref} and \hat{θ}_i denote the reference and estimated inclination angles for scenario i. Based on the angular deviations in Table 6, the resulting error metrics are MSE_{fusion} = 0.425 deg² and RMSE_{fusion} = 0.65°, with an average relative error of 2.84%.

For comparison, the camera-only baseline in Section 4.1 yields larger errors, with MSE_{cam} = 4.472 deg² and RMSE_{cam} = 2.11° and an average relative error of 8.34%, due to the absence of depth information under non-stationary swing motion. These results quantitatively confirm that incorporating LiDAR depth improves robustness and reduces estimation error compared with image-only methods.

4.3. Effect of Illumination on Inclination and Distance Estimates

The low-illumination experiments introduced in Section 3.3 are used to assess how sensitive the proposed fusion system is to changes in lighting. For a subset of trolley positions and swing amplitudes, measurements are repeated under normal indoor lighting and under reduced illumination around the hoisting area, while the perception pipeline and sensor configuration are kept identical. In the reduced illumination setting, the illuminance at the hoisting area is approximately 203 lux, providing a quantitative definition of the low-light condition and reflecting practical crane environments such as partially shaded indoor bays or nighttime operation. For each tested configuration, the LiDAR-based inclination angle and the sensor–trolley and sensor–hook distances are compared with the corresponding reference values obtained from the crane geometry and independent measurements. The absolute deviations for all configurations are summarized in Table 7, grouped by lighting condition, and Figure 7 shows representative fused images for the matched normal-light and low-light cases together with the associated numerical readouts.

Across the configurations considered, the inclination deviations under low illumination remain of the same order of magnitude as those obtained under normal lighting, and no systematic increase in error is observed. The distance deviations for the trolley and the hook show similar behavior: variations between normal-light and low-light runs are small compared with the absolute distances and do not follow a consistent degrading trend. These results indicate that once the objects have been correctly detected by YOLOv9, the LiDAR-based geometric quantities produced by the camera–LiDAR fusion pipeline are largely insensitive to the brightness of the camera images. This behavior is expected because the inclination and distance estimates are dominated by LiDAR depth measurements, which are inherently insensitive to illumination variations, while the camera is mainly used for object localization. In practice, this suggests that the proposed system can be deployed in overhead crane applications where lighting conditions vary significantly without requiring continuous retuning of the perception module. In particular, Figure 7 shows the corresponding results under reduced illumination; the similarity between the numerical outputs in both cases further supports the robustness of the system against lighting variations.

Using the same definition of the mean squared error (MSE) as in (22), the angle estimation accuracy of the fusion system under reduced illumination is quantified from the fifteen cases in Table 7. Let θ_i^{ref} and \hat{θ}_i denote the reference and estimated rope angles for case i; the MSE is then obtained by averaging the squared angular deviations (θ_i^{ref} - \hat{θ}_i)^2 over all cases. Applying this metric to the data in Table 7 yields MSE_{LL} = 0.431 deg² and RMSE_{LL} = 0.66°, with an average relative error of 4.20%. Table 8 provides a direct comparison of the MSE, RMSE, and average error under normal and reduced illumination.

As shown in Table 8, although the illuminance at the hoisting area dropped to approximately 203 lux, the increase in estimation error was negligible (RMSE increased by only 0.01°, from 0.65° to 0.66°). This comparison indicates that the system’s geometric accuracy does not depend strongly on ambient light intensity over the tested range, supporting the robustness of the LiDAR-driven fusion architecture against illumination variations.

4.4. Computational Performance and Timing Analysis

To assess the feasibility of the proposed system for real-time monitoring, the runtime performance was evaluated on a desktop computer equipped with an NVIDIA GPU. The end-to-end processing latency is defined as the time interval from the reception of raw sensor frames to the output of the calculated inclination angle. To ensure statistical reliability, the execution time was measured over 1050 consecutive frames. Table 9 summarizes the timing performance breakdown.

The results indicate that the YOLOv9 inference stage consumes the largest portion of the computational budget (≈124 ms). The total average processing time is 145.23 ms, corresponding to an operating frequency of approximately 6.89 Hz. The timing jitter, quantified by the standard deviation, is 43.30 ms, which is attributed to the non-real-time scheduling of the host operating system and variable GPU inference loads.

Regarding the system’s feasibility for industrial application, a comparison with typical control requirements is necessary. While high-speed autonomous driving applications typically require 30 Hz or more, the dynamics of overhead cranes are characterized by low-frequency oscillations. The natural period of a suspended load swing typically ranges from 3 to 10 s (0.1–0.33 Hz). The Nyquist–Shannon sampling theorem requires a sampling rate of more than twice the highest signal frequency, that is, above roughly 0.67 Hz; the achieved rate of 6.89 Hz is more than 20 times the upper swing frequency of 0.33 Hz. This provides sufficient temporal resolution to reconstruct the swing trajectory accurately and to support anti-sway control loops without aliasing. Furthermore, the system latency of ≈145 ms is comparable to the typical reaction time of a human crane operator, supporting the system’s applicability for operator-assist and automated monitoring tasks.
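The sampling argument can be checked with a short calculation, sketched here in Python. The function name `sampling_margin` and its dictionary keys are illustrative; the sketch assumes that frames are processed back to back, so that the update rate equals the reciprocal of the mean processing time, which is what the reported 6.89 Hz implies.

```python
def sampling_margin(mean_latency_s, shortest_period_s):
    """Update rate and its margin over the fastest expected swing."""
    rate_hz = 1.0 / mean_latency_s          # back-to-back processing assumed
    f_max_hz = 1.0 / shortest_period_s      # highest swing frequency
    return {
        "rate_hz": rate_hz,
        "nyquist_min_rate_hz": 2.0 * f_max_hz,
        "samples_per_fastest_period": rate_hz * shortest_period_s,
    }

# Values from the timing analysis: 145.23 ms mean latency, 3 s shortest swing period.
margin = sampling_margin(mean_latency_s=0.14523, shortest_period_s=3.0)
print(f"rate = {margin['rate_hz']:.2f} Hz, "
      f"Nyquist minimum = {margin['nyquist_min_rate_hz']:.2f} Hz, "
      f"samples per 3 s swing = {margin['samples_per_fastest_period']:.1f}")
```

The margin is computed from the mean latency only; the reported 43.30 ms jitter means that individual sample intervals vary around this value.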

5. Discussion

5.1. Advantages

The proposed fusion framework demonstrates a marked advantage in accuracy, achieving a consistent RMSE of approximately 0.65° across diverse operating scenarios. Relative to the camera-only baseline (RMSE ≈ 2.11°), the fusion approach mitigates depth ambiguity by grounding the estimation in metric LiDAR data. A key strength of the system is its robustness to illumination changes: as evidenced by the low-light experiments, the estimation accuracy remains stable even when ambient light drops to ~203 lux, a critical factor for 24/7 port operations. Furthermore, the system achieves an operating frequency of 6.89 Hz with a latency of ~145 ms, which satisfies the Nyquist criterion for monitoring typical crane sway dynamics (0.1–0.33 Hz), thereby laying the groundwork for real-time automated safety monitoring.

5.2. Challenges

Although the system excels in accuracy and lighting robustness, several practical challenges remain for large-scale deployment. First, the current experiments were conducted in a controlled indoor environment; the impact of harsh outdoor maritime weather such as heavy rain, dense fog, or snow on LiDAR point cloud quality has not yet been quantified. Second, overhead cranes are subject to continuous mechanical vibration and shock loads during container handling, which may cause the extrinsic calibration between the camera and LiDAR to drift over time, potentially requiring frequent recalibration. Third, while the current inference speed is sufficient for monitoring, the reliance on a high-performance GPU workstation increases hardware costs and power consumption, which may be a constraint for retrofitting older cranes with limited power infrastructure. Finally, the validation was performed on a single laboratory testbed; generalization to different crane types with varying hoisting heights and structural geometries requires further validation prior to broad commercial deployment.

5.3. Future Perspectives

To overcome these limitations and broaden practical applicability, future work can proceed in five directions:

Enhance environmental robustness: Investigate advanced point cloud filtering algorithms or multimodal fusion to maintain reliable detection under severe weather conditions like heavy rain or fog.
Long-term stability and self-calibration: Develop online or targetless self-calibration routines that can detect and compensate for extrinsic parameter drift caused by mechanical vibrations, ensuring maintenance-free long-term operation.
Deployment and lightweight optimization: Apply model pruning, quantization and knowledge distillation to optimize the YOLOv9 model for deployment on resource-constrained edge devices (such as NVIDIA Jetson), reducing hardware costs and energy consumption.
Closed-loop control integration: Directly stream the real-time inclination estimates to the crane’s PLC (Programmable Logic Controller) to validate the system’s effectiveness in active anti-sway control loops and automated positioning tasks in real-world operational cycles.
High-precision ground truth validation: Future experiments will incorporate industrial-grade inertial measurement units (IMUs) and encoder-based motion references to provide absolute ground truth, allowing for a more rigorous quantification of dynamic tracking errors and time alignment.

6. Conclusions

A YOLOv9-based camera–LiDAR fusion framework for real-time estimation of wire rope inclination in overhead cranes has been presented. The approach combines a monocular industrial camera, which provides semantic localization of the trolley and hook, with a 3D LiDAR sensor that supplies metrically accurate depth information. After extrinsic calibration, LiDAR points are projected into the image plane, filtered using YOLOv9 bounding boxes, and used to compute 3D centers and the trolley–hook vector directly in the LiDAR frame. This design grounds the inclination estimates in 3D geometry rather than purely in 2D image features.

Static and dynamic experiments on a laboratory overhead crane testbed indicate that the proposed estimator tracks the swing motion of the hook in a geometrically consistent manner across different inclination angles, trolley–sensor distances, and lighting conditions (daytime and nighttime). Compared with a camera-only baseline, the fusion scheme significantly reduces the inclination error and achieves angle RMSE values below one degree (approximately 0.65°), while maintaining real-time operation.

From an application perspective, these results suggest that the method can support collision avoidance, safety zone monitoring, and future closed-loop sway control functions in automated ship-to-shore crane systems, where robustness to perspective effects, partial occlusions, and illumination changes is essential. While PID-based crane automation is commonly designed around accurate inclination or angle sensors, such sensors can be difficult or impractical to install and maintain on large-scale cranes operating in harsh industrial environments. In this context, the proposed camera–LiDAR fusion framework provides a noncontact and infrastructure-light alternative for real-time inclination feedback, which can complement or replace dedicated angle sensors in closed-loop sway control architectures.

Overall, the work provides a principled and experimentally validated approach to wire rope inclination estimation using YOLOv9-based camera–LiDAR fusion, bridging the gap between generic fusion techniques developed for road traffic scenarios and the specific perception needs of overhead and ship-to-shore crane automation in seaport and maritime logistics environments.

Author Contributions

Conceptualization, A.-H.P. and S.-J.L.; methodology, A.-H.P. and B.-S.G.; validation, S.-J.L. and B.-S.G.; resources, S.-J.L.; data curation, A.-H.P.; G.-E.J. and X.-K.M.; writing—original draft preparation, A.-H.P.; writing—review and editing, S.-J.L.; visualization, B.-S.G.; supervision, S.-J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This study was conducted as part of the Glocal University Project, supported by the RISE (Regional Innovation System & Education) program funded by the Ministry of Education. This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2025-25396743).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AI	artificial intelligence
MSE	mean squared error
RMSE	root mean squared error
ROI	region of interest
LiDAR	Light Detection and Ranging
RGB	red–green–blue
YOLO	You Only Look Once

References

Jobran, Y.; Kara, G. Examining the Efficiency of Automation in Container Terminals. J. Transp. Logist. 2022, 7, 137–155.
Tsagkaris, P.; Moschovou, T.P. The Impact of Automation on the Efficiency of Port Container Terminals. Future Transp. 2025, 5, 155.
AlKheder, S.; Naif, D.; Musaed, D.; Al Shrekah, S.; Al Rshaid, M.; Al. Anzi, N.; Baqer, I. Maritime Transport Management in Kuwait toward an Automated Port Logistical City. Clean. Logist. Supply Chain 2022, 3, 100031.
Gutierrez, R.; Magallon, M.; Hernandez, D.C. Vision-Based System for 3d Tower Crane Monitoring. IEEE Sens. J. 2021, 21, 11935–11945.
Gan, G.Y.; Lee, H.S.; Tao, Y.J.; Tu, C.S. Selecting Suitable, Green Port Crane Equipment for International Commercial Ports. Sustainability 2021, 13, 6801.
Pallis, P.L. Port Risk Management in Container Terminals. Transp. Res. Procedia 2017, 25, 4411–4421.
Raubar, E.; Vrančić, D. Anti-Sway System for Ship-to-Shore Cranes. Stroj. Vestn./J. Mech. Eng. 2012, 58, 338–344.
Wang, H.; Yang, Q.; Liu, Q.; Zhao, C.; Zhou, W.; Zhang, H.; Liu, J. Automated Productivity Analysis of Cable Crane Transportation Using Deep Learning-Based Multi-Object Tracking. Autom. Constr. 2024, 166, 105644.
Munro, D.M.; Govers, M.E.; Oliver, M.L. Physical demands of overhead crane operation. Int. J. Ind. Ergon. 2021, 86, 103200.
Jakovlev, S.; Eglynas, T.; Voznak, M. Application of neural network predictive control methods to solve the shipping container sway control problem in quay cranes. IEEE Access 2021, 9, 78253–78265.
Cahyono, R.T.; Kenaka, S.P.; Jayawardhana, B. Simultaneous allocation and scheduling of quay cranes, yard cranes, and trucks in dynamical integrated container terminal operations. IEEE Trans. Intell. Transp. Syst. 2021, 23, 8564–8578.
Zhong, B.; Ma, L.; Dong, H. Dynamic Analysis and Structure Parameter Research on a Hydraulic Anti-Swaying System for Container Cranes. Appl. Sci. 2022, 12, 12537.
Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073.
Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
Yaseen, M. What Is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2409.07813.
Viswanatha, V.; Chandana, R.K.; Ramachandra, A.C. Real time object detection system with YOLO and CNN models: A review. arXiv 2022, arXiv:2208.00773.
Shao, Y.; Sun, Z.; Tan, A.; Yan, T. Efficient three-dimensional point cloud object detection based on improved Complex-YOLO. Front. Neurorobot. 2023, 17, 1092564.
Liang, M.; Yang, B.; Wang, S.; Urtasun, R. Deep Continuous Fusion for Multi-Sensor 3D Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV) 2018; Springer: Cham, Switzerland, 2018; pp. 641–656.
Simon, M.; Milz, S.; Amende, K.; Gross, H.-M. Complex-YOLO: Real-Time 3D Object Detection on Point Clouds. arXiv 2018, arXiv:1803.06199.
Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-yolo: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444.
Vora, S.; Lang, A.H.; Helou, B.; Beijbom, O. PointPainting: Sequential Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2020; IEEE: New York, NY, USA, 2020; pp. 4604–4612.
Yildiz, A.S.; Meng, H.; Swash, M.R. Real-Time Object Detection and Distance Measurement Enhanced with Semantic 3D Depth Sensing Using Camera-LiDAR Fusion. Appl. Sci. 2025, 15, 5543.
Liu, H.; Wu, C.; Wang, H. Real Time Object Detection Using LiDAR and Camera Fusion for Autonomous Driving. Sci. Rep. 2023, 13, 8056.
Benayed, W.; Mabrouk, I.; Masmoudi, M.S.; Abdelaziz, W.B. LiDAR 2D and Camera Fusion for ADAS: A Practical Approach With YOLOv9 and ROS2 Framework. IEEE Access 2025, 13, 155500–155519.
Tsai, D.; Worrall, S.; Shan, M.; Lohr, A.; Nebot, E. Optimising the Selection of Samples for Robust Lidar Camera Calibration. In 2021 IEEE International Intelligent Transportation Systems Conference (ITSC) 2021; IEEE: New York, NY, USA, 2021; pp. 2631–2638. [Google Scholar]
Li, L.; Li, H.; Liu, X.; He, D.; Miao, Z.; Kong, F.; Li, R.; Liu, Z.; Zhang, F. Joint Intrinsic and Extrinsic LiDAR-Camera Calibration in Targetless Environments Using Plane-Constrained Bundle Adjustment. arXiv 2023, arXiv:2308.12629. [Google Scholar]
Huang, J.K.; Grizzle, J.W. Improvements to Target-Based 3D LiDAR to Camera Calibration. IEEE Access 2020, 8, 134101–134110. [Google Scholar] [CrossRef]
Zeng, T.; Gu, X.; Yan, F.; He, M.; He, D. YOCO: You Only Calibrate Once for Accurate Extrinsic Parameter in LiDAR-Camera Systems. Meas. Sci. Technol. 2025, 36, 075009. [Google Scholar] [CrossRef]
Zhang, H.; Li, S.; Zhu, X.; Chen, H.; Yao, W. 3D LiDAR and monocular camera calibration: A Review. IEEE Sens. J. 2025, 25, 10530–10555. [Google Scholar] [CrossRef]
Peterle, A.C.F.; Araujo, R.J.T.; Silva, P.I.M. Three-Dimensional Collision Avoidance System Applied to Ship-Loaders Based on GNSS and Point Cloud Mod-eling: A Case Study. Congr. Bras. Autom.-CBA 2024, 4, CBA2024. [Google Scholar]
Klar, R.; Fredriksson, A.; Angelakis, V. Digital Twins for Ports: Derived from Smart City and Supply Chain Twinning Experience. IEEE Access 2023, 11, 71777–71799. [Google Scholar] [CrossRef]

Figure 1. Sensor configuration used in the overhead crane laboratory: (a) APC930 QHD webcam; (b) Velodyne VLP-16 LiDAR sensor.

Figure 2. Schematic illustration of the camera–LiDAR calibration setup.

Figure 3. Real-time camera–LiDAR fusion pipeline for wire rope inclination estimation.

Figure 4. Laboratory overhead crane testbed with the rigidly mounted camera–LiDAR fusion unit.

Figure 5. Camera-only inclination estimation during intentional rope swing using YOLOv9 bounding boxes.

Figure 6. Estimation results of the camera–LiDAR fusion system with trolley distance sensors.

Figure 7. Fusion results under reduced illumination.

Table 1. Qualitative comparison of existing crane monitoring approaches versus the proposed framework.

Approach	Installation and Maintenance	Environmental Robustness	3D Metric Accuracy	Real-Time Suitability
Traditional Sensors (Encoders/IMUs)	High (installation burden)	Medium	High	High
Monocular Vision (Camera Only)	Low complexity	Low (lighting-sensitive)	Low (depth ambiguity)	High
Generic Fusion (Autonomous Driving)	Medium	High	High	Medium
Proposed Framework (Crane-Specific)	Low/Medium	High	High	High

Table 2. Specifications of the APC930 QHD camera and Velodyne VLP-16 LiDAR sensors.

Sensor	Parameter	Value/Setting
APC930 QHD camera	Image format	RGB color video
	Resolution	$2592 \times 1944$ pixels (QHD)
	Horizontal field of view	$\approx 80^\circ$
	Frame rate	30 fps
Velodyne VLP-16 LiDAR	Number of laser channels	16 channels
	Measurement range	Up to 100 m
	Horizontal field of view	$360^\circ$
	Vertical field of view	$30^\circ$ (from $-15^\circ$ to $+15^\circ$)
	Rotation rate	5–20 Hz
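
The LiDAR figures in Table 2 imply a vertical sampling density and a sweep period that bound how many returns can land on the trolley and hook. The sketch below derives both from the table values. It assumes the 16 channels are spread uniformly across the vertical field of view; the function names are illustrative and do not come from the paper.

```python
# Values from Table 2. Uniform channel spacing is an assumption of this sketch.
NUM_CHANNELS = 16
V_FOV_MIN_DEG = -15.0
V_FOV_MAX_DEG = 15.0


def channel_elevations(num_channels, fov_min_deg, fov_max_deg):
    """Return the elevation angle of each laser channel in degrees."""
    step = (fov_max_deg - fov_min_deg) / (num_channels - 1)
    return [fov_min_deg + i * step for i in range(num_channels)]


def sweep_period_ms(rotation_hz):
    """Time for one full 360-degree sweep in milliseconds."""
    return 1000.0 / rotation_hz


elevations = channel_elevations(NUM_CHANNELS, V_FOV_MIN_DEG, V_FOV_MAX_DEG)
vertical_step_deg = elevations[1] - elevations[0]

# Rotation rate range of 5 to 20 Hz from Table 2.
slowest_sweep_ms = sweep_period_ms(5.0)
fastest_sweep_ms = sweep_period_ms(20.0)

print(f"vertical spacing: {vertical_step_deg:.1f} deg")
print(f"sweep period: {fastest_sweep_ms:.0f} to {slowest_sweep_ms:.0f} ms")
```

Under this assumption adjacent scan lines are 2° apart, so a small target such as the hook is crossed by only a few scan lines at longer range.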

Table 3. Camera acquisition and processing settings.

Item	Value/Description
Native calibration resolution	$3260 \times 1832$ pixels
Processing resolution	$1280 \times 720$ pixels
YOLOv9 input	RGB frames at $640 \times 640$
Intrinsic matrix	$K$ obtained by scaling $K_{\mathrm{full}}$ with resize factors
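
The last row of Table 3 can be illustrated with a short sketch of how a pinhole intrinsic matrix is rescaled when the image is resized from the calibration resolution to the processing resolution. The numeric entries of `k_full` below are hypothetical placeholders, not the calibrated values from the paper, and the sketch assumes zero skew and ignores sub-pixel centre conventions.

```python
def scale_intrinsics(k_full, full_size, target_size):
    """Scale a 3x3 pinhole intrinsic matrix for a resized image.

    full_size and target_size are (width, height) in pixels.
    Assumes zero skew.
    """
    sx = target_size[0] / full_size[0]  # horizontal resize factor
    sy = target_size[1] / full_size[1]  # vertical resize factor
    fx, fy = k_full[0][0], k_full[1][1]
    cx, cy = k_full[0][2], k_full[1][2]
    return [
        [fx * sx, 0.0, cx * sx],
        [0.0, fy * sy, cy * sy],
        [0.0, 0.0, 1.0],
    ]


# Hypothetical full-resolution intrinsics (placeholder values only).
k_full = [
    [2000.0, 0.0, 1630.0],
    [0.0, 2000.0, 916.0],
    [0.0, 0.0, 1.0],
]

# Resolutions from Table 3.
k = scale_intrinsics(k_full, full_size=(3260, 1832), target_size=(1280, 720))
print(k)
```

Because the two resolutions in Table 3 have slightly different aspect ratios, the horizontal and vertical factors are not identical, which is why the focal lengths are scaled per axis rather than by a single number.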

Table 4. Summary of dataset details, hardware specifications and training hyperparameters.

Category	Parameter	Value/Description
Dataset	Total images	1000 images (Annotated via Roboflow)
Dataset	Data split	Training 70%; validation 20%; test 10%
Hardware	CPU	Intel(R) Xeon(R) W-2295 @ 3.00 GHz
	RAM	512 GB
	GPU	NVIDIA Quadro P2200 (5 GB VRAM)
Training Configuration	Epochs/Batch size/Optimizer	100/16/SGD
	Strategy	Transfer learning (Fine-tuned from pretrained weights)
	Photometric augmentation	HSV gains (H: 0.015, S: 0.7, V: 0.4)
Inference Settings	Confidence threshold	0.5 (default)
Inference Settings	IoU threshold	0.5 (default)
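
As a minimal sketch of the 70/20/10 split listed in Table 4, the snippet below shuffles 1000 image identifiers with a fixed seed and partitions them. The identifier format, the seed, and the helper name `split_dataset` are assumptions for illustration; the paper does not state how the split was drawn.

```python
import random


def split_dataset(items, train_frac=0.7, val_frac=0.2, seed=0):
    """Shuffle items reproducibly and split into train/val/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder, 10% here
    return train, val, test


# Hypothetical image identifiers standing in for the 1000 annotated images.
image_ids = [f"img_{i:04d}" for i in range(1000)]
train_ids, val_ids, test_ids = split_dataset(image_ids)
print(len(train_ids), len(val_ids), len(test_ids))
```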

Table 5. Camera-only baseline results: estimated rope inclination angles and absolute percentage errors.

Case	Trolley Position (m)	Hoisting Height (m)	Swing Amplitude	Estimated Angle (°)	Real Angle (°)	Error (%)
1	7.90	1.30	Large	20.84	23.63	12.87
2	7.90	1.30	Large	20.84	24.62	15.35
3	7.90	1.50	Large	30.83	32.12	4.02
4	7.90	1.50	Medium	13.48	14.43	6.58
5	7.29	1.50	Large	26.05	26.57	1.96
6	7.29	1.50	Large	25.09	26.05	3.69
7	7.29	2.00	Large	21.07	19.02	10.78
8	7.29	2.00	Large	22.24	26.12	14.85
9	7.29	2.00	Large	32.14	34.27	6.21
10	6.70	1.50	Large	20.00	22.52	11.19
11	6.70	1.50	Large	26.41	27.36	3.47
12	6.70	1.50	Medium	16.02	15.06	6.37
13	6.52	2.50	Small	10.75	9.43	12.24
14	6.52	2.50	Medium	14.09	15.12	6.81
15	6.52	2.50	Large	22.26	24.37	8.66
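
The Error (%) column appears to be the absolute difference between the estimated and reference angles, expressed relative to the reference angle. The sketch below states that formula explicitly and applies it to cases 2 and 3 of Table 5. This reading is an assumption inferred from the tabulated values, not a definition quoted from the paper.

```python
def percentage_error(estimated_deg, reference_deg):
    """Absolute error relative to the reference angle, in percent."""
    return abs(estimated_deg - reference_deg) / abs(reference_deg) * 100.0


# Cases 2 and 3 from Table 5: (estimated angle, real angle) in degrees.
case2_error = percentage_error(20.84, 24.62)
case3_error = percentage_error(30.83, 32.12)

print(f"case 2: {case2_error:.2f} %")
print(f"case 3: {case3_error:.2f} %")
```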

Table 6. Estimated and reference rope inclination angles and corresponding errors for the dynamic swing test scenarios.

Case	Trolley Position (m)	Hoisting Height (m)	Swing Amplitude	Estimated Angle (°)	Real Angle (°)	Error (%)
1	7.90	1.30	Large	28.50	27.72	2.81
2	7.90	1.30	Large	27.16	26.78	1.41
3	7.90	1.50	Large	23.70	25.60	7.42
4	7.90	1.50	Medium	16.39	16.70	1.85
5	7.29	1.50	Medium	16.55	17.02	2.76
6	7.29	1.50	Medium	10.78	10.24	5.27
7	7.29	2.00	Medium	16.77	16.71	0.35
8	7.29	2.00	Medium	17.36	17.28	0.46
9	7.29	2.00	Medium	17.85	17.64	1.19
10	6.70	1.50	Large	18.02	18.13	0.61
11	6.70	1.50	Large	19.68	20.41	3.58
12	6.70	1.50	Small	10.29	9.78	5.21
13	6.52	2.50	Small	5.92	5.87	0.85
14	6.52	2.50	Small	7.69	7.31	5.20
15	6.52	2.50	Medium	16.78	17.41	3.62

Table 7. Angle estimation accuracy of the camera–LiDAR fusion under reduced illumination.

Case	Trolley Position (m)	Hoisting Height (m)	Swing Amplitude	Estimated Angle (°)	Real Angle (°)	Error (%)
1	7.70	1.70	Medium	12.67	12.21	3.75
2	7.70	1.70	Medium	16.77	16.48	1.76
3	7.70	1.70	Small	5.92	5.46	7.73
4	7.70	1.70	Medium	16.78	16.35	2.56
5	7.30	2.00	Large	19.02	18.63	2.09
6	7.30	2.00	Large	19.35	20.61	4.51
7	7.30	2.00	Medium	10.29	9.70	6.08
8	7.30	2.50	Small	5.72	5.87	2.56
9	7.30	2.50	Small	7.69	7.25	6.07
10	6.60	1.00	Medium	16.78	17.40	3.56
11	6.60	1.00	Large	17.68	18.31	3.44
12	6.60	1.50	Small	8.29	8.68	4.49
13	6.60	1.50	Small	5.92	5.77	2.60
14	6.60	2.50	Small	7.96	7.62	4.46
15	6.60	2.50	Medium	14.67	13.67	7.32

Table 8. Comparison of angle estimation accuracy of the fusion method under different illumination conditions.

Illumination Condition	MSE (°²)	RMSE (°)	Average Error (%)
Normal illumination	0.425	0.65	2.84
Reduced illumination	0.431	0.66	4.20
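
For reference, the summary metrics can be recomputed from the per-case values. The sketch below uses the fifteen estimated and real angles of Table 6, together with its reported Error (%) column, and applies the standard MSE and RMSE definitions. It assumes that the normal-illumination row of Table 8 summarizes the Table 6 cases.

```python
import math

# (estimated angle, real angle) in degrees, cases 1 to 15 of Table 6.
TABLE6_ANGLES = [
    (28.50, 27.72), (27.16, 26.78), (23.70, 25.60), (16.39, 16.70),
    (16.55, 17.02), (10.78, 10.24), (16.77, 16.71), (17.36, 17.28),
    (17.85, 17.64), (18.02, 18.13), (19.68, 20.41), (10.29, 9.78),
    (5.92, 5.87), (7.69, 7.31), (16.78, 17.41),
]

# Error (%) column of Table 6, as reported.
TABLE6_ERROR_PCT = [
    2.81, 1.41, 7.42, 1.85, 2.76, 5.27, 0.35, 0.46,
    1.19, 0.61, 3.58, 5.21, 0.85, 5.20, 3.62,
]


def mean_squared_error(pairs):
    """Mean of squared differences between estimate and reference."""
    return sum((est - ref) ** 2 for est, ref in pairs) / len(pairs)


mse = mean_squared_error(TABLE6_ANGLES)
rmse = math.sqrt(mse)
average_error_pct = sum(TABLE6_ERROR_PCT) / len(TABLE6_ERROR_PCT)

print(f"MSE  = {mse:.3f} deg^2")
print(f"RMSE = {rmse:.2f} deg")
print(f"Average error = {average_error_pct:.2f} %")
```

Under that assumption the three values agree with the normal-illumination row to the precision shown in Table 8.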

Table 9. Breakdown of average processing time and timing stability over 1050 frames.

Pipeline Stage	Mean Time (ms)	Standard Deviation (ms)
Image Preprocessing	11.81	1.94
YOLOv9 Inference	124.13	43.00
LiDAR Projection and Algorithm	9.29	7.19
Total End-to-End Latency	145.23	43.30
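
The mean stage times in Table 9 can be turned into an approximate throughput and a per-stage share of the latency budget. The sketch below does this arithmetic. It assumes the stages run sequentially on each frame, so that the mean frame rate is the reciprocal of the mean end-to-end latency.

```python
# Mean stage times in milliseconds, from Table 9.
STAGE_MEAN_MS = {
    "Image Preprocessing": 11.81,
    "YOLOv9 Inference": 124.13,
    "LiDAR Projection and Algorithm": 9.29,
}

total_ms = sum(STAGE_MEAN_MS.values())

# Assumes sequential stages, so throughput is 1 / latency.
frames_per_second = 1000.0 / total_ms

# Fraction of the latency budget consumed by each stage.
stage_share = {name: ms / total_ms for name, ms in STAGE_MEAN_MS.items()}

print(f"total latency: {total_ms:.2f} ms")
print(f"throughput: {frames_per_second:.1f} frames per second")
for name, share in stage_share.items():
    print(f"{name}: {share:.1%}")
```

The stage means add up to the reported total, and detector inference accounts for most of the budget, so it is the first place to look when trying to raise the frame rate on this hardware.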
