1. Introduction
Optical Camera Communication (OCC) is an important branch of Optical Wireless Communication (OWC) [1,2]. It employs Light-Emitting Diode (LED) arrays as transmitters and cameras as receivers. Compared with radio-frequency communication, OCC offers abundant spectrum resources, immunity to electromagnetic interference, and the ability to provide illumination simultaneously. In recent years, OCC has attracted wide attention in indoor positioning [3], vehicular networks [4], and Internet of Things sensing [5,6]. In 2018, the IEEE 802.15.7-2018 standard incorporated OCC into the short-range optical wireless communication specification [7]. Visible Light Communication (VLC) systems use photodiodes as receivers and demodulate continuous photoelectric signals directly. In contrast, OCC receivers must first localize the LED elements in two-dimensional images and then recover data through color or brightness identification [8]. For OCC systems based on LED arrays, the element density is high and each LED element occupies a small image area. Existing research indicates that spatial localization error is a significant factor affecting OCC system reliability [9], and the limited spatial resolution is considered one of the primary challenges [10].
In practical deployment, it is difficult to maintain strict perpendicular alignment between the camera optical axis and the LED array plane. Two common types of geometric distortion pose serious threats to localization accuracy. The first is perspective tilt around the vertical axis (Yaw), which causes the originally rectangular LED panel to exhibit trapezoidal distortion in the image, where the far end appears smaller. Under this distortion, the row and column spacing of LED elements is no longer uniform, and the imaging size and brightness of elements at the far end are noticeably smaller than those at the near end [11]. The second is in-plane rotation of the LED panel (Roll), which introduces an angle between the LED row direction and the horizontal reference line of the image [12]. These two distortions often occur simultaneously. The spatial distribution of LED elements is then jointly affected by perspective distortion and rotational transformation, making it impossible for a fixed uniform grid to align with the actual element positions. Consequently, conventional schemes based on fixed-grid sampling produce systematic center deviations, leading to large-area color misclassification and, in severe cases, complete failure of communication synchronization [13].
To address the two major challenges of LED array localization and geometric correction, researchers have explored the problem from multiple perspectives. Early methods relied on techniques such as grayscale thresholding and circularity filtering to extract LED contours. These methods are effective under normal alignment conditions, but when distortion increases or background illumination becomes complex, the spot morphology undergoes severe deformation, and the accuracy of contour extraction degrades significantly [14]. With breakthroughs in deep learning for object detection, models such as YOLO and SSD have been introduced to replace manual region-of-interest (ROI) selection [15,16]. Such detectors output axis-aligned horizontal bounding boxes (HBB). When the LED panel undergoes in-plane rotation, the upper and lower edges of the horizontal box will include stray pixels from the desktop or walls [17]. These additional background pixels generate dense noise in the subsequent thresholding and contour detection steps, severely reducing the signal-to-noise ratio of valid LED elements. Oriented bounding box (OBB) detectors instead predict quadrilaterals with rotation angles, whose four edges tightly enclose the physical boundaries of the target. Therefore, OBB holds a principled advantage in scenarios involving both rotation and perspective distortion [18]. However, research on OCC reception systems that deeply integrate OBB detection with geometric correction and real-time decoding remains scarce.
In the geometric correction stage, the fundamental tool for removing perspective distortion is the planar homography transformation [19]. Traditional approaches compute the perspective matrix in a single step using the four corners of the detection box or the LED panel boundary, mapping the distorted quadrilateral to a standard rectangle [20,21]. This single-stage strategy performs reasonably well under small distortions. When the tilt angle increases or multiple distortions superimpose, however, the coarse corner information struggles to achieve sufficient alignment accuracy, and the positional deviation continuously degrades color recognition accuracy [22]. K-Means clustering, as a lightweight unsupervised learning method, has demonstrated promising real-time potential in LED array topology construction [23,24]. Nevertheless, if the entire pipeline from LED detection to cluster-based localization is executed in every frame, the cumulative time severely limits the maximum achievable communication frame rate [25]. Some studies have attempted to reduce the computational load by tracking LED positions using template matching or Kalman filtering [26,27]. These methods require independent modeling for each LED element, leading to high system complexity and tracking drift under severe distortion.
Various strategies have been explored to enhance the geometric robustness of OCC systems. In 3D modeling, the work of Liu et al. [28] remained confined to simulations, while that of Palitharathna et al. [29] relied on hardware-intensive liquid lenses. Other receiver-side detection or tracking schemes (Sitanggang et al. [30]; Shin and Jang [31]; Song et al. [32]) typically overlook composite perspective and in-plane rotation distortions. Meanwhile, transmitter-side pre-correction (Zhang et al. [33]) requires signal modifications, limiting its generalizability across unmodified LEDs. Consequently, a practical, software-only receiver architecture that actively rectifies composite Yaw-Roll distortions is still lacking. To address this gap, we propose a robust software-defined OCC receiver integrating YOLOv8n-OBB detection, cascaded two-stage homography correction, and K-Means grid locking. The main contributions of the system are as follows:
First, a YOLOv8n-OBB rotated object detector is introduced to extract a clean ROI that contains almost no background pixels in composite distortion scenarios. Experiments show that under extreme conditions of 45° tilt combined with in-plane rotation, the OBB scheme operates stably, whereas the horizontal-box scheme fails in detection due to excessive background noise.
Second, a cascaded framework of coarse-to-fine correction is designed. The two homography matrices are multiplied to obtain a single combined transformation, so that the accuracy of the two-stage correction is achieved with a single mapping. After correction, the LED grid achieves sub-pixel alignment.
Third, K-Means clustering is adopted to construct a 16 × 16 LED topological grid, and a locking mechanism is introduced. After locking, subsequent frames skip grayscale segmentation and repeated clustering, which reduces the per-frame processing latency from approximately 150 ms to 78.9 ms on the CPU-only test platform.
Fourth, the proposed scheme is benchmarked against mainstream deep-learning-based OCC architectures under identical channel conditions and compared quantitatively with recent state-of-the-art (SOTA) works, validating its robustness under composite distortion.
The rest of this paper is organized as follows:
Section 2 describes the transmitter frame structure and the receiver processing pipeline.
Section 3 presents the core algorithm design.
Section 4 reports the experimental setup and comparative result analysis.
Section 5 concludes the paper.
2. System Model and Working Principle
2.1. Transmitter Design and Frame Structure
The transmitter employs a 16 × 16 RGB LED array (WS2812, WorldSemi Co., Ltd., Shenzhen, China) with a total of 256 independently controllable light-emitting units. Each data LED can be individually set to one of four colors: red (R), green (G), blue (B), or white (W). These four colors correspond to 2-bit information, as follows: 00 → R, 01 → G, 10 → B, and 11 → W. Check (parity) LEDs use green (G) and blue (B), each representing 1 bit (G → 0, B → 1). Page-number LEDs use red (R) and green (G), each representing 1 bit (R → 0, G → 1). The array is divided into 15 pure data rows and one hybrid function row (the 16th row), as illustrated in Figure 1.
Each row carries four ASCII characters and is divided into two coding groups. Each group uses seven consecutive LEDs to encode two characters. Because the most significant bit of a standard ASCII character is always 0, it is dropped, leaving seven bits per character. The two resulting 7-bit sequences are concatenated into a 14-bit data stream. Starting from the most significant bit, every two adjacent bits are mapped to the color of one LED in the group. This interleaved coding scheme allows seven LEDs to carry two characters simultaneously. At the end of each group (columns 8 and 16), a check LED is placed. If the total number of logic “1” bits among the 14 data bits is even, the check LED displays green (G); otherwise, it displays blue (B).
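The group coding rule can be illustrated with a short Python sketch. It follows the color mapping and even-parity rule stated in the text; the function and constant names are illustrative and do not come from the authors' implementation.

```python
# Colour maps as stated in the text: 2 bits per data LED, 1 bit per check LED.
DATA_COLORS = {"00": "R", "01": "G", "10": "B", "11": "W"}
CHECK_COLORS = {0: "G", 1: "B"}


def encode_group(two_chars: str) -> list[str]:
    """Encode two 7-bit ASCII characters as seven data LEDs plus one check LED."""
    if len(two_chars) != 2 or any(ord(c) > 127 for c in two_chars):
        raise ValueError("expected exactly two 7-bit ASCII characters")
    # Drop the always-zero MSB: 7 bits per character, 14 bits per group.
    bits = "".join(format(ord(c), "07b") for c in two_chars)
    # Every two adjacent bits, MSB first, select the colour of one LED.
    leds = [DATA_COLORS[bits[i:i + 2]] for i in range(0, 14, 2)]
    # Even number of ones -> G (0); odd number of ones -> B (1).
    leds.append(CHECK_COLORS[bits.count("1") % 2])
    return leds


print(encode_group("Hi"))  # ['B', 'G', 'R', 'G', 'B', 'B', 'G', 'G']
```

For "Hi", the 14-bit stream is 1001000 followed by 1101001, which contains six ones, so the check LED is green.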
The 16th row is a hybrid function row. The first seven LEDs carry the last two characters using the same coding scheme. The eighth LED serves as the check bit. LEDs in columns 9 through 16 represent an 8-bit binary page number, using only red (R) and green (G).
The transmitter refreshes at a frame rate of 5 fps with a frame dwell time of 0.2 s. Each frame transmits a total of 473 bits: data LEDs account for 434 bits (217 LEDs × 2 bits each); parity-check LEDs account for 31 bits (31 LEDs × 1 bit each); and page-number LEDs account for 8 bits (8 LEDs × 1 bit each). The total throughput (physical layer) is 2365 bps, calculated as 473 bits/frame × 5 frames/s. During BER evaluation, the transmitter cyclically sends a fixed, known test sequence. The receiver pre-calculates the expected color of every LED according to the coding rules and constructs a ground-truth matrix. The cumulative BER is then obtained by comparing each frame against this ground truth on an element-by-element basis. This approach prevents the BER evaluation from being coupled to multi-page message decoding logic, so that the results purely reflect the performance of the physical channel and the receiver algorithm.
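The bit budget above can be re-derived from the array layout. The following sketch only re-computes the figures given in the text; the variable names are illustrative.

```python
ROWS, COLS = 16, 16
FRAME_RATE_FPS = 5

data_leds = 15 * 14 + 7   # 15 data rows with 14 data LEDs each, plus 7 in row 16
check_leds = 15 * 2 + 1   # two check LEDs per data row, one in row 16
page_leds = 8             # columns 9 to 16 of row 16

# Every LED of the array has exactly one role.
assert data_leds + check_leds + page_leds == ROWS * COLS

bits_per_frame = 2 * data_leds + check_leds + page_leds
throughput_bps = bits_per_frame * FRAME_RATE_FPS
print(bits_per_frame, throughput_bps)  # 473 2365
```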
2.2. Receiver Processing Pipeline
The receiver uses a Hikrobot MV-CU120-10UC industrial camera (Hikrobot Co., Ltd., Hangzhou, China). The image acquisition resolution is set to 640 × 640 pixels, and the camera is connected to a computer via a USB 3.0 interface. The software environment is based on Python 3.9. Image processing relies on the OpenCV 4.11.0 and scikit-learn libraries; deep learning inference is based on the Ultralytics YOLOv8 framework.
The receiver processing pipeline is shown in Figure 2. To reduce the computational overhead, the system adopts a two-stage execution model. The initial calibration stage builds a stable LED grid. Once the “locked” stage is entered, the grid and transformation matrices are reused, and several modules are bypassed, thereby substantially shortening the per-frame processing time.
During the calibration stage, each raw image is first fed into the YOLOv8n-OBB model, which outputs the four corner coordinates of the oriented bounding box of the LED array. The system then crops the ROI accordingly. A coarse perspective correction is subsequently performed on the ROI using the OBB corner points to re-align the tilted LED panel. After the coarse correction, the first LED center detection is executed. Sub-pixel center coordinates are obtained through grayscale thresholding, contour extraction, and circularity filtering. The fine-correction module then uses these centers to compute a second homography matrix. This matrix is multiplied by the coarse-correction matrix to obtain the combined transformation matrix $H_{\text{comb}}$. A second LED center detection is then performed on the fine-corrected image, yielding a more accurate set of center coordinates. These coordinates are fed into K-Means clustering to build a 16 × 16 LED topological grid. Once the grid passes quality validation, it is locked. From this point onward, all of the modules enclosed by the two dashed boxes in Figure 2 are skipped. These modules include YOLO inference, coarse correction, the two LED center detection passes, fine correction, matrix concatenation, and K-Means clustering. ROI cropping continues to execute, but uses the previously cached OBB coordinates and no longer relies on new YOLO detections. The combined matrix $H_{\text{comb}}$ and the grid $G$ are read directly from the cache modules on the right side of Figure 2 and applied to every subsequent frame.
The color classification module samples HSV values at each grid intersection after locking. A cascaded decision tree based on the HSV color space is adopted. The first level eliminates unlit LEDs according to their brightness. The second level separates white LEDs from colored LEDs according to saturation. The third level classifies colored LEDs as red, green, or blue according to their hue. A 16 × 16 color matrix, i.e., the decoded data frame, is thus obtained. The recognition result of each frame is compared element-by-element with the ground-truth matrix, and the cumulative BER and frame error rate are updated in real time.
The processing paths before and after locking differ substantially. Before locking, every frame requires YOLO inference, two perspective transformations, grayscale segmentation, contour filtering, two-dimensional clustering, and color classification. After locking, only ROI cropping (using cached coordinates), a single combined-matrix mapping, fixed-grid sampling, and color classification remain, greatly reducing the computational overhead.
3. Core Algorithm Design
3.1. YOLO-OBB-Based Rotated Detection and ROI Extraction
The front-end of the system employs a YOLOv8n-OBB rotated object detector. Compared with axis-aligned horizontal bounding boxes, the rotated quadrilateral predicted by OBB tightly encloses the boundary of the LED array, even under composite distortion. This effectively prevents background pixels from entering the subsequent grayscale segmentation and contour detection steps, thereby eliminating the root cause of the clustering failure that occurs in horizontal-box schemes due to the inclusion of background noise.
The YOLOv8n-OBB detector used in this work was originally designed for oriented object detection in aerial and remote sensing imagery, where targets appear at arbitrary angles. In this paper, we adapt it to OCC by training the model to locate the four corner points of the LED array, producing an oriented bounding box that tightly encloses the array boundary even under in-plane rotation and perspective tilt. This yields a clean region of interest (ROI) without background interference, a key advantage over horizontal bounding boxes. However, the detector alone only provides a clean ROI; it does not correct geometric distortion. The novelty of our work is that we then feed this clean ROI into a cascaded two-stage homography correction module, which actively corrects the distorted LED array (i.e., restores it to a frontal, regularly spaced grid) under composite Yaw-Roll distortion. After correction, the LED grid becomes aligned to sub-pixel accuracy. A grid locking mechanism further eliminates repeated detection and correction in subsequent frames, achieving low steady-state latency and error-free transmission under the tested distortion conditions.
The order of the four corner points output by the OBB detector is not determined. The subsequent coarse correction stage requires the corners to be arranged in the order top-left, top-right, bottom-right, bottom-left so that source and destination points are correctly indexed. An extremum-based sorting strategy is adopted. The corner that minimizes x + y is assigned as top-left; the one that maximizes x + y as bottom-right; the one that maximizes x − y as top-right; and the one that minimizes x − y as bottom-left. This rule forces the corners into the sequence top-left → top-right → bottom-right → bottom-left, which establishes a deterministic correspondence with the four vertices of the destination rectangle used in coarse correction and ensures the correctness of the homography solution.
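The extremum-based rule can be sketched in a few lines of Python. The sketch assumes image coordinates (x to the right, y downward); the function name is illustrative.

```python
def sort_corners(corners):
    """Order four (x, y) corners as top-left, top-right, bottom-right, bottom-left.

    Image coordinates are assumed: x grows to the right, y grows downward.
    """
    if len(corners) != 4:
        raise ValueError("expected exactly four corner points")
    top_left = min(corners, key=lambda p: p[0] + p[1])      # smallest x + y
    bottom_right = max(corners, key=lambda p: p[0] + p[1])  # largest x + y
    top_right = max(corners, key=lambda p: p[0] - p[1])     # largest x - y
    bottom_left = min(corners, key=lambda p: p[0] - p[1])   # smallest x - y
    return [top_left, top_right, bottom_right, bottom_left]


shuffled = [(290, 230), (50, 80), (90, 270), (250, 40)]
print(sort_corners(shuffled))  # [(50, 80), (250, 40), (290, 230), (90, 270)]
```

Note that the rule becomes ambiguous as the in-plane rotation approaches 45°, where two corners of a square give the same x + y (or x − y); the Roll range tested in this work (0–40°) stays below that limit.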
After sorting, the axis-aligned bounding rectangle of the rotated quadrilateral is cropped as the ROI. The sorted corner coordinates are transformed into the ROI-local coordinate system and passed to the coarse correction module described in Section 3.2.
The detection model is trained on a self-built dataset. The dataset consists of 446 images of the 16 × 16 LED array, split into training (90%) and validation (10%) sets. Each image was manually annotated with oriented bounding boxes (OBB), and the labels were stored as a class_id followed by four normalized corner coordinates (x1, y1, x2, y2, x3, y3, x4, y4). Data augmentation followed the training configuration: HSV color space augmentation (hue shift ±1.5%, saturation shift ±70%, value shift ±40%), translation (±10%), scaling (±50%), horizontal flipping (50% probability), and mosaic (100% probability, disabled in the last 10 epochs). No rotation, shear, perspective, or mixup augmentations were applied. The dataset covers the full range of Yaw and Roll angles used in our experiments (Yaw 0–45°, Roll 0–40°). The dataset is not publicly available but can be obtained from the corresponding author upon reasonable request. The model achieves an mAP@0.5 of 0.995 on the validation set, with both precision and recall at approximately 1.00. This detection accuracy satisfies the requirements of the subsequent correction and clustering tasks.
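For readers who wish to reproduce a comparable training setup, the augmentation settings listed above can be written with the argument names used by the Ultralytics training API, and the label format can be parsed as shown. This is a sketch: the dataset file name and epoch count in the commented call are placeholders, not values reported in this paper.

```python
# Augmentation settings from the text, written with Ultralytics argument names.
AUGMENTATION = {
    "hsv_h": 0.015,      # hue shift, ±1.5%
    "hsv_s": 0.7,        # saturation shift, ±70%
    "hsv_v": 0.4,        # value shift, ±40%
    "translate": 0.1,    # translation, ±10%
    "scale": 0.5,        # scaling, ±50%
    "fliplr": 0.5,       # horizontal flip probability
    "mosaic": 1.0,       # mosaic probability
    "close_mosaic": 10,  # mosaic disabled for the last 10 epochs
    "degrees": 0.0,      # no rotation
    "shear": 0.0,        # no shear
    "perspective": 0.0,  # no perspective augmentation
    "mixup": 0.0,        # no mixup
}


def parse_obb_label(line):
    """Parse 'class_id x1 y1 x2 y2 x3 y3 x4 y4' into (class_id, [(x, y), ...])."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError("an OBB label needs a class_id and eight coordinates")
    coords = [float(v) for v in fields[1:]]
    return int(fields[0]), list(zip(coords[0::2], coords[1::2]))


# Hypothetical usage (data file and epoch count are placeholders):
# from ultralytics import YOLO
# YOLO("yolov8n-obb.pt").train(data="led_array_obb.yaml", epochs=100,
#                              imgsz=640, **AUGMENTATION)
```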
3.2. Cascaded Two-Stage Perspective Correction
The angular distortion between the LED array and the camera can be described by the perspective projection model $s\,\tilde{x} = K\,[R \mid t]\,\tilde{X}$, where $\tilde{x}$ and $\tilde{X}$ are the homogeneous image and world coordinates of a point, $s$ is a scale factor, K is the camera intrinsic matrix, R represents Yaw and Roll rotations, and t is a translation vector. The planar homography H used in our cascaded correction directly follows from this 3D geometry.
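The step from the projection model to a planar homography can be made explicit. The following is a sketch under two assumptions that are not stated in the text: the LED panel lies in the world plane $Z = 0$, and the camera is an ideal pinhole without lens distortion. Here $r_1, r_2, r_3$ denote the columns of $R$.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K \begin{bmatrix} r_1 & r_2 & r_3 & t \end{bmatrix}
    \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix}
  = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}
    \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix},
\qquad
H = K \begin{bmatrix} r_1 & r_2 & t \end{bmatrix}.
```

Because $H$ is a 3 × 3 matrix defined up to scale, it has eight degrees of freedom, which is why four point correspondences suffice to solve for it.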
Conventional geometric correction for OCC mostly adopts a single-stage strategy, which performs the perspective transformation in one step using either the bounding-box corners or the LED corners directly. The single-stage approach is acceptable when the distortion is mild. Under large-angle composite distortion, however, it faces a fundamental trade-off: detection-box corners are easy to obtain but lack sufficient spatial accuracy, whereas true LED corners offer higher precision but are difficult to extract reliably from a distorted image.
The proposed framework resolves this dilemma by cascading a coarse correction stage and a fine correction stage. The coarse stage uses OBB corners to rapidly remove the dominant perspective distortion, thereby providing a favorable initial condition for LED corner detection. The fine stage further improves the accuracy of extracted LED centers. The two stages are complementary, addressing the problems of “undetectable” and “imprecise,” respectively.
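Coarse correction. The sorted OBB corners are taken as the source control points, denoted $c_1$, $c_2$, $c_3$, $c_4$, corresponding to the top-left, top-right, bottom-right, and bottom-left corners. The destination rectangle vertices $c'_1, \dots, c'_4$ are determined from the average edge lengths, with a margin of 30 pixels kept on all four sides. The coarse homography matrix $H_{\text{coarse}}$ is solved from the four point correspondences:
$$\tilde{c}'_i \sim H_{\text{coarse}}\,\tilde{c}_i, \qquad i = 1, \dots, 4,$$
where the tilde denotes homogeneous coordinates and $\sim$ denotes equality up to scale. Applying $H_{\text{coarse}}$ to the ROI image $I_{\text{ROI}}$ yields the coarsely corrected image $I_{\text{coarse}}$:
$$I_{\text{coarse}} = \mathcal{W}\left(I_{\text{ROI}},\, H_{\text{coarse}}\right),$$
where $\mathcal{W}$ denotes perspective warping.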
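The coarse stage can be sketched with NumPy as follows. The sketch solves the homography directly from four correspondences with the last entry fixed to 1. It assumes that "average edge lengths" means the mean of the two opposite edges in each direction; function names are illustrative. In an OpenCV-based pipeline, `cv2.getPerspectiveTransform` and `cv2.warpPerspective` would play the same roles, although the text does not state which calls were used.

```python
import numpy as np


def homography_from_points(src, dst):
    """Solve the 3x3 homography H (with h33 = 1) mapping four src points onto dst."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), and likewise for v.
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)


def coarse_destination(corners, margin=30.0):
    """Destination rectangle from average opposite-edge lengths, offset by the margin."""
    tl, tr, br, bl = (np.asarray(p, dtype=float) for p in corners)
    width = (np.linalg.norm(tr - tl) + np.linalg.norm(br - bl)) / 2.0
    height = (np.linalg.norm(bl - tl) + np.linalg.norm(br - tr)) / 2.0
    return [(margin, margin), (margin + width, margin),
            (margin + width, margin + height), (margin, margin + height)]


def apply_homography(H, point):
    """Map one (x, y) point through H and de-homogenise."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return x / w, y / w


corners = [(50, 80), (250, 40), (290, 230), (90, 270)]  # sorted TL, TR, BR, BL
dst = coarse_destination(corners)
H_coarse = homography_from_points(corners, dst)
print(np.round(apply_homography(H_coarse, corners[0]), 3))  # close to (30, 30)
```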
Fine correction. A set of LED centers P is extracted from . The same method is used to obtain the four corner points, denoted , , , . To enhance boundary robustness, each edge of the source quadrilateral is extended outward by 1.2 times the average LED spacing. The destination rectangle is set to 1.1 times the dimensions of . The fine homography matrix is then solved and applied to generate the fine-corrected image .
Matrix concatenation. Two successive resampling operations would introduce accumulated interpolation errors and increase the computational cost. To avoid this, the two homography matrices are combined according to their application order, with the coarse matrix applied first:
$$H_{\text{comb}} = H_{\text{fine}}\,H_{\text{coarse}}.$$
After this concatenation, any source point $x$ (in homogeneous coordinates) can be mapped with a single matrix multiplication:
$$x' \sim H_{\text{comb}}\,x.$$
Once the grid is locked, $H_{\text{comb}}$ is cached and reused. Every subsequent frame only needs to apply this single mapping to achieve the two-stage correction accuracy.
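The equivalence between two successive mappings and one combined mapping can be checked numerically. The matrices below are illustrative values only, not measured ones; the point of the sketch is the multiplication order, in which the stage applied first stands on the right.

```python
import numpy as np


def to_pixel(H, point):
    """Map one (x, y) point through a homography and de-homogenise."""
    x, y, w = H @ np.array([point[0], point[1], 1.0])
    return np.array([x / w, y / w])


# Illustrative matrices only; real values come from the two correction stages.
H_coarse = np.array([[1.10, 0.20, -15.0],
                     [-0.05, 1.05, 8.0],
                     [2e-4, 1e-4, 1.0]])
H_fine = np.array([[0.98, 0.01, 3.0],
                   [0.02, 1.01, -2.0],
                   [-1e-5, 3e-5, 1.0]])

# The coarse stage is applied first, so it sits on the right of the product.
H_comb = H_fine @ H_coarse

p = (120.0, 85.0)
two_step = to_pixel(H_fine, to_pixel(H_coarse, p))
one_step = to_pixel(H_comb, p)
assert np.allclose(two_step, one_step)
```

For point coordinates the two routes agree up to floating-point error. The practical benefit for images is that the pixels are interpolated once instead of twice.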
The framework also provides geometric calibration metrics. The four edges of the corrected LED grid quadrilateral are extracted: the top edge $L_t$, the bottom edge $L_b$, the left edge $L_l$, and the right edge $L_r$. The ratios of opposite sides are defined as
$$R_h = \frac{L_t}{L_b}, \qquad R_v = \frac{L_l}{L_r}.$$
The center offset is defined as $d = \lVert c_{\text{grid}} - c_{\text{img}} \rVert_2$, i.e., the Euclidean distance between the grid center and the image center. Under ideal correction, $R_h \approx R_v \approx 1$ and $d \approx 0$.
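The calibration metrics can be computed from the four corners of the corrected grid. The sketch below assumes that the horizontal ratio is top over bottom, the vertical ratio is left over right, and the grid center is the mean of the four corners; these conventions and the function name are assumptions for illustration.

```python
import math


def calibration_metrics(corners, image_size):
    """Return (R_h, R_v, d) for grid corners ordered TL, TR, BR, BL."""
    tl, tr, br, bl = corners
    top, bottom = math.dist(tl, tr), math.dist(bl, br)
    left, right = math.dist(tl, bl), math.dist(tr, br)
    # Grid centre taken as the mean of the four corners (an assumption).
    grid_center = (sum(p[0] for p in corners) / 4.0,
                   sum(p[1] for p in corners) / 4.0)
    image_center = (image_size[0] / 2.0, image_size[1] / 2.0)
    return top / bottom, left / right, math.dist(grid_center, image_center)


# A perfectly rectified, centred grid gives ratios of 1 and zero offset.
print(calibration_metrics([(30, 30), (610, 30), (610, 610), (30, 610)], (640, 640)))
# (1.0, 1.0, 0.0)
```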
3.3. LED Center Detection and Grid Locking
LED center detection is performed on the fine-corrected image. The grayscale image is binarized with a fixed threshold Tg = 20. This threshold is determined from the statistical distribution of the background grayscale in the ROI when the LEDs are off, and it effectively separates the dark background from the lit LED elements. Connected components are filtered by area and circularity. The area threshold is A > 2 pixels, and the circularity threshold is C > 0.5. After filtering, sub-pixel center coordinates are computed using spatial image moments. Based on the area distribution, outliers are removed by retaining only those points whose areas fall within , yielding the valid center set .
The valid center set is then grouped into a 16 × 16 regular grid by K-Means clustering. K-Means partitions the point set into $K$ clusters by minimizing the within-cluster sum of squared errors:
$$J = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2,$$
where $C_k$ is the set of points assigned to the k-th cluster and $\mu_k$ is the cluster center. In this system, $K = 16$. Clustering is first performed along the vertical axis to obtain 16 rows, and then a second clustering is performed along the horizontal axis within each row to obtain 16 columns. The number of K-Means iterations is fixed at 10. Missing rows are filled by interpolation using the spacing of adjacent rows. Grid construction is considered to have failed if the number of valid points is less than half of the total, or if more than two rows are missing. After clustering, the grid intersections closely coincide with the actual center positions of the LED elements.
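The two-pass clustering can be sketched in plain Python with a one-dimensional Lloyd iteration. This is an illustration, not the authors' implementation (which relies on scikit-learn): the quantile initialization, the use of the row mean as the vertical coordinate, and the function names are assumptions, and row interpolation and the failure checks are omitted.

```python
def kmeans_1d(values, k, iterations=10):
    """Lloyd's algorithm on scalars; returns cluster centres and a label per value."""
    ordered = sorted(values)
    # Initialise centres at evenly spaced quantiles (an assumption).
    centres = [ordered[int((i + 0.5) * len(ordered) / k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: (v - centres[c]) ** 2) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return centres, labels


def build_grid(points, n=16):
    """Cluster (x, y) centres into n rows by y, then each row into n columns by x."""
    _, row_labels = kmeans_1d([y for _, y in points], n)
    grid = []
    for r in range(n):
        row = [p for p, lab in zip(points, row_labels) if lab == r]
        if not row:
            raise ValueError(f"row {r} is empty; interpolation is not sketched here")
        col_centres, _ = kmeans_1d([x for x, _ in row], n)
        row_y = sum(y for _, y in row) / len(row)
        grid.append([(cx, row_y) for cx in col_centres])
    return grid


# Synthetic 16 x 16 grid with 36-pixel pitch and a small deterministic jitter.
points = [(40 + 36 * j + 0.3 * ((i + j) % 3 - 1), 40 + 36 * i + 0.3 * ((i * j) % 3 - 1))
          for i in range(16) for j in range(16)]
grid = build_grid(points)
print(len(grid), len(grid[0]))  # 16 16
```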
Once the grid passes quality validation, the system enters the locked state. The grid is denoted $G = \{g_{i,j}\}$, where $g_{i,j}$ represents the expected image coordinate of the LED at logical position (i, j) in the fine-corrected image. This grid is stored in the cache. Thereafter, the per-frame processing is simplified to four operations: cropping the ROI (using cached coordinates), applying the cached combined transformation matrix, sampling HSV values at the fixed grid positions, and performing color classification followed by a BER update. All LED detection and clustering operations are bypassed. If the BER exceeds a preset threshold over several consecutive frames, the system automatically triggers grid re-locking.
Re-locking mechanism. After grid locking, the system monitors the bit error rate (BER) on a per-frame basis. If the BER exceeds $10^{-2}$ for five consecutive frames, the lock is considered invalid. The system then automatically exits the locked state and restarts the full initialization pipeline (YOLO detection, LED center detection, K-Means clustering, and color classification). The measured average recovery time for this complete re-initialization is 453.01 ms on the hardware platform described in Section 4.4. During the recovery process, a few frames may be lost, but subsequent decoding resumes correctly. Under static geometric distortion, long-term operation (over 4 h) showed no observable grid drift. The stability under dynamic perturbations (e.g., slight camera movement) has not been systematically evaluated in this study and remains a direction for future investigation.
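The re-locking trigger amounts to a counter of consecutive bad frames. A minimal sketch, with the threshold and frame count taken from the text and an illustrative class name:

```python
class LockMonitor:
    """Signals a re-lock after `patience` consecutive frames with BER above `threshold`."""

    def __init__(self, threshold=1e-2, patience=5):
        self.threshold = threshold
        self.patience = patience
        self.bad_frames = 0

    def update(self, frame_ber):
        """Feed one per-frame BER; return True when the lock should be dropped."""
        # A single good frame resets the run of bad frames.
        self.bad_frames = self.bad_frames + 1 if frame_ber > self.threshold else 0
        if self.bad_frames >= self.patience:
            self.bad_frames = 0
            return True
        return False


monitor = LockMonitor()
bers = [0.02] * 4 + [0.0] + [0.02] * 5
print([monitor.update(b) for b in bers])
# [False, False, False, False, False, False, False, False, False, True]
```

In the example, the first run of four bad frames is interrupted by a clean frame, so only the second run of five triggers re-locking.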
3.4. Color Classification
The color classification module samples HSV values at the 256 fixed grid intersections. Classification employs a three-level cascaded decision tree, whose flow is illustrated in Figure 3.
The first level is lit/unlit detection. A dynamic threshold is computed from the median of the V channel within the sampling region as . LEDs for which the proportion of bright pixels is less than 8% are classified as unlit (N). The remaining LEDs proceed to the second level.
The second level separates white from colored LEDs. The mean saturation S of the lit LED is calculated. White LEDs, whose three channels emit simultaneously, usually have S in the 40–100 range. Colored LEDs have S concentrated in the 160–240 range. With 120 as the boundary, an LED with $S < 120$ is classified as white (W); otherwise, the LED proceeds to the third level.
The third level performs color identification. The mean hue H is evaluated sequentially. If , the LED is classified as red (R). Otherwise, if , it is classified as green (G). Otherwise, if , it is classified as blue (B). If H satisfies none of the above, a majority vote is performed by counting the proportions of pixels belonging to each color category. The majority vote may output red, green, or blue. Only when all three color pixel counts are zero is the LED classified as N.
This cascaded scheme requires no training. The dynamic brightness threshold adapts to changes in ambient lighting. The classification result is a 16 × 16 color matrix, i.e., the decoded data frame. This matrix is then compared element-by-element with the ground-truth matrix to update the cumulative BER and FER.
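The structure of the cascade can be sketched as follows. The 8% lit fraction and the saturation boundary of 120 are taken from the text. The hue ranges are ASSUMED values for illustration, based on the 8-bit OpenCV convention (hue in 0–179), and should not be read as the thresholds used in this work. The brightness threshold itself is not modelled; the sketch takes the fraction of bright pixels as an input.

```python
def classify_led(bright_fraction, mean_s, mean_h, votes=None):
    """Three-level cascade: unlit -> white -> hue. Returns 'N', 'W', 'R', 'G' or 'B'."""
    # Level 1: fewer than 8% bright pixels means the LED is unlit.
    if bright_fraction < 0.08:
        return "N"
    # Level 2: low saturation means white (boundary 120 from the text).
    if mean_s < 120:
        return "W"
    # Level 3: hue ranges below are ASSUMED (OpenCV 8-bit hue, 0-179).
    if mean_h < 10 or mean_h >= 170:
        return "R"
    if 35 <= mean_h <= 85:
        return "G"
    if 100 <= mean_h <= 130:
        return "B"
    # Fallback: majority vote over per-colour pixel counts; all zero means 'N'.
    votes = votes or {}
    best = max(("R", "G", "B"), key=lambda c: votes.get(c, 0))
    return best if votes.get(best, 0) > 0 else "N"


print(classify_led(0.5, 70, 0))     # W
print(classify_led(0.5, 200, 60))   # G
print(classify_led(0.5, 200, 95, {"R": 1, "G": 2, "B": 7}))  # B
```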
4. Experiments and Results Analysis
4.1. Experimental Platform and Scheme Configuration
The transmitter consists of a 16 × 16 WS2812 flexible LED array and an ESP32-WROOM-32 development board (Espressif Systems Co., Ltd., Shanghai, China), powered by a 5 V DC supply. The receiver employs a Hikrobot MV-CU120-10UC industrial camera (12-megapixel CMOS sensor, 12 mm fixed-focus lens, Hikrobot Co., Ltd., Hangzhou, China) mounted on a tripod. The image output resolution is 640 × 640 pixels. Experiments were conducted under typical indoor office lighting conditions. The ambient illuminance in front of the camera, with the LED array completely turned off, was approximately 130–160 lux. The communication distance was fixed at 1 m. The experimental setup is shown in Figure 4.
The key system parameters are summarized in Table 1.
The tilt angle (Yaw) and in-plane rotation angle (Roll) of the LED panel were preset using a protractor. After manually rotating the panel to the marked positions, the angles were verified with a smartphone inclinometer. The actual angular deviation from the preset values did not exceed ±2°. By combining different Yaw and Roll preset values, a total of 16 cross-test conditions were formed.
To clearly define the comparison schemes, the three configurations are defined as follows:
Scheme A (Baseline): A manually selected fixed rectangle is used as the ROI. No geometric correction is performed, and color sampling is carried out on a uniformly spaced fixed grid.
Scheme B (HBB + single-stage correction): To establish a representative baseline, Scheme B follows the deep learning processing paradigm commonly adopted in recent OCC literature (e.g., Nguyen et al. [15], Sitanggang et al. [30]). This scheme uses a conventional YOLOv8n horizontal bounding box (HBB) for ROI extraction, followed by a standard single-stage perspective homography transformation. Moreover, LED center detection and clustering are performed independently on every frame without any frame-to-frame acceleration strategy such as locking. This scheme therefore represents the traditional frame-by-frame processing type, with single-stage correction and no grid locking.
Scheme C (Proposed): YOLOv8n-OBB rotated bounding box detection is adopted. A cascaded coarse-to-fine two-stage correction is applied. After the first frame successfully completes clustering, the grid is locked. Subsequent frames reuse the cached grid and the combined transformation matrix.
The evaluation metrics include the bit error rate (BER), frame error rate (FER), 95% confidence interval (CI), and the average per-frame processing latency. All BER measurements are based on no fewer than 1500 valid frames. With a maximum of 473 bits per frame, the cumulative number of transmitted bits over 1500 frames is approximately $7.1 \times 10^{5}$ (473 × 1500 = 709,500). When the observed BER = 0, the half-width of the 95% confidence upper bound is
, i.e., the true BER does not exceed this upper bound with 95% confidence. Therefore, BER entries marked as 0 in
Table 2 correspond to a true BER lower than
.
4.2. BER Performance Comparison
Table 2 summarizes the BER and FER data of the three schemes under 16 distortion combinations. The 95% confidence interval of the BER is calculated as follows:
where
is the observed BER and
N is the cumulative number of transmitted bits. When the observed BER = 0, the 95% confidence upper bound is
, i.e., the true BER does not exceed
at the 95% confidence level.
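As an illustration of how such bounds can be computed, the sketch below uses two standard estimators: the normal-approximation (Wald) interval for a non-zero observed BER and the exact one-sided bound for zero observed errors. Both are assumptions made for illustration, since they may differ from the exact expressions used in this work.

```python
import math


def wald_interval(errors, n_bits, z=1.96):
    """Normal-approximation 95% interval for the BER (degenerate when errors == 0)."""
    p = errors / n_bits
    half = z * math.sqrt(p * (1.0 - p) / n_bits)
    return max(0.0, p - half), p + half


def zero_error_upper_bound(n_bits, confidence=0.95):
    """One-sided bound when no errors are seen: solve (1 - p)^n = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_bits)


n_bits = 473 * 1500  # 709,500 bits over 1500 frames
upper = zero_error_upper_bound(n_bits)
print(f"{upper:.2e}")  # roughly 4.2e-06, close to the rule of three, 3 / n_bits
```

The Wald interval collapses to zero width when no errors are observed, which is why a separate upper bound is needed for the BER = 0 entries.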
Three progressive conclusions can be drawn from
Table 2. First, Scheme A exhibits severe bit errors or complete failure under any non-zero angle, with a residual BER of
even at 0°. BER generally degrades with distortion, but not strictly monotonically. Second, Scheme B (representing the mainstream HBB paradigm) achieves error-free decoding under pure perspective tilt (Yaw-only) but fails completely when joint in-plane rotation (Roll) is introduced. This failure exposes the inherent vulnerability of standard axis-aligned object detectors in unconstrained OCC channels. When the LED array undergoes Roll rotation, the conventional HBB inevitably engulfs extensive out-of-domain background pixels, such as ambient reflections. This severe spatial noise directly corrupts the subsequent binarization and clustering processes, leading to synchronization and decoding failure. The vulnerability of Scheme B under joint Yaw-Roll distortions fully demonstrates that standard deep-learning toolchains are inadequate for complex geometric channels, thereby justifying the necessity of our proposed OBB-driven framework. Third, Scheme C achieves error-free transmission under all 16 tested conditions.
To intuitively illustrate the performance difference between Scheme B and Scheme C, Figure 5 presents the BER comparison of the two schemes in the form of a heatmap across all 16 angle combinations.
4.3. Ablation Study and Correction Accuracy Analysis
To verify the necessity of each stage in the cascaded framework, an ablation study was conducted using the Yaw 30°, Roll 30° condition as an example. Each ablation configuration was measured three times, and the average values are reported in
Table 3. “Coarse only” refers to performing only the OBB-corner-based coarse correction while skipping the fine correction; “Fine only” refers to skipping the coarse correction and directly detecting LED corners on the original distorted ROI for fine correction. With coarse correction alone, although the edge-length ratios approach 1.03, residual sub-pixel deviations still result in a BER on the order of
. With fine correction alone, grid construction fails completely because the LED corners cannot be reliably extracted under composite distortion. The cascaded strategy optimizes the ratios of opposite sides to 0.998 and achieves error-free transmission.
Figure 6 presents the results of the ablation study in a dual-axis chart.
Table 4 lists the geometric calibration metrics of Scheme C under three representative conditions. Even under composite distortion, $R_h$ and $R_v$ remain stable between 0.997 and 1.004, and the center offset is less than 2 pixels, confirming the high-precision alignment capability of the cascaded correction.
4.4. Processing Latency and Real-Time Performance
Hardware and inference configuration. All latency measurements were performed on the following platform:
CPU: AMD Ryzen 7 4800U (8 cores, 16 threads, base frequency 1.80 GHz)
RAM: 16 GB
GPU: None (only integrated AMD Radeon Graphics, not used for computation)
CUDA acceleration: Not used (inference was executed on CPU only)
YOLO backend: Ultralytics YOLOv8 framework with PyTorch 2.7.0 backend
Image resolution: 640 × 640 pixels
All compared schemes (Schemes A, B, and C) were evaluated on the same platform to ensure a fair comparison.
The average per-frame processing latency was measured under the same Yaw 30°, Roll 30° condition used in the ablation study. Scheme A, which had already failed under this condition, was excluded from the latency comparison. The results are shown in
Figure 7.
Scheme B requires a complete run of YOLOv8n horizontal-box inference, single-stage perspective correction, LED center detection, K-Means clustering, and color classification for every frame, consuming 87.3 ms.
The latency of Scheme C exhibits a distinct two-stage characteristic. Before locking, the first frame completes the entire initialization pipeline, taking approximately 150 ms. After locking, each frame retains only ROI cropping, combined matrix mapping, fixed-grid sampling, and color classification, reducing the latency to 78.9 ms, a reduction of approximately 47% compared with the pre-locking stage.
The post-locking latency remains stable in the 76–83 ms range across all 16 distortion conditions, independent of the degree of distortion. The four operations after locking all have fixed computational complexity: ROI cropping is an image array slicing operation, combined matrix mapping is a fixed-resolution perspective transformation, grid sampling computes the neighborhood mean for 256 positions, and color classification traverses a three-level decision tree. None of these involves iterative convergence or object detection, so the per-frame cost is bounded and meets the real-time requirement.
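The fixed-cost grid sampling step can be sketched as follows. This is an illustration only: it assumes the 256 positions form a uniform 16 × 16 grid over the rectified image, and the window size and function name are hypothetical.

```python
import numpy as np

def sample_grid_means(image, grid_shape=(16, 16), half_window=2):
    """Mean color in a small window around each cell center of a uniform grid.
    image: H x W x C array that has already been rectified."""
    h, w, channels = image.shape
    rows, cols = grid_shape
    # Cell centers of the locked grid (assumed uniform after correction).
    ys = ((np.arange(rows) + 0.5) * h / rows).astype(int)
    xs = ((np.arange(cols) + 0.5) * w / cols).astype(int)
    means = np.empty((rows, cols, channels), dtype=float)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            patch = image[max(y - half_window, 0): y + half_window + 1,
                          max(x - half_window, 0): x + half_window + 1]
            means[i, j] = patch.reshape(-1, channels).mean(axis=0)
    return means

if __name__ == "__main__":
    frame = np.zeros((320, 320, 3))
    frame[:160] = (255, 0, 0)  # synthetic frame: top half one color
    print(sample_grid_means(frame).shape)
```

The loop always visits the same number of positions with the same window size, so its cost does not depend on the distortion angle, which is consistent with the stable post-locking latency reported above.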
At a frame rate of 5 fps, the available processing window per frame is 200 ms. The post-locking latency of 78.9 ms for Scheme C occupies only about 39% of the frame period, leaving ample computational margin for higher-order modulation or more complex decoding in the future.
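The budget arithmetic above can be reproduced with a few lines. The helper names are hypothetical; the latency values are the ones reported in this section.

```python
def frame_budget_ms(fps):
    """Processing window available per frame, in milliseconds."""
    return 1000.0 / fps

def utilization(latency_ms, fps):
    """Fraction of the frame period consumed by processing."""
    return latency_ms / frame_budget_ms(fps)

def relative_reduction(before_ms, after_ms):
    """Relative latency reduction going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms

if __name__ == "__main__":
    print(f"budget at 5 fps: {frame_budget_ms(5):.0f} ms")
    print(f"post-locking utilization: {utilization(78.9, 5):.1%}")
    print(f"reduction vs. pre-locking (150 ms): {relative_reduction(150.0, 78.9):.1%}")
    print(f"reduction vs. Scheme B (87.3 ms): {relative_reduction(87.3, 78.9):.1%}")
```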
Computational complexity and parameter count comparison. Our YOLOv8n-OBB detector has approximately 3.1 million parameters and costs about 9.1 billion FLOPs for a 640 × 640 input (estimated by scaling the official benchmark value of 23.3 billion FLOPs at 1024 × 1024 resolution by the area ratio (640/1024)² ≈ 0.39). After grid locking, the core decoding per frame (perspective warp + color classification) takes 45.85 ms on average, consisting of 10.46 ms for warping and 35.37 ms for color classification. Even when visualization and GUI overhead are included, the total per-frame latency is 78.91 ms, well below the 200 ms budget available at 5 fps. All heavy operations (YOLO inference, coarse/fine correction, LED center detection, K-Means clustering) are executed only once during initialization.
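The FLOPs estimate follows from scaling a reference cost by the ratio of input areas, which is a common approximation for convolutional detectors. A minimal sketch, with a hypothetical helper name:

```python
def scale_flops_by_area(ref_flops, ref_side, new_side):
    """Approximate FLOPs at a new square input size, assuming cost grows
    in proportion to the number of input pixels."""
    return ref_flops * (new_side / ref_side) ** 2

if __name__ == "__main__":
    estimate = scale_flops_by_area(23.3e9, ref_side=1024, new_side=640)
    print(f"{estimate / 1e9:.1f} billion FLOPs")
```

This is an estimate rather than a measured value, since layers whose cost does not scale with resolution are ignored.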
Compared with horizontal-bounding-box (HBB) based deep learning detectors (e.g., Nguyen and Jang [
15], Jocher et al. [
16]), which run detection on every frame and do not handle geometric distortion, our method offers fundamental advantages. Traditional grayscale thresholding [
14] has low complexity but fails under composite distortion. Other related works, such as the derivative-based demodulation method [
23] and the gapless sampling method [
24], are designed for symbol detection or localization and do not include geometric correction or frame locking. In contrast, our OBB tightly encloses the LED array, and the cascaded two-stage homography correction enables BER = 0 under composite 45° Yaw and 40° Roll distortion. Moreover, the grid locking mechanism executes YOLO inference only once; after that, each frame requires only a fixed set of simple operations (warping and color classification), resulting in a constant per-frame workload. Hence, with a parameter count similar to that of HBB detectors, our scheme achieves error-free transmission under composite distortion at a steady-state latency of 78.9 ms on a CPU-only platform. Based on our literature review, this combination of features has not been reported in previous OCC works.
4.5. Quantitative Comparison with State-of-the-Art (SOTA) Schemes
To position the proposed system relative to prior work, this section compares it with recent representative SOTA OCC and MIMO-VLC schemes. Because the cited studies differ in test environment and hardware, Scheme B is included as a baseline measured on our own platform. Data for other studies [
28,
29,
30,
31,
32,
33] are directly cited from their original publications, as detailed in
Table 5.
By comparing the quantitative data in
Table 5, two core technical boundaries can be defined:
Trade-off between distortion tolerance and hardware cost: Ref. [
29] relies on a high-cost adaptive liquid lens hardware route to mitigate multi-angle tilts, achieving a BER of
. In contrast, the proposed scheme achieves distortion correction purely on the receiver side in software, using the oriented bounding box of YOLOv8n-OBB to delimit the region of the tilted LED array and suppress background noise at its edges. Traditional horizontal bounding box (HBB) architectures (e.g., Scheme B and Ref. [
30]) suffer from a complete collapse of topological synchronization under in-plane rotation (Roll) distortion. Conversely, without requiring specialized optical components or transmitter-side modifications [
32], the proposed scheme achieves error-free decoding (BER = 0) under extreme composite distortions of 45° Yaw and 40° Roll.
Trade-off between computational efficiency and hardware overhead: Under the constraints of a high-density topology (256 nodes) and a CPU-only platform, the proposed scheme uses a spatio-temporally decoupled pipeline in which detection and correction run once and only decoding runs per frame. This brings the steady-state per-frame latency to 78.9 ms, about 10% lower than Scheme B (87.3 ms), which repeats full detection and correction on every frame. This demonstrates a balance between geometric robustness and steady-state efficiency on low-cost edge platforms.
5. Discussion
The proposed scheme successfully balances geometric robustness and computational efficiency. In the spatial domain, YOLOv8n-OBB provides a background-free ROI, overcoming the limitation of HBB methods under rotation. In the mathematical domain, two-stage homography matrices are cascaded into a single transformation, enabling sub-pixel alignment with only one resampling per frame after locking. This pure software receiver architecture requires no hardware modification.
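Cascading the two correction stages amounts to multiplying their 3 × 3 homography matrices, so that a point (or image) is transformed once instead of twice. The sketch below uses illustrative matrices and hypothetical function names; it is not the implementation used in the experiments.

```python
import numpy as np

def compose_homographies(h_coarse, h_fine):
    """Single matrix equivalent to applying h_coarse first, then h_fine."""
    h = np.asarray(h_fine, dtype=float) @ np.asarray(h_coarse, dtype=float)
    return h / h[2, 2]  # normalize scale; assumes h[2, 2] is nonzero

def apply_homography(h, points):
    """Map N x 2 points through a homography using homogeneous coordinates."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ np.asarray(h, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]

if __name__ == "__main__":
    # Illustrative values only.
    h_coarse = np.array([[1.10, 0.05, -3.0], [0.02, 0.95, 4.0], [1e-4, 2e-4, 1.0]])
    h_fine = np.array([[1.00, 0.01, 0.5], [-0.01, 1.00, -0.3], [1e-5, 0.0, 1.0]])
    corners = [[0, 0], [639, 0], [639, 639], [0, 639]]
    two_step = apply_homography(h_fine, apply_homography(h_coarse, corners))
    one_step = apply_homography(compose_homographies(h_coarse, h_fine), corners)
    print(np.allclose(two_step, one_step))
```

With an image library such as OpenCV, the combined matrix would typically be passed once to `cv2.warpPerspective`, which avoids the interpolation loss of resampling the frame twice.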
The grid locking strategy is key to real-time performance. All heavy operations (YOLO inference, homography recomputation, K-Means clustering) are executed only once during initialization, reducing the steady-state per-frame latency to 78.9 ms on a pure CPU platform, well below the 200 ms budget for 5 fps. Static tests over 4 h showed no grid drift. A BER-monitored re-locking mechanism provides autonomous recovery capability.
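One possible shape for a BER-monitored re-locking loop is sketched below. It assumes that a per-frame BER estimate is available at the receiver (for example, from known pilot symbols or a checksum); the class name, threshold, and window length are hypothetical and are not taken from the experiments.

```python
from collections import deque

class GridLock:
    """Tracks recent BER estimates and signals when the grid must be re-locked."""

    def __init__(self, ber_threshold=1e-3, window=10):
        self.ber_threshold = ber_threshold  # hypothetical trigger level
        self.recent = deque(maxlen=window)  # sliding window of BER estimates
        self.locked = False

    def lock(self):
        """Call after the initialization pipeline has rebuilt the grid."""
        self.recent.clear()
        self.locked = True

    def update(self, frame_ber):
        """Record one BER estimate; return True if initialization should run."""
        if not self.locked:
            return True
        self.recent.append(frame_ber)
        window_full = len(self.recent) == self.recent.maxlen
        mean_ber = sum(self.recent) / len(self.recent)
        if window_full and mean_ber > self.ber_threshold:
            self.locked = False  # drop the lock; the next frame re-initializes
            return True
        return False
```

Averaging over a window rather than reacting to a single frame is a design choice in this sketch, intended to avoid re-running the expensive initialization on isolated errors.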
The current experiments were conducted under fixed conditions (1 m distance, indoor lighting, 5 fps, static distortion). While these are sufficient to validate the core innovations, practical OCC scenarios involve varying distances, illumination, dynamic motion, and camera settings. Systematic evaluation under such diverse conditions is left for future work. Nevertheless, this work bridges the gap between theoretical 3D geometric analyses and a deployable, software-only receiver-side correction for composite distortion.