Deep Learning-Assisted Autofocus for Aerial Cameras in Maritime Photography
Abstract
1. Introduction
2. System Architecture
2.1. Focusing Principle
2.2. System Design
3. Focusing Algorithm
3.1. Coarse Focusing Positioning Based on Deep Learning
3.1.1. Network Architecture and Model Design
- Accuracy-Speed Trade-off: MobileNetV2 achieves the best balance between accuracy (MAE = 8.5 encoder units) and inference speed (38 ms). Although EfficientNet-Lite0 offers slightly higher accuracy, it increases inference time by 37%.
- Deployment Maturity: MobileNetV2 has the most mature deployment toolchain on embedded platforms, with stable conversion to ONNX/TensorRT.
- Memory Footprint: With only 2.3 M parameters, it operates reliably under the 4 GB memory constraint of RK3588J, leaving ample headroom.
- A global average pooling layer averages the final feature map (16 × 16 × 1280) across spatial dimensions to produce a 1280-dimensional feature vector. This reduces the parameter count and enhances the model’s robustness to spatial translation.
- A fully connected layer with 512 neurons (ReLU activation, Dropout = 0.3) is placed after the pooling layer; it integrates the pooled features and applies a non-linear transformation. A minimal PyTorch sketch of this architecture follows the list.
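The sketch below shows one way to realize the regressor described above, assuming a stock torchvision MobileNetV2 backbone with its stem adapted to single-channel 512 × 512 input. The class name FocusRegressor and the stem replacement are illustrative; the paper's exact block configuration (see the architecture table) may differ slightly from the stock torchvision model.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class FocusRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = mobilenet_v2(weights=None)
        # Replace the stem so the network accepts 1-channel grayscale images.
        backbone.features[0][0] = nn.Conv2d(1, 32, kernel_size=3, stride=2,
                                            padding=1, bias=False)
        self.features = backbone.features           # ends in 1280 channels
        self.pool = nn.AdaptiveAvgPool2d(1)         # global average pooling
        self.head = nn.Sequential(                  # regression head from the table
            nn.Linear(1280, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.3),
            nn.Linear(512, 1),                      # linear output: encoder position
        )

    def forward(self, x):                           # x: (N, 1, 512, 512)
        f = self.pool(self.features(x)).flatten(1)  # (N, 1280) feature vector
        return self.head(f).squeeze(1)              # (N,) predicted encoder position
```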
3.1.2. Dataset Construction and Model Training Strategy
- Geometric Transformations: Random horizontal/vertical flipping (probability: 50%), small-angle random rotation (±5°, probability: 30%), and random translation (±10 pixels).
- Photometric Transformations: Random adjustment of image brightness (scaling factor range: [0.9, 1.1]) and contrast (scaling factor range: [0.9, 1.1]) to simulate imaging variations under different lighting conditions.
- Noise Injection: Gaussian noise (σ = 0.01) is added to improve the model’s robustness to sensor noise. A sketch of the full augmentation policy follows this list.
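A hedged torchvision rendering of the augmentation policy above, assuming PIL grayscale inputs on 512-pixel frames; the transform ordering, interpolation defaults, and clamping after noise injection are assumptions, not specifications from the paper.

```python
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.RandomApply(
        [transforms.RandomRotation(degrees=5)], p=0.3),       # ±5°, 30% chance
    transforms.RandomAffine(degrees=0,
                            translate=(10 / 512, 10 / 512)),  # ±10 px translation
    transforms.ColorJitter(brightness=(0.9, 1.1),             # lighting variation
                           contrast=(0.9, 1.1)),
    transforms.ToTensor(),
    # Inject Gaussian sensor noise (σ = 0.01) and keep values in [0, 1].
    transforms.Lambda(lambda t: (t + 0.01 * torch.randn_like(t)).clamp(0, 1)),
])
```

Note that flips and small translations leave the focus-position label unchanged, which is what makes these geometric augmentations safe for a regression target.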
3.1.3. Coarse Focusing Inference Pipeline
- Image Capture: The imaging sensor captures a current image of the built-in resolution chart.
- Forward Propagation: This image is fed into the deployed lightweight MobileNetV2 model.
- Position Prediction: The model performs a single forward pass, directly outputting a continuous predicted encoder position Ppred.
- Actuation: The focus controller receives the Ppred command and drives the motor to rapidly position the internal focusing lens near the predicted location (see the sketch below).
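An illustrative version of this four-step pipeline. capture_chart_image() and move_focus_motor() are hypothetical hardware wrappers, and clamping the prediction to the 0–512 encoder range is an assumption consistent with the boundary checks in the search algorithms below.

```python
import torch

@torch.no_grad()
def coarse_focus(model, capture_chart_image, move_focus_motor, device="cpu"):
    model.eval()
    img = capture_chart_image()                         # (1, 512, 512) float tensor
    p_pred = model(img.unsqueeze(0).to(device)).item()  # single forward pass
    p_pred = max(0.0, min(512.0, p_pred))               # clamp to encoder range
    move_focus_motor(round(p_pred))                     # drive lens near prediction
    return p_pred
```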
3.1.4. Quantitative Evaluation of Model Performance
- Mean Absolute Error (MAE): 8.5 encoder units
- Root Mean Square Error (RMSE): 11.2 encoder units
- Coefficient of Determination (R2): 0.987
- Maximum Prediction Error: 28 encoder units
- Prediction error within the 95% confidence interval: ≤18 encoder units (a sketch for computing these metrics follows)
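The reported metrics can be reproduced from test-set outputs as below; y_true and y_pred are assumed NumPy arrays of encoder positions, and the 95% figure is interpreted here as the 95th-percentile absolute error, which is an assumption about the paper's definition.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    err = y_pred - y_true
    mae = np.mean(np.abs(err))                     # mean absolute error
    rmse = np.sqrt(np.mean(err ** 2))              # root mean square error
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    p95 = np.percentile(np.abs(err), 95)           # 95th-percentile absolute error
    return mae, rmse, r2, p95
```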
3.2. Search-Based Fine and Ultra-Fine Focusing
3.2.1. Image Sharpness Evaluation Function
- SMSG demonstrates good unimodality when applied to a built-in high-contrast target;
- Its peak width, measured by the full width at half maximum (FWHM), is moderate. This provides clear peak localization while avoiding the instability that can arise from overly narrow peaks;
- It achieves the highest signal-to-noise ratio (38.5 dB), indicating strong robustness against sensor noise;
- Its computation time (2.3 ms per frame) meets real-time requirements. One plausible formulation of the metric is sketched after this list.
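The section does not spell out the SMSG formula. The sketch below assumes a sum-of-squared-gradients measure (first differences in both directions, normalized by pixel count), in the same family as the Brenner and Tenengrad operators it is compared against; treat it as a stand-in, not the paper's exact definition.

```python
import numpy as np

def compute_smsg(image: np.ndarray) -> float:
    img = image.astype(np.float64)
    gx = img[:, 1:] - img[:, :-1]      # horizontal first-order gradient
    gy = img[1:, :] - img[:-1, :]      # vertical first-order gradient
    # Sum of squared gradients, normalized so values are comparable across ROIs.
    return (np.sum(gx ** 2) + np.sum(gy ** 2)) / img.size
```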
3.2.2. Fine Focusing Stage Algorithm
Algorithm 1. Parameter Definitions:
- Initial step size Sfine = 8 encoder units

Algorithm 2. Fine Focusing Stage
1. Pcurrent ← Ppred // Start from the coarse-stage prediction
2. Sharpnessprev ← ComputeSMSG(CaptureImage(Pcurrent))
3. Sharpnessmax ← Sharpnessprev
4. Pbest ← Pcurrent
5. direction ← −1 // Initially search in the direction of decreasing encoder value
6. step ← Sfine
7. iteration ← 0
8. WHILE iteration < Nmax DO
9. Pnext ← Pcurrent + direction × step
10. IF Pnext < 0 OR Pnext > 512 THEN
11. direction ← −direction // Reverse direction at boundary
12. CONTINUE
13. END IF
14.
15. Sharpnesscurrent ← ComputeSMSG(CaptureImage(Pnext))
16. ΔSharpness ← (Sharpnesscurrent − Sharpnessprev) / Sharpnessprev
17.
18. IF Sharpnesscurrent > Sharpnessmax THEN
19. Sharpnessmax ← Sharpnesscurrent
20. Pbest ← Pnext
21. END IF
22.
23. IF ΔSharpness < −ε THEN // Sharpness decline detected, peak passed
24. BREAK // Proceed to the ultra-fine stage
25. ELSE IF |ΔSharpness| < δ THEN // Sharpness change is flat
26. step ← step × α // Reduce step size
27. IF step < 1 THEN step ← 1 END IF
28. END IF
29.
30. Sharpnessprev ← Sharpnesscurrent
31. Pcurrent ← Pnext
32. iteration ← iteration + 1
33. END WHILE
34. RETURN Pbest, Sharpnessmax
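A Python rendering of Algorithm 2 for reference; capture_image() and compute_smsg() are the capture and sharpness hooks sketched above. The fine-stage values of eps, delta, alpha, and n_max are assumed defaults, since the text states only the initial step size (8 units) for this stage.

```python
def fine_focus(p_start, capture_image, compute_smsg,
               s_fine=8, alpha=0.5, eps=0.01, delta=0.01, n_max=20):
    p_current = p_start
    sharp_prev = compute_smsg(capture_image(p_current))
    sharp_max, p_best = sharp_prev, p_current
    direction, step = -1, s_fine             # search toward lower encoder values first
    for _ in range(n_max):
        p_next = p_current + direction * step
        if p_next < 0 or p_next > 512:       # reverse direction at travel limits
            direction = -direction
            continue
        sharp = compute_smsg(capture_image(p_next))
        d_rel = (sharp - sharp_prev) / sharp_prev
        if sharp > sharp_max:
            sharp_max, p_best = sharp, p_next
        if d_rel < -eps:                     # sharpness fell: peak passed
            break                            # hand over to the ultra-fine stage
        if abs(d_rel) < delta:               # plateau: shrink the step
            step = max(1, step * alpha)
        sharp_prev, p_current = sharp, p_next
    return p_best, sharp_max
```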
3.2.3. Ultra-Fine Focusing Stage Algorithm
- Using a smaller initial step size for higher positioning accuracy;
- Reversing the search direction, approaching the peak from the opposite side of the sharpness curve to verify and precisely lock onto the optimal position.
- Algorithm Parameter Definitions:
- Initial step size Sultra = 2 encoder units (one-fourth of the fine stage step size)
- Step decay factor α = 0.5
- Sharpness change threshold δ = 0.005 (stricter convergence criterion)
- Peak detection threshold ε = 0.005
- Maximum iteration steps Nmax = 10
- Minimum step size Smin = 1 encoder unit
Algorithm 3. Ultra-Fine Focusing Stage
1. Pcurrent ← Pfine // Start from the fine-stage result
2. Sharpnessmax ← Sharpnessfine
3. Poptimal ← Pfine
4. direction ← +1 // Reverse search direction (opposite to the fine stage)
5. step ← Sultra
6. iteration ← 0
7. consecutivedecline ← 0
8. WHILE iteration < Nmax AND step >= Smin DO
9. Pnext ← Pcurrent + direction × step
10.
11. IF Pnext < 0 OR Pnext > 512 THEN
12. BREAK // Boundary reached, terminate search
13. END IF
14.
15. Sharpnesscurrent ← ComputeSMSG(CaptureImage(Pnext))
16. ΔSharpness ← (Sharpnesscurrent − Sharpnessmax) / Sharpnessmax
17.
18. IF Sharpnesscurrent > Sharpnessmax THEN
19. Sharpnessmax ← Sharpnesscurrent
20. Poptimal ← Pnext
21. consecutivedecline ← 0
22. ELSE IF ΔSharpness < −ε THEN
23. consecutivedecline ← consecutivedecline + 1
24. IF consecutivedecline >= 2 THEN
25. BREAK // Two consecutive declines confirm passing the peak
26. END IF
27. step ← step × α // Reduce step size and retry
28. END IF
29.
30. IF |ΔSharpness| < δ AND step > Smin THEN
31. step ← step × α // Reduce step size when change is flat
32. END IF
33.
34. Pcurrent ← Pnext
35. iteration ← iteration + 1
36. END WHILE
37. MoveToPosition(Poptimal) // Move to the optimal position
38. RETURN Poptimal
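A Python rendering of Algorithm 3, with parameter defaults taken from the definitions listed above; move_to_position() is the same hypothetical motor hook used in the earlier sketches.

```python
def ultra_fine_focus(p_fine, sharp_fine, capture_image, compute_smsg,
                     move_to_position, s_ultra=2, alpha=0.5,
                     eps=0.005, delta=0.005, n_max=10, s_min=1):
    p_current, p_optimal, sharp_max = p_fine, p_fine, sharp_fine
    direction, step = +1, s_ultra            # approach the peak from the other side
    iteration, consecutive_decline = 0, 0
    while iteration < n_max and step >= s_min:
        p_next = p_current + direction * step
        if p_next < 0 or p_next > 512:       # boundary reached, terminate search
            break
        sharp = compute_smsg(capture_image(p_next))
        d_rel = (sharp - sharp_max) / sharp_max
        if sharp > sharp_max:
            sharp_max, p_optimal = sharp, p_next
            consecutive_decline = 0
        elif d_rel < -eps:
            consecutive_decline += 1
            if consecutive_decline >= 2:     # two declines confirm the peak was passed
                break
            step *= alpha                    # shrink the step and retry
        if abs(d_rel) < delta and step > s_min:
            step *= alpha                    # shrink the step on a plateau
        p_current = p_next
        iteration += 1
    move_to_position(p_optimal)              # settle at the best position found
    return p_optimal
```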
- Fine Stage: This stage performs a coarse-grained hill-climbing search (initial step size: 8 units) aimed at rapidly approaching the peak region while tolerating a certain degree of overshoot.
- Ultra-Fine Stage: This stage performs fine-grained local refinement (initial step size: 2 units), approaching from the opposite direction to precisely lock onto the peak and minimize overshoot. An end-to-end sketch chaining the three stages follows.
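Chaining the stages in the order described in Sections 3.1.3 and 3.2 gives the following driver; all hardware hooks remain the hypothetical wrappers introduced in the earlier sketches.

```python
def autofocus(model, capture_chart_image, capture_image,
              move_focus_motor, move_to_position):
    # Coarse: single CNN prediction positions the lens near the peak region.
    p_coarse = coarse_focus(model, capture_chart_image, move_focus_motor)
    # Fine: coarse-grained hill-climbing approaches the peak, tolerating overshoot.
    p_best, sharp_max = fine_focus(p_coarse, capture_image, compute_smsg)
    # Ultra-fine: reversed, small-step refinement locks onto the peak.
    return ultra_fine_focus(p_best, sharp_max, capture_image,
                            compute_smsg, move_to_position)
```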
4. Analysis of System Operational Results
4.1. Aerial Camera Platform Parameters
4.2. Comparison of Focusing Algorithm Execution Processes
4.2.1. Ablation Study
- Configuration A: Traditional hill-climbing method only (full-range search, without CNN assistance)
- Configuration B: CNN coarse focusing only (evaluates the absolute error from direct locking after a single prediction)
- Configuration C: The proposed hybrid method (CNN coarse focusing + hill-climbing fine search)
- Configuration A (hill-climbing only) achieves relatively high accuracy but is time-consuming, with a success rate of only 85% (15% of trials became trapped in local extrema).
- Configuration B (CNN only) is the fastest but lacks sufficient accuracy to meet high-quality imaging requirements.
- Configuration C (hybrid method) achieves the best overall performance: it reduces focusing time by roughly 60% relative to pure hill-climbing, matches its accuracy, and achieves a 100% success rate.
4.2.2. Accuracy and Stability Evaluation
4.2.3. End-to-End Timing Measurements
4.3. Laboratory Validation of the Focusing Algorithm
- Encoder Position Timing Curve: Figure 13 shows the timing curve of the lens encoder position during the dynamic test. After initial convergence (approximately 700 ms), the encoder position exhibits minor fluctuations (±1.5 encoder units) around the optimal focus, indicating the system’s good dynamic tracking capability.
- Focus Jitter Metric: The standard deviation of the sharpness values across 50 consecutive frames under dynamic conditions is 0.012, which is only twice that under static conditions (0.006). This finding indicates that the motion-induced focus jitter is within an acceptable range.
- Response Speed Metric: The average time required for the system to first achieve over 95% of the peak sharpness from focus trigger initiation is 623 ms (static) and 687 ms (dynamic). The response speed under dynamic conditions decreases by only about 10%.
4.4. Field Testing for Maritime Photography
- Flight speed: 33.81 m/s
- Flight altitude: 3156 m
- Sea state: Level 2–3 (slight to moderate waves)
- Weather conditions: Cloudy, visibility > 10 km
- Test duration: Approximately 45 min
- Number of captured images: 127
- The average SMSG sharpness during flight (0.847) reached 97.9% of the manual optimal focus reference value (0.865), indicating that the system can maintain a focus performance close to optimal in real maritime conditions.
- A total of 94.5% of the images had a sharpness value exceeding 0.8 (the preset high-quality threshold), demonstrating the system’s good stability.
- The standard deviation of the sharpness values was only 0.032, indicating consistent focusing quality throughout the flight.
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Model | Parameters (M) | FLOPs (G) | Inference Time on RK3588J (ms) | Test Set MAE (Encoder Units) | Test Set R2 |
|---|---|---|---|---|---|
| MobileNetV2 | 2.3 | 0.32 | 38 | 8.5 | 0.987 |
| EfficientNet-Lite0 | 4.7 | 0.41 | 52 | 7.9 | 0.989 |
| ShuffleNetV2 | 1.4 | 0.15 | 28 | 11.2 | 0.978 |
| MobileViT-XXS | 1.3 | 0.42 | 67 | 9.1 | 0.984 |
| Layer/Block Type | Configuration | Output Size |
|---|---|---|
| Input | - | 512 × 512 × 1 |
| Conv2D | 3 × 3, stride = 2 | 256 × 256 × 32 |
| Inverted Residual Block 1 | 3 × 3, stride = 2 (Downsample) | 128 × 128 × 16 |
| Inverted Residual Block 2 | 3 × 3, stride = 2 (Downsample) | 64 × 64 × 24 |
| Inverted Residual Block 3 | 3 × 3, stride = 2 (Downsample) | 32 × 32 × 32 |
| Inverted Residual Block 4 | 3 × 3, stride = 1 | 32 × 32 × 64 |
| Inverted Residual Block 5 | 3 × 3, stride = 2 (Downsample) | 16 × 16 × 96 |
| Inverted Residual Block 6 | 3 × 3, stride = 1 | 16 × 16 × 160 |
| Inverted Residual Block 7 | 3 × 3, stride = 1 | 16 × 16 × 320 |
| Conv2D (1 × 1) | 1 × 1 | 16 × 16 × 1280 |
| Regression Head | | |
| Global Average Pooling | - | 1 × 1 × 1280 |
| Fully Connected (ReLU) | 512 units | 512 |
| Output (Linear) | 1 unit | 1 |
| Parameter | Value |
|---|---|
| Number of raw images | 1280 |
| Total images after augmentation | 12,000 |
| Encoder position range | 0–512 |
| Number of discrete sampling positions | 128 |
| Sampling interval | 4 encoder units |
| Image resolution | 512 × 512 pixels |
| Train/Val/Test split | 8:1:1 |
| Parameter | Specification |
|---|---|
| Training Set Size | 9600 images |
| Validation Set Size | 1200 images |
| Test Set Size | 1200 images |
| Hardware Environment | NVIDIA GeForce RTX 4090D |
| Software Framework | PyTorch 1.12.0 |
| Total Epochs | 200 |
| Deployment Platform | Rockchip RK3588J |
| Metric | Value | Physical Meaning |
|---|---|---|
| MAE | 8.5 encoder units | 42.5 µm |
| RMSE | 11.2 encoder units | 56 µm |
| R2 | 0.987 | — |
| Maximum Error | 28 encoder units | 140 µm |
| System Depth of Field | 20 encoder units | 100 µm |
| Error/DoF Ratio | 42.5% | Within the capture range |
| Metric | Unimodality | Peak Width (FWHM) | SNR (dB) | Single-Frame Computation Time (ms) |
|---|---|---|---|---|
| SMSG | Yes | 12 encoder units | 38.5 | 2.3 |
| Brenner | Yes | 18 encoder units | 32.1 | 1.8 |
| Tenengrad | Yes | 14 encoder units | 36.2 | 2.1 |
| Laplacian Var. | Yes | 10 encoder units | 28.7 | 2.5 |
| Parameter Name | Parameter Value |
|---|---|
| Imaging Method | Area-array imaging |
| Imaging Device Resolution | 8424 × 6032 |
| Pixel Size | 4.6 μm |
| Focal Length | 126 mm |
| System Depth of Field | 100 μm |
| Capture Cycle | 250 ms |
| Gray-Scale Resolution | 12-bit |
| Encoder Resolution | 512 positions (full focus travel) |
| Focusing Servo Mechanism | Brushless motor with encoder |
| Main Control System | Embedded RK3588J |
| Configuration | Average Steps | Average Time (ms) | Final Position Error (Encoder Units) | Final SMSG Value | Success Rate |
|---|---|---|---|---|---|
| A: Hill-climbing only | 23.2 ± 3.1 | 1856 ± 248 | 1.2 ± 0.8 | 0.982 ± 0.012 | 85% |
| B: CNN only | 1 | 45 ± 3 | 8.5 ± 4.2 | 0.891 ± 0.067 | 62% |
| C: Hybrid method | 9.1 ± 1.4 | 728 ± 112 | 0.9 ± 0.6 | 0.989 ± 0.008 | 100% |
| Metric | Traditional Hill-Climbing Method | Hybrid Method | Improvement |
|---|---|---|---|
| Final Position Repeatability Error (σ) | 2.1 encoder units | 1.3 encoder units | 38% reduction |
| Mean Final SMSG Value | 0.982 | 0.989 | 0.7% increase |
| Standard Deviation of SMSG Value | 0.015 | 0.008 | 47% reduction |
| Success Rate (convergence within ±2 units) | 85% | 100% | +15 percentage points |
| Rate of Falling into Local Extrema | 15% | 0% | Eliminated |
| Stage | Average Time Cost | Standard Deviation |
|---|---|---|
| Single inference of the CNN model | 38 ms | ±3 ms |
| Single image capture | 33 ms | ±2 ms |
| Single SMSG computation | 2.3 ms | ±0.2 ms |
| Single motor movement response | 45 ms | ±5 ms |
| Total time for coarse focusing stage | 116 ms | ±8 ms |
| Total time for fine focusing stage | 402 ms | ±45 ms |
| Total time for ultra-fine focusing stage | 241 ms | ±32 ms |
| Total end-to-end time for the full process | 728 ms | ±112 ms |
| Test Condition | SMSG Sharpness Value | Position Stability (σ) | Convergence Time (ms) | Inter-Frame Sharpness Std Dev |
|---|---|---|---|---|
| Static | 0.991 | 0.8 encoder units | 685 | 0.006 |
| Dynamic (3°, 0.5 Hz) | 0.983 | 1.4 encoder units | 742 | 0.012 |
| Performance Retention | 99.2% | - | 92.3% | - |
| Metric | Value |
|---|---|
| Average SMSG Sharpness during Flight | 0.847 |
| Standard Deviation of Sharpness Values | 0.032 |
| Maximum Sharpness Value | 0.912 |
| Minimum Sharpness Value | 0.781 |
| Manual Optimal Focus Reference Sharpness | 0.865 |
| Sharpness Retention (relative to manual optimal) | 97.9% |
| Proportion of Images with Sharpness > 0.8 | 94.5% |