# Lightweight and Energy-Aware Monocular Depth Estimation Models for IoT Embedded Devices: Challenges and Performances in Terrestrial and Underwater Scenarios

## Abstract


## 1. Introduction

- We propose two lightweight MDE deep learning models that are able to achieve accurate estimations and fast inference frequencies on two embedded IoT devices;
- We conduct a feasibility study of such architectures in underwater scenes;
- We conduct a series of energy measurements during the inference of the proposed models to statistically assess the differences in their average power consumption.

## 2. Related Works

#### 2.1. Lightweight Terrestrial Monocular Depth Estimation

#### 2.2. Underwater Depth Estimation

#### 2.3. Energy-Oriented Models

## 3. Proposed Method

## 4. Implementation Details

## 5. Results

#### 5.1. Encoders

#### 5.2. Decoder

- UpConv2D: a $2\times 2$ up-sampling layer followed by a $3\times 3$ convolution.
- UpDSConv2D: a $2\times 2$ up-sampling layer followed by a $3\times 3$ depthwise separable convolution.
- NNConv5 [12]: a $5\times 5$ convolution followed by a $2\times 2$ up-sampling with the nearest-neighbor interpolation.
- TConv2D: a $3\times 3$ transposed convolution followed by a $3\times 3$ convolution.
- TDSConv2D: a $3\times 3$ transposed convolution followed by a $3\times 3$ depthwise separable convolution.
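As an illustration, the two depthwise-separable up-sampling blocks could be sketched in Keras roughly as follows; the padding, stride, and activation choices are assumptions for the sketch, not the exact configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers


def up_ds_conv2d(x, filters):
    # UpDSConv2D: 2x2 up-sampling followed by a 3x3 depthwise separable convolution
    x = layers.UpSampling2D(size=(2, 2))(x)
    return layers.SeparableConv2D(filters, kernel_size=3,
                                  padding="same", activation="relu")(x)


def t_ds_conv2d(x, filters):
    # TDSConv2D: 3x3 transposed convolution (stride 2 doubles the spatial
    # resolution) followed by a 3x3 depthwise separable convolution
    x = layers.Conv2DTranspose(filters, kernel_size=3,
                               strides=2, padding="same")(x)
    return layers.SeparableConv2D(filters, kernel_size=3,
                                  padding="same", activation="relu")(x)
```

Both blocks double the spatial resolution of the incoming feature map; they differ in whether the up-sampling itself is learned (transposed convolution) or fixed (nearest-neighbor/bilinear up-sampling).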

- The use of depthwise separable convolutions boosts the frame rate by $150\%$ to $200\%$ with respect to classical convolutions.
- The transposed convolution layers achieve speeds comparable to up-sampling operations (the difference is less than 2 fps) while improving the average RMSE by almost $3\%$.
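The speed-up from depthwise separable convolutions follows from their much lower multiply-accumulate (MAC) cost; a back-of-the-envelope comparison, with illustrative shapes rather than the actual decoder dimensions:

```python
def conv_macs(h, w, cin, cout, k=3):
    # MACs of a standard k x k convolution ("same" padding, stride 1)
    return h * w * cin * cout * k * k


def ds_conv_macs(h, w, cin, cout, k=3):
    # Depthwise separable = depthwise k x k + pointwise 1 x 1 convolution
    return h * w * cin * k * k + h * w * cin * cout


# Illustrative example: a 48 x 64 feature map with 64 input and 64 output channels
std = conv_macs(48, 64, 64, 64)    # 113,246,208 MACs
ds = ds_conv_macs(48, 64, 64, 64)  # 14,352,384 MACs
print(f"standard/ds ratio: {std / ds:.1f}x")  # ~7.9x fewer MACs
```

The measured frame-rate gain is smaller than the theoretical MAC ratio because memory traffic and the non-convolutional layers also contribute to latency.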

#### 5.3. Input–Output Resolution

#### 5.4. Feasibility Study in Underwater Settings

## 6. Energy Assessment

#### 6.1. Current and Voltage Footprint

After the A phase, in which the device is idle before the benchmark script starts, the B phase is a sleep interval (implemented with `time.sleep()`) used to highlight in the charts when the inference process starts; then, the C phase covers the inference, and finally, in the D phase, the device returns to the idle state. We first observe that the sleep phase has an instantaneous current absorption of 700 mA, slightly higher than the 680 mA of the idle phase. Then, during the inference, each analyzed frame can be easily distinguished, because the model runs at about 4.60 FPS: every frame inference begins with a current drop of about 100 mA on average, due to the passage to the next frame and the tensor allocation for the inference. The instantaneous power consumption during the inference is, on average, 4.589 W.
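A minimal sketch of how such a phased benchmark loop could be structured; `infer_fn`, the frame handling, and the default timings are placeholders standing in for the actual script (where `infer_fn` would wrap a TFLite interpreter `invoke()` call).

```python
import time


def run_benchmark(infer_fn, frames, sleep_s=3.0, duration_s=20.0):
    """Phase B: sleep, so the current trace shows a clear marker before
    inference starts; phase C: run inference continuously for duration_s
    seconds. Returns the average FPS over the inference window."""
    time.sleep(sleep_s)                       # phase B: visible plateau in the trace
    start, n = time.monotonic(), 0
    while time.monotonic() - start < duration_s:
        infer_fn(frames[n % len(frames)])     # phase C: one frame per iteration
        n += 1
    elapsed = time.monotonic() - start
    return n / elapsed
```

After the loop returns, the device naturally falls back to idle, which corresponds to the D phase in the trace.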

#### 6.2. Models Inference Energy Consumption

## 7. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

- Papa, L.; Russo, P.; Amerini, I. Real-time monocular depth estimation on embedded devices: Challenges and performances in terrestrial and underwater scenarios. In Proceedings of the 2022 IEEE International Workshop on Metrology for the Sea; Learning to Measure Sea Health Parameters (MetroSea), Milazzo, Italy, 3–5 October 2022; pp. 50–55.
- Li, Z.; Chen, Z.; Liu, X.; Jiang, J. DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation. arXiv **2022**, arXiv:2203.14211.
- Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision Transformers for Dense Prediction. arXiv **2021**, arXiv:2103.13413.
- Bhat, S.F.; Alhashim, I.; Wonka, P. AdaBins: Depth Estimation using Adaptive Bins. arXiv **2020**, arXiv:2011.14141.
- Alhashim, I.; Wonka, P. High Quality Monocular Depth Estimation via Transfer Learning. arXiv **2019**, arXiv:1812.11941.
- Kist, A.M. Deep Learning on Edge TPUs. arXiv **2021**, arXiv:2108.13732.
- Yazdanbakhsh, A.; Seshadri, K.; Akin, B.; Laudon, J.; Narayanaswami, R. An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks. arXiv **2021**, arXiv:2102.10423.
- Peluso, V.; Cipolletta, A.; Calimera, A.; Poggi, M.; Tosi, F.; Mattoccia, S. Enabling Energy-Efficient Unsupervised Monocular Depth Estimation on ARMv7-Based Platforms. In Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019.
- Poggi, M.; Aleotti, F.; Tosi, F.; Mattoccia, S. Towards real-time unsupervised monocular depth estimation on CPU. arXiv **2018**, arXiv:1806.11430.
- Peluso, V.; Cipolletta, A.; Calimera, A.; Poggi, M.; Tosi, F.; Aleotti, F.; Mattoccia, S. Monocular Depth Perception on Microcontrollers for Edge Applications. IEEE Trans. Circuits Syst. Video Technol. **2021**, 32, 1524–1536.
- Papa, L.; Alati, E.; Russo, P.; Amerini, I. SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings. IEEE Access **2022**, 10, 44881–44890.
- Wofk, D.; Ma, F.; Yang, T.J.; Karaman, S.; Sze, V. FastDepth: Fast Monocular Depth Estimation on Embedded Systems. arXiv **2019**, arXiv:1903.03273.
- Spek, A.; Dharmasiri, T.; Drummond, T. CReaM: Condensed Real-time Models for Depth Prediction using Convolutional Neural Networks. arXiv **2018**, arXiv:1807.08931.
- Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor Segmentation and Support Inference from RGBD Images. ECCV **2012**, 7576, 746–760.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012.
- Ye, X.; Li, Z.; Sun, B.; Wang, Z.; Xu, R.; Li, H.; Fan, X. Deep Joint Depth Estimation and Color Correction From Monocular Underwater Images Based on Unsupervised Adaptation Networks. IEEE Trans. Circuits Syst. Video Technol. **2020**, 30, 3995–4008.
- Gupta, H.; Mitra, K. Unsupervised Single Image Underwater Depth Estimation. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 624–628.
- Peng, Y.T.; Zhao, X.; Cosman, P.C. Single underwater image enhancement using depth estimation based on blurriness. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4952–4956.
- Drews, P.L.; Nascimento, E.R.; Botelho, S.S.; Montenegro Campos, M.F. Underwater Depth Estimation and Image Restoration Based on Single Images. IEEE Comput. Graph. Appl. **2016**, 36, 24–35.
- Daghero, F.; Pagliari, D.J.; Poncino, M. Chapter Eight: Energy-efficient deep learning inference on edge devices. In Hardware Accelerator Systems for Artificial Intelligence and Machine Learning; Kim, S., Deka, G.C., Eds.; Elsevier: Amsterdam, The Netherlands, 2021; Volume 122, pp. 247–301.
- Wang, Y.; Li, B.; Luo, R.; Chen, Y.; Xu, N.; Yang, H. Energy efficient neural networks for big data analytics. In Proceedings of the 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 24–28 March 2014; pp. 1–2.
- Lee, E.H.; Miyashita, D.; Chai, E.; Murmann, B.; Wong, S.S. LogNet: Energy-efficient neural networks using logarithmic computation. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5900–5904.
- Wang, X.; Magno, M.; Cavigelli, L.; Benini, L. FANN-on-MCU: An Open-Source Toolkit for Energy-Efficient Neural Network Inference at the Edge of the Internet of Things. IEEE Internet Things J. **2020**, 7, 4403–4417.
- Jiao, X.; Akhlaghi, V.; Jiang, Y.; Gupta, R.K. Energy-efficient neural networks using approximate computation reuse. In Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 19–23 March 2018; pp. 1223–1228.
- Tasoulas, Z.G.; Zervakis, G.; Anagnostopoulos, I.; Amrouch, H.; Henkel, J. Weight-Oriented Approximation for Energy-Efficient Neural Network Inference Accelerators. IEEE Trans. Circuits Syst. I Regul. Pap. **2020**, 67, 4670–4683.
- Berman, D.; Levy, D.; Avidan, S.; Treibitz, T. Underwater Single Image Color Restoration Using Haze-Lines and a New Quantitative Dataset. IEEE Trans. Pattern Anal. Mach. Intell. **2021**, 43, 2822–2837.
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv **2019**, arXiv:1905.02244.
- Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv **2019**, arXiv:1709.01507.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv **2017**, arXiv:1704.04861.
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
- Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv **2020**, arXiv:1905.11946.
- Zoph, B.; Vasudevan, V.; Shlens, J.; Le, Q. Learning Transferable Architectures for Scalable Image Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8697–8710.
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv **2014**, arXiv:1412.6980.
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Process. Syst. **2014**, 27, 2366–2374.
- Peng, Y.T.; Cosman, P.C. Underwater Image Restoration Based on Image Blurriness and Light Absorption. IEEE Trans. Image Process. **2017**, 26, 1579–1594.
- Ghose, S.; Yaglikçi, A.G.; Gupta, R.; Lee, D.; Kudrolli, K.; Liu, W.X.; Hassan, H.; Chang, K.K.; Chatterjee, N.; Agrawal, A.; et al. What your DRAM power models are not telling you: Lessons from a detailed experimental study. Proc. ACM Meas. Anal. Comput. Syst. **2018**, 2, 1–41.

**Figure 1.** Graphical overview with respective input shapes of the proposed encoder–decoder structure. The numbers of channels (c) for MobileNetV3${}_{S75}$ and MobileNetV3${}_{LMin}$ are, respectively, $[16,16,72,96]$ and $[16,64,72,240]$.

**Figure 2.** Graphical comparison between different MobileNetV3 configurations tested over the NYU Depth v2 dataset. The RMSE is reported over a different range in each graph, due to the respective error distributions, against frames per second (fps). The red dotted line represents the real-time frame rate, i.e., 30 fps, while the orange and light-blue dotted segments identify the best models. (**a**) The evaluated models for the ARM CPU, 32-bit floating point precision; the best configuration, MobileNetV3${}_{S75}$, is marked with an orange dot. (**b**) The evaluated models for the Edge TPU, 8-bit integer precision; the best configuration, MobileNetV3${}_{LMin}$, is marked with a light-blue dot. (**c**) The list of the compared models.

**Figure 3.** MobileNetV3${}_{S75}$ qualitative results over the SQUID dataset. For better viewing, the depth maps are all resized to the same image resolution and converted to RGB format with a perceptually uniform colormap (plasma-reversed) extracted from the ground truth. Yellow ground-truth depth-map pixels represent missing depth measurements.

**Figure 4.** Conceptual diagram of the measurement setup. Device A performs the inference, while Device B collects the current and voltage measurements from the INA226 sensor via the I2C protocol.
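A minimal sketch of how Device B could convert raw INA226 readings into voltage, current, and power. The register addresses and LSB scales below follow the sensor's datasheet (0x40 is its default I2C address); the shunt resistance and the bus-access details are assumptions about the measurement board, and `bus` stands for any smbus2-style object.

```python
INA226_ADDR = 0x40   # default I2C address of the INA226
REG_SHUNT_V = 0x01   # shunt voltage register, LSB = 2.5 uV
REG_BUS_V = 0x02     # bus voltage register, LSB = 1.25 mV


def read_reg(bus, reg):
    # The INA226 returns big-endian words, while SMBus word reads are
    # little-endian, so the two bytes are swapped here; `bus` is any object
    # exposing an smbus2-style read_word_data(addr, reg) method.
    raw = bus.read_word_data(INA226_ADDR, reg)
    return ((raw & 0xFF) << 8) | (raw >> 8)


def to_signed16(value):
    # The shunt voltage register is a signed two's-complement value
    return value - 0x10000 if value & 0x8000 else value


def sample(bus, shunt_ohm=0.1):
    # One (bus voltage [V], current [A], power [W]) sample; the 0.1 ohm
    # shunt value is an assumption, not the board's documented resistance
    shunt_v = to_signed16(read_reg(bus, REG_SHUNT_V)) * 2.5e-6
    bus_v = read_reg(bus, REG_BUS_V) * 1.25e-3
    current = shunt_v / shunt_ohm
    return bus_v, current, bus_v * current
```

With the `smbus2` library, `sample(SMBus(1))` would yield one voltage/current/power triple per call; logging these triples in a tight loop produces traces like those in Figures 5 and 6.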

**Figure 5.** Trace of instant current and voltage obtained during the inference test of the MobileNetV3${}_{LMin}$ model. The current signal is zoomed in to highlight its behavior during the inference of the single frames. The figure shows phase A, before running the benchmark script; phase B, a sleep phase of 3 s; phase C, in which the inference is performed continuously for 20 s; and, finally, phase D, where the system returns to idle.

**Figure 6.** Trace of instant current, voltage and power obtained during the inference test of the MobileNetV3${}_{LMin}$ model, but with the dataset entirely loaded in RAM (during phase B). The current signal is zoomed in to highlight its behavior during the inference of the single frames. The figure shows phase A, before running the benchmark script; phase B, in which the dataset is entirely loaded in RAM; phase C, a sleep phase of 3 s; phase D, in which the inference is performed continuously for 20 s; and, finally, phase E, where the system returns to idle.

**Figure 7.** Results of the Student's t-test on the average power consumption during inference (${P}_{{a}_{i}}$) of every model i. The figures show that models with a similar architecture have comparable power consumption, while others consume less or more power in a statistically significant way. The maximum difference in power consumption is on the order of 0.3 W. (**a**) Significance of the Student's t-test across all model combinations: a blue square means that, for two models i and j, the difference between ${P}_{{a}_{i}}$ and ${P}_{{a}_{j}}$ is statistically significant at the $p=0.05$ level. (**b**) The ${P}_{{a}_{i}}$ of every model listed in Table 6; the horizontal error bar has the same meaning as in Figure 7a and describes the range in which the value is not statistically distinguishable from the others.
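The pairwise comparison behind Figure 7a can be illustrated with Welch's t statistic (the unequal-variance form of the Student's t-test). The per-model samples below are synthetic, generated around three of the measured means purely to show the procedure, not the paper's actual measurement data.

```python
import math
import random
from itertools import combinations


def welch_t(a, b):
    # Welch's t statistic for two independent samples with unequal variances
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)


random.seed(0)
# Synthetic per-run samples of average inference power (W) for three models
power = {
    "NasNetMob.": [random.gauss(4.489, 0.05) for _ in range(100)],
    "Mob.NetV3_S75": [random.gauss(4.562, 0.05) for _ in range(100)],
    "Mob.NetV3_LMin": [random.gauss(4.611, 0.05) for _ in range(100)],
}

for (name_a, sa), (name_b, sb) in combinations(power.items(), 2):
    t = welch_t(sa, sb)
    # With ~100 samples per model, |t| > 1.96 approximates significance at p = 0.05
    verdict = "significant" if abs(t) > 1.96 else "not significant"
    print(f"{name_a} vs {name_b}: t = {t:.2f} ({verdict})")
```

Running every pairwise comparison and coloring the significant cells reproduces the structure of the matrix in Figure 7a.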

**Table 1.** Depth estimation comparison of lightweight pre-trained encoders (32-bit float) with TDSConv2D as the up-sampling block over the NYU Depth v2 dataset. The best results are in bold and the second-best results are underlined.

| Method | RMSE↓ [dm] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ | CPU↑ [fps] |
|---|---|---|---|---|
| Eff.NetB0 [32] | 6.01 | 0.179 | 0.728 | 4.4 |
| NasNetMob. [33] | 8.70 | 0.276 | 0.539 | 6.5 |
| Mob.NetV1 [30] | 5.70 | 0.165 | 0.760 | 8.8 |
| Mob.NetV2 [31] | 5.72 | 0.169 | 0.759 | 11.3 |
| Mob.NetV3${}_{S}$ [28] | 6.77 | 0.207 | 0.682 | 26.2 |
| Mob.NetV3${}_{L}$ [28] | 6.39 | 0.195 | 0.698 | 13.4 |

**Table 2.** Quantitative evaluation of the proposed models, in 32-bit floating point and 8-bit integer precision, inferring on the ARM CPU and the Edge TPU, respectively.

| Method | Type | RMSE↓ [dm] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ |
|---|---|---|---|---|
| Mob.NetV3${}_{S75}$ | 32-bit | 6.86 | 0.209 | 0.666 |
| Mob.NetV3${}_{LMin}$ | 8-bit | 11.54 | 0.429 | 0.412 |

**Table 3.** Comparison of different decoders with the proposed encoders tested over the NYU Depth v2 dataset. The first four metric columns refer to MobileNetV3${}_{S75}$ (32-bit float, ARM CPU) and the last four to MobileNetV3${}_{LMin}$ (8-bit int, Edge TPU).

| Up-Sampling Block | RMSE↓ [dm] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ | CPU↑ [fps] | RMSE↓ [dm] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ | TPU↑ [fps] |
|---|---|---|---|---|---|---|---|---|
| UpConv2D | 6.95 | 0.211 | 0.664 | 15.2 | 12.27 | 0.360 | 0.376 | 16.9 |
| UpDSConv2D | 7.02 | 0.212 | 0.660 | 31.1 | 13.55 | 0.399 | 0.283 | 28.3 |
| NNConv5 [12] | 7.69 | 0.236 | 0.608 | 6.8 | 21.86 | 0.654 | 0.030 | 9.5 |
| TConv2D | 6.88 | 0.204 | 0.669 | 18.1 | 17.63 | 0.836 | 0.218 | 17.5 |
| TDSConv2D | 6.86 | 0.209 | 0.666 | 29.6 | 11.54 | 0.429 | 0.412 | 26.8 |

**Table 4.** Comparison of the proposed encoder–decoder models with different input–output resolutions on the NYU Depth v2 dataset. The first four metric columns refer to MobileNetV3${}_{S75}$ (32-bit float, ARM CPU) and the last four to MobileNetV3${}_{LMin}$ (8-bit int, Edge TPU).

| Input | Output | RMSE↓ [dm] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ | CPU↑ [fps] | RMSE↓ [dm] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ | TPU↑ [fps] |
|---|---|---|---|---|---|---|---|---|---|
| $192\times 256$ | $192\times 256$ | 6.38 | 0.194 | 0.687 | 5.8 | 20.19 | 0.961 | 0.051 | 4.5 |
| $192\times 256$ | $96\times 128$ | 7.09 | 0.215 | 0.652 | 9.6 | 12.85 | 0.363 | 0.342 | 6.6 |
| $192\times 256$ | $48\times 64$ | 7.01 | 0.211 | 0.654 | 13.9 | 12.32 | 0.314 | 0.349 | 7.9 |
| $96\times 128$ | $96\times 128$ | 8.07 | 0.244 | 0.571 | 20.3 | 11.56 | 0.413 | 0.407 | 17.3 |
| $96\times 128$ | $48\times 64$ | 6.86 | 0.209 | 0.666 | 29.6 | 11.54 | 0.429 | 0.412 | 26.8 |

**Table 5.** Generalization capability comparison over the SQUID dataset [26].

| Method | RMSE↓ [m] | REL↓ | ${\mathit{\delta}}_{1}\uparrow$ | CPU↑ [fps] | Parameters [M] |
|---|---|---|---|---|---|
| DenseDepth [5] | 5.23 | 5.275 | 0.047 | $<1$ | 42.6 |
| FastDepth [12] | 5.17 | 5.493 | 0.055 | 2.0 | 3.9 |
| SPEED [11] | 4.49 | 4.732 | 0.088 | 6.2 | 2.6 |
| Mob.NetV3${}_{S75}$ | 4.49 | 4.956 | 0.089 | 29.6 | 1.1 |

**Table 6.** The list of all the tested models, sorted by ${P}_{{a}_{i}}$, the average power consumption of the Google Coral Dev Board in watts during the inference of model i. The low./mean/upp. columns report the mean of ${P}_{{a}_{i}}$ together with its confidence bounds at $p = 0.05$.

| i | Encoder | Decoder | In. | Out. | FPS | FLOPs | low. | mean | upp. |
|---|---|---|---|---|---|---|---|---|---|
| 1 | NasNetMob. | TDSConv2D | 96 × 128 | 48 × 64 | 6.76 | 290.4M | 4.468 | 4.489 | 4.510 |
| 2 | Mob.NetV3${}_{S75}$ | TDSConv2D | 96 × 128 | 96 × 128 | 18.25 | 52.2M | 4.501 | 4.513 | 4.525 |
| 3 | Mob.NetV3${}_{S75}$ | UpDSConv2D | 96 × 128 | 48 × 64 | 36.44 | 47.1M | 4.502 | 4.522 | 4.541 |
| 4 | Eff.NetB0 | TDSConv2D | 96 × 128 | 48 × 64 | 4.49 | 267.8M | 4.505 | 4.529 | 4.552 |
| 5 | Mob.NetV3${}_{SHD}$ | TDSConv2D | 96 × 128 | 48 × 64 | 41.54 | 36.3M | 4.509 | 4.529 | 4.550 |
| 6 | Mob.NetV3${}_{LHD}$ | TDSConv2D | 96 × 128 | 48 × 64 | 16.88 | 119.9M | 4.525 | 4.543 | 4.560 |
| 7 | Mob.NetV3${}_{LMinHD}$ | TDSConv2D | 96 × 128 | 48 × 64 | 21.52 | 107.8M | 4.533 | 4.554 | 4.575 |
| 8 | Mob.NetV3${}_{S}$ | TDSConv2D | 96 × 128 | 48 × 64 | 28.07 | 46.7M | 4.538 | 4.558 | 4.577 |
| 9 | Mob.NetV3${}_{S75}$ | TDSConv2D | 96 × 128 | 48 × 64 | 31.14 | 39.2M | 4.546 | 4.562 | 4.579 |
| 10 | Mob.NetV3${}_{S75}$ | TDSConv2D | 192 × 256 | 192 × 256 | 4.89 | 206.0M | 4.540 | 4.563 | 4.586 |
| 11 | Mob.NetV3${}_{LMin}$ | TDSConv2D | 96 × 128 | 96 × 128 | 13.08 | 124.5M | 4.541 | 4.563 | 4.586 |
| 12 | Mob.NetV3${}_{LMin}$ | TDSConv2D | 192 × 256 | 192 × 256 | 3.41 | 497.9M | 4.545 | 4.567 | 4.589 |
| 13 | Mob.NetV3${}_{LMin}$ | UpDSConv2D | 96 × 128 | 48 × 64 | 18.88 | 127.0M | 4.553 | 4.573 | 4.594 |
| 14 | Mob.NetV3${}_{L50}$ | TDSConv2D | 96 × 128 | 48 × 64 | 25.17 | 55.2M | 4.550 | 4.575 | 4.600 |
| 15 | Mob.NetV3${}_{L}$ | TDSConv2D | 96 × 128 | 48 × 64 | 14.10 | 133.5M | 4.558 | 4.576 | 4.595 |
| 16 | Mob.NetV3${}_{S50}$ | TDSConv2D | 96 × 128 | 48 × 64 | 35.88 | 27.4M | 4.561 | 4.579 | 4.598 |
| 17 | Mob.NetV2 | TDSConv2D | 96 × 128 | 48 × 64 | 11.59 | 180.9M | 4.566 | 4.584 | 4.601 |
| 18 | Mob.NetV3${}_{L75}$ | TDSConv2D | 96 × 128 | 48 × 64 | 17.38 | 100.3M | 4.565 | 4.589 | 4.613 |
| 19 | Mob.NetV3${}_{LMin}$ | TDSConv2D | 192 × 256 | 96 × 128 | 4.60 | 476.0M | 4.572 | 4.589 | 4.607 |
| 20 | Mob.NetV3${}_{SMin}$ | TDSConv2D | 96 × 128 | 48 × 64 | 33.45 | 39.7M | 4.580 | 4.606 | 4.632 |
| 21 | Mob.NetV3${}_{LMin}$ | TDSConv2D | 192 × 256 | 48 × 64 | 5.50 | 454.6M | 4.581 | 4.608 | 4.634 |
| 22 | Mob.NetV3${}_{LMin}$ | TDSConv2D | 96 × 128 | 48 × 64 | 17.33 | 119.0M | 4.582 | 4.611 | 4.639 |
| 23 | Mob.NetV1 | TDSConv2D | 96 × 128 | 48 × 64 | 9.22 | 309.1M | 4.602 | 4.629 | 4.656 |
| 24 | Mob.NetV3${}_{SHD}$ | TDSConv2D | 96 × 128 | 48 × 64 | 27.38 | 67.8M | 4.622 | 4.649 | 4.677 |
| 25 | Mob.NetV3${}_{LMin}$ | TConv2D | 96 × 128 | 48 × 64 | 11.27 | 240.9M | 4.629 | 4.655 | 4.680 |
| 26 | Mob.NetV3${}_{LMin}$ | UpConv2D | 96 × 128 | 48 × 64 | 10.42 | 297.6M | 4.638 | 4.659 | 4.681 |
| 27 | Mob.NetV3${}_{S75}$ | TConv2D | 96 × 128 | 48 × 64 | 16.95 | 136.4M | 4.642 | 4.668 | 4.694 |
| 28 | Mob.NetV3${}_{S75}$ | TDSConv2D | 192 × 256 | 96 × 128 | 4.69 | 542.9M | 4.650 | 4.678 | 4.706 |
| 29 | Mob.NetV3${}_{S75}$ | UpConv2D | 96 × 128 | 48 × 64 | 15.87 | 185.5M | 4.651 | 4.678 | 4.706 |
| 30 | Mob.NetV3${}_{S75}$ | TDSConv2D | 192 × 256 | 48 × 64 | 6.79 | 410.7M | 4.660 | 4.692 | 4.725 |
| 31 | Mob.NetV3${}_{L200}$ | TDSConv2D | 96 × 128 | 48 × 64 | 5.47 | 440.0M | 4.662 | 4.701 | 4.739 |
| 32 | Mob.NetV3${}_{S200}$ | TDSConv2D | 96 × 128 | 48 × 64 | 5.74 | 485.1M | 4.704 | 4.732 | 4.760 |
| 33 | Mob.NetV3${}_{LMin}$ | NNConv5 | 96 × 128 | 48 × 64 | 5.20 | 649.9M | 4.746 | 4.771 | 4.797 |
| 34 | Mob.NetV3${}_{S75}$ | NNConv5 | 96 × 128 | 48 × 64 | 6.85 | 479.3M | 4.750 | 4.780 | 4.810 |

**Table 7.** The Pearson correlation between the models' FLOPs, FPS, and ${P}_{{a}_{i}}$. (**) The correlation is significant at the 0.01 level (2-tailed).

|  |  | FLOPs | ${P}_{{a}_{i}}$ | FPS |
|---|---|---|---|---|
| FLOPs | Correlation | 1 | 0.319 ** | −0.812 ** |
|  | Sig. (2-tailed) |  | <0.001 | <0.001 |
|  | N | 3400 | 3400 | 3400 |
| ${P}_{{a}_{i}}$ | Correlation | 0.319 ** | 1 | −0.208 ** |
|  | Sig. (2-tailed) | <0.001 |  | <0.001 |
|  | N | 3400 | 3400 | 3400 |
| FPS | Correlation | −0.812 ** | −0.208 ** | 1 |
|  | Sig. (2-tailed) | <0.001 | <0.001 |  |
|  | N | 3400 | 3400 | 3400 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Papa, L.; Proietti Mattia, G.; Russo, P.; Amerini, I.; Beraldi, R.
Lightweight and Energy-Aware Monocular Depth Estimation Models for IoT Embedded Devices: Challenges and Performances in Terrestrial and Underwater Scenarios. *Sensors* **2023**, *23*, 2223.
https://doi.org/10.3390/s23042223
