PVConv: Enhancing Depthwise Separable Convolution via Preference-Value Learning for Similar-Feature Discrimination
Abstract
1. Introduction
2. Related Work
2.1. Evolution of Efficient CNN Architectures
2.2. Advances in Efficient Convolution Operators
2.3. Food and Beverage Object Detection
3. Materials and Methods
3.1. Depthwise Separable Convolution
- Depthwise Convolution (DWConv): Applies a K × K kernel to each input channel independently, focusing purely on spatial feature extraction without inter-channel mixing.
- Pointwise Convolution (PWConv): Performs a 1 × 1 convolution across all depthwise outputs to achieve inter-channel feature fusion.
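As a concrete illustration of the two stages, here is a plain-NumPy sketch (didactic only, not the framework implementation used in the paper; stride 1, no padding assumed):

```python
import numpy as np

def depthwise_conv(x, dw_kernels):
    """Depthwise stage: one K x K kernel per input channel, no channel mixing.
    x: (M, H, W), dw_kernels: (M, K, K)."""
    M, H, W = x.shape
    K = dw_kernels.shape[1]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((M, Ho, Wo))
    for m in range(M):
        for i in range(Ho):
            for j in range(Wo):
                out[m, i, j] = np.sum(x[m, i:i+K, j:j+K] * dw_kernels[m])
    return out

def pointwise_conv(x, pw_weights):
    """Pointwise stage: 1 x 1 convolution mixing channels at each position.
    x: (M, H, W), pw_weights: (N, M) -> output (N, H, W)."""
    return np.tensordot(pw_weights, x, axes=([1], [0]))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 6, 6))    # 8 input channels
dw = rng.standard_normal((8, 3, 3))   # one 3x3 kernel per channel
pw = rng.standard_normal((16, 8))     # 16 output channels
y = pointwise_conv(depthwise_conv(x, dw), pw)
print(y.shape)  # (16, 4, 4)
```

The factorization is what makes DSC cheap: this example needs 3 × 3 × 8 + 8 × 16 = 200 weights versus 3 × 3 × 8 × 16 = 1152 for a standard convolution of the same receptive field.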
3.2. Preference-Value Convolution
- Nonlinear preference activation: For each position (h, w) and input channel m, the feature value x_{b,m,h,w} (where b is the batch index) is modulated according to its proximity to the filter's preference value p_{jm}:

  a_{b,j,m,h,w} = f(x_{b,m,h,w} - p_{jm}),

  where f(·) emphasizes values close to the preferred range and attenuates others. The resulting activation map reflects how well the input aligns with each filter's learned preference.
- Linear aggregation: The preference-activated features are linearly combined using the filter weights w_{jm}:

  y_{b,j,h,w} = Σ_{m=1}^{M} w_{jm} · a_{b,j,m,h,w},

  where j indexes output channels.
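A minimal sketch of the two steps, assuming a Gaussian-type activation and a 1 × 1 receptive field; the per-filter preference values p_jm and aggregation weights w_jm are the learned parameters:

```python
import numpy as np

def pvconv(x, prefs, weights, f=lambda d: np.exp(-d ** 2)):
    """Preference-value convolution (sketch).
    x: (M, H, W) input feature map.
    prefs, weights: (J, M) learned preference values p_jm and weights w_jm.
    f: nonlinear preference activation (Gaussian-type here, as an assumption)."""
    J, M = weights.shape
    out = np.zeros((J,) + x.shape[1:])
    for j in range(J):
        # Step 1: activate each channel by its proximity to the preference value.
        act = f(x - prefs[j][:, None, None])             # (M, H, W)
        # Step 2: linearly aggregate the activated channels.
        out[j] = np.tensordot(weights[j], act, axes=1)   # (H, W)
    return out

x = np.full((4, 2, 2), 0.5)
prefs = np.full((3, 4), 0.5)   # input matches every preference exactly
weights = np.ones((3, 4))
y = pvconv(x, prefs, weights)
print(y[0, 0, 0])  # 4.0: f(0) = 1 for all four channels, summed
```

When the input equals the preference value everywhere, f(0) = 1 and the output reduces to the sum of the aggregation weights, i.e., the maximal response; inputs far from the preference are attenuated toward zero.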
3.3. Nonlinear Preference Activation Functions
3.3.1. Gaussian-Type Activation
3.3.2. Laplace-Type Activation
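Since the equations themselves are not reproduced in this outline, the two activation families can plausibly be written in their standard radial-basis forms, where p is the learned preference value and σ, β are assumed scale parameters:

```latex
f_{\mathrm{Gauss}}(x) = \exp\!\left( -\frac{(x - p)^{2}}{2\sigma^{2}} \right),
\qquad
f_{\mathrm{Laplace}}(x) = \exp\!\left( -\frac{\lvert x - p \rvert}{\beta} \right)
```

Both peak at x = p and decay monotonically with distance from the preference; the Gaussian form is smooth at the peak and decays faster, while the Laplace form has a sharper peak and heavier tails.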
3.4. Preference-Value Depthwise Separable Convolution
3.4.1. Main Branch: Pointwise Convolution
3.4.2. PVConv Auxiliary Branch
3.4.3. Batch Normalization
3.4.4. Dimensional Alignment
3.4.5. Preference-Based Feature Fusion
3.4.6. Parameter and Computational Cost of PVDSC
4. Experiments and Analysis
4.1. Experimental Framework
- YOLOv8-Conv: baseline with standard convolutions.
- YOLOv8-DSC: lightweight variant replacing eligible convolutions with DSC.
- YOLOv8-PVDSC: the proposed model, replacing DSC layers with PVDSC modules.
4.2. Dataset
Random Subsampling and Bias Mitigation
4.3. Experimental Environment and Settings
Preference Parameter Scaling
4.4. Evaluation Metrics
- Precision (P) and Recall (R):

  P = TP / (TP + FP),  R = TP / (TP + FN),

  where TP (true positives) and FP (false positives) are determined by both class prediction and Intersection over Union (IoU) with ground-truth boxes, and FN denotes false negatives.
- F1-score:

  F1 = 2 · P · R / (P + R)
- Average Precision (AP) and Mean Average Precision (mAP): For each class, the precision–recall curve is computed by varying the confidence threshold. AP is the area under the precision–recall curve:

  AP = ∫₀¹ P(R) dR.

  Mean Average Precision (mAP) is obtained by averaging the AP over all C classes:

  mAP = (1/C) Σ_{c=1}^{C} AP_c
  - mAP@50: IoU threshold = 0.5
  - mAP@50:95: average over IoU thresholds from 0.5 to 0.95 in steps of 0.05
- Model Efficiency: FLOPs and number of parameters
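The counting-based metrics above can be computed directly; the snippet below is a simplified sketch (IoU matching of predictions to ground-truth boxes is assumed to have been done upstream when classifying TP/FP/FN):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def average_precision(precisions, recalls):
    """AP as the area under the precision-recall curve (trapezoidal rule;
    assumes recalls are sorted in ascending order)."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * (precisions[i] + precisions[i - 1]) / 2
    return ap

p, r, f1 = detection_metrics(tp=8, fp=2, fn=2)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.8 0.8
```

mAP then averages `average_precision` over classes, and mAP@50:95 additionally averages over the ten IoU thresholds.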
4.5. Analysis of Experimental Results
4.5.1. Parameter and Computational Analysis
4.5.2. Analysis of Detection Performance
4.5.3. Analysis of Precision
4.5.4. F1-Score Curve Analysis
4.5.5. Analysis of Similar-Class and Background Detections
4.5.6. Generalization and Model Transfer Experiments
4.5.7. Evaluation on Additional Benchmark Datasets
4.5.8. Visualization of Detection Results and Heatmaps
5. Discussion and Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
| Component | Parameters | FLOPs |
|---|---|---|
| Depthwise convolution | K² × M | K² × M × H × W |
| Pointwise convolution | M × N | M × N × H × W |
| PVConv | 2 × M × N_mid | (n + 1) × M × N_mid × H × W |
| 1 × 1 convolution | N_mid × N | N_mid × N × H × W |
| Hyperparameter | Value |
|---|---|
| Epochs | 300 |
| Image size | 512 |
| Batch size | 32 |
| Optimizer | SGD |
| Initial learning rate (lr0) | 0.01 |
| Learning rate of pref (lr0_pref) | 0.1 |
| Learning rate factor (lrf) | 0.01 |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
| Layer | Input Size → Output Size | Conv Type | Parameters | FLOPs |
|---|---|---|---|---|
| Layer 3 | 64 × 160 × 160 → 128 × 80 × 80 | Conv | 73,728 | 943,718,400 |
| | | DSC | 8768 | 112,230,400 |
| | | PVDSC_L | 25,152 | 321,945,600 |
| | | PVDSC_G | 25,152 | 321,945,600 |
| Layer 5 | 128 × 80 × 80 → 256 × 40 × 40 | Conv | 294,912 | 943,718,400 |
| | | DSC | 33,920 | 108,544,000 |
| | | PVDSC_L | 99,456 | 318,259,200 |
| | | PVDSC_G | 99,456 | 318,259,200 |
| Layer 7 | 256 × 40 × 40 → 512 × 20 × 20 | Conv | 1,179,648 | 943,718,400 |
| | | DSC | 133,376 | 106,700,800 |
| | | PVDSC_L | 395,520 | 316,416,000 |
| | | PVDSC_G | 395,520 | 316,416,000 |
| Layer 16 | 128 × 80 × 80 → 128 × 40 × 40 | Conv | 147,456 | 471,859,200 |
| | | DSC | 17,536 | 56,115,200 |
| | | PVDSC_L | 42,112 | 147,865,600 |
| | | PVDSC_G | 42,112 | 147,865,600 |
| Layer 19 | 256 × 40 × 40 → 256 × 20 × 20 | Conv | 589,824 | 471,859,200 |
| | | DSC | 67,840 | 54,272,000 |
| | | PVDSC_L | 166,144 | 146,022,400 |
| | | PVDSC_G | 166,144 | 146,022,400 |
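As a sanity check, the component formulas from the cost table in Section 3.4.6 reproduce the DSC row of Layer 3 above; the factor of two per multiply-accumulate is an inferred assumption, chosen because it matches the tabulated FLOPs:

```python
# Layer 3 DSC: 64 -> 128 channels, 3x3 depthwise kernel, 80 x 80 output map.
K, M, N = 3, 64, 128
H, W = 80, 80

dw_params = K * K * M       # depthwise: one K x K kernel per input channel
pw_params = M * N           # pointwise: 1 x 1 channel mixing
dsc_params = dw_params + pw_params

# FLOPs counted as two operations (multiply + add) per MAC, evaluated
# once per output spatial position.
dsc_flops = 2 * dsc_params * H * W

print(dsc_params, dsc_flops)  # 8768 112230400 — matches the DSC row of Layer 3
```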
| Module | mAP@50 | mAP@50:95 | FLOPs (G) | Params (M) | Epochs | mAP@50 (Mean ± Std) | mAP@50:95 (Mean ± Std) | FPS (Ref.) | Memory (GB) |
|---|---|---|---|---|---|---|---|---|---|
| Conv | 0.930 | 0.773 | 28.7 | 11.14 | 205 | 0.9294 ± 0.0097 | 0.7703 ± 0.0053 | 108.42 | 6.56 |
| DSC | 0.917 | 0.766 | 25.3 | 9.11 | 226 | 0.9186 ± 0.0112 | 0.7633 ± 0.0064 | 125.85 | 6.66 |
| PVDSC_G | 0.923 | 0.775 | 26.18 | 9.58 | 261 | 0.9246 ± 0.0103 | 0.7691 ± 0.0060 | 112.85 | 6.96 |
| PVDSC_L | 0.920 | 0.776 | 26.18 | 9.58 | 297 | 0.9231 ± 0.0114 | 0.7710 ± 0.0066 | 114.15 | 6.96 |
| From → To | DSC | PVDSC_G | PVDSC_L |
|---|---|---|---|
| bottle-plastic → bottle-glass | 20 | 14 (↓30%) | 9 (↓55%) |
| gym-bottle → bottle-glass | 1 | 0 (↓100%) | 0 (↓100%) |
| background → bottle-glass | 34 | 19 (↓44%) | 24 (↓29%) |
| bottle-glass → bottle-plastic | 15 | 9 (↓40%) | 15 (—) |
| gym-bottle → bottle-plastic | 13 | 11 (↓15%) | 6 (↓54%) |
| background → bottle-plastic | 30 | 24 (↓20%) | 24 (↓20%) |
| bottle-glass → gym-bottle | 6 | 8 (↑33%) | 5 (↓17%) |
| bottle-plastic → gym-bottle | 5 | 3 (↓40%) | 4 (↓20%) |
| background → gym-bottle | 19 | 14 (↓26%) | 11 (↓42%) |
| Total False Positives | 143 | 102 (↓29%) | 98 (↓31%) |
| Model | mAP@50:95 | mAP@50 | Params (M) | FLOPs (G) |
|---|---|---|---|---|
| YOLOv5s | 0.76628 | 0.92263 | 9.125 | 24.1 |
| YOLOv5s-DSC | 0.74614 | 0.92087 | 7.103 | 20.7 |
| YOLOv5s-PVDSC | 0.76431 | 0.92345 | 7.568 | 21.57 |
| YOLOv10s | 0.78114 | 0.92331 | 8.073 | 24.8 |
| YOLOv10s-DSC | 0.76766 | 0.91241 | 7.875 | 22.8 |
| YOLOv10s-PVDSC | 0.77631 | 0.92031 | 8.342 | 23.6 |
| YOLOv11s | 0.77642 | 0.92148 | 9.431 | 21.6 |
| YOLOv11s-DSC | 0.75514 | 0.90324 | 7.083 | 16.6 |
| YOLOv11s-PVDSC | 0.77012 | 0.91913 | 7.589 | 17.7 |
| Dataset | Method | mAP@50 | mAP@50:95 | mAP@50 (Mean ± Std) | mAP@50:95 (Mean ± Std) |
|---|---|---|---|---|---|
| Bird Subset | Conv | 0.6898 | 0.5380 | 0.6738 ± 0.0113 | 0.5239 ± 0.0073 |
| | DSConv | 0.6699 | 0.5099 | 0.6515 ± 0.0119 | 0.5018 ± 0.0059 |
| | PVDSC_G | 0.6787 | 0.5275 | 0.6658 ± 0.0110 | 0.5143 ± 0.0084 |
| WaRP Dataset | Conv | 0.4583 | 0.3231 | 0.4467 ± 0.0123 | 0.3163 ± 0.0079 |
| | DSConv | 0.4512 | 0.3080 | 0.4346 ± 0.0107 | 0.3001 ± 0.0077 |
| | PVDSC_G | 0.4649 | 0.3238 | 0.4377 ± 0.0180 | 0.3136 ± 0.0103 |
Share and Cite
Peng, W.; Li, B.; Wang, P.; Huang, H.; Zou, Y.; Qiao, X. PVConv: Enhancing Depthwise Separable Convolution via Preference-Value Learning for Similar-Feature Discrimination. Electronics 2025, 14, 4978. https://doi.org/10.3390/electronics14244978

