Proceeding Paper

Recommender System for Apparel Products Based on Image Recognition Using Convolutional Neural Networks †

Chin-Chih Chang, Chi-Hung Wei, Yen-Hsiang Wang, Chyuan-Huei Thomas Yang and Sean Hsiao

1 Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu 30012, Taiwan
2 Ph.D. Program in Engineering Science, Chung Hua University, Hsinchu 30012, Taiwan
3 School of Information Engineering, Shandong Vocational and Technical University of International Studies, Rizhao 276826, China
4 Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan 48710, Taiwan
* Author to whom correspondence should be addressed.
† Presented at the 2024 IEEE 7th International Conference on Knowledge Innovation and Invention, Nagoya, Japan, 16–18 August 2024.
Eng. Proc. 2025, 89(1), 38; https://doi.org/10.3390/engproc2025089038
Published: 14 March 2025

Abstract

In e-commerce and fashion, personalized recommendations are used to enhance user experience and engagement. In this study, an apparel recognition and recommender system (ARRS) using convolutional neural networks (CNNs) was employed to analyze apparel images, extract features, and provide accurate recognition and recommendations. By learning patterns and features of clothes, the system enables robust recognition and personalized suggestions. The effectiveness of ARRS in recognizing apparel and generating relevant recommendations was validated. The system enhances user satisfaction and engagement on fashion e-commerce platforms.

1. Introduction

Recent advancements in e-commerce technologies have revolutionized online shopping. As online retailers grow and apparel choices multiply, consumers increasingly need personalized recommendations. To meet this demand, we applied deep learning, particularly convolutional neural networks (CNNs) [1], to improve apparel recognition and recommendation.
CNNs, inspired by visual perception, excel at extracting and learning intricate patterns from images, making them ideal for recognizing and categorizing apparel. We explored the design, implementation, and performance evaluation of an apparel recognition and recommender system (ARRS) using CNNs. By analyzing apparel images and extracting key features, the system generates personalized recommendations, aiming to enhance user satisfaction and engagement in fashion e-commerce [2,3]. The proposed system improves the efficiency and effectiveness of online fashion shopping.
This paper is structured as follows: Section 1 introduces the study, Section 2 presents related work, Section 3 outlines the research methodology, Section 4 discusses experiments and analysis, and Section 5 concludes the paper with future research directions.

2. Related Work

2.1. Deep Learning and CNN

Deep learning, a transformative force in artificial intelligence (AI), has revolutionized computer vision, natural language processing, and pattern recognition. Its success underpins innovations such as ChatGPT 3.5 and NVIDIA technologies [4,5]. At its core, deep learning uses neural networks, computational models inspired by the brain, consisting of interconnected layers that extract meaningful patterns from data [1].
CNNs, specialized for image analysis, apply learnable filters to capture patterns at different scales, such as edges and textures [2,3]. CNNs excel in autonomously learning hierarchical features, adjusting parameters through extensive training to accurately recognize complex objects. Key models, including ResNet50 [6,7], VGG19 [8], and YOLOv8 [9,10,11], exemplify CNNs’ effectiveness, with their comparison shown in Table 1.
ResNet50, a deep CNN known for its depth and effectiveness in computer vision, uses residual connections to avoid vanishing gradients, excelling on datasets such as ImageNet [6,7]. VGG19, with its 19-layer architecture, leverages small filters and max-pooling layers to capture complex patterns, making it effective for image classification tasks [8]. You Only Look Once (YOLO) is used for real-time object detection, predicting bounding boxes and class probabilities in a single pass, with YOLOv8 showing strong performance in both detection and classification [9,10,11]. In apparel recognition and recommendation, CNNs are vital for precise image analysis and personalized recommendations. Section 3 details their application in system design.

2.2. Recommender Systems

Recommender systems predict user preferences or ratings, playing a key role in e-commerce, music, and video platforms [12,13,14]. These systems use collaborative filtering, content-based filtering, or hybrids of the two to analyze data such as ratings and browsing history. Collaborative filtering looks at user or item similarities, while content-based filtering matches user preferences with item attributes. Hybrid approaches combine these methods for better accuracy. Machine and deep learning enhance these techniques for complex recommendations. Platforms including Amazon, Netflix, and Spotify use them to improve user experience and boost sales. In fashion e-commerce, combining CNN-based image analysis with collaborative filtering provides tailored apparel suggestions.

2.3. Recommenders Based on Deep Learning

Deep learning has become a powerful tool in recommendation systems, enabling personalized content suggestions through its ability to learn complex patterns from raw data [15,16]. Current approaches use neural networks alone or combined with traditional methods. For a more in-depth exploration of deep learning-based recommendation mechanisms, refer to [16].
We enhanced recommender systems with content filtering, collaborative filtering, and deep learning, particularly in integrating content-based features from textual, visual, or multimodal data sources. CNNs process apparel images, while natural language processing handles product descriptions or reviews, improving accuracy. The proposed system integrates deep learning for image analysis and collaborative filtering for recommendations, offering personalized suggestions in fashion e-commerce.

3. Research Methodology

3.1. System Design

This system contains three main modules: data learning, image recognition, and recommendation modules, as shown in Figure 1.
  • The data learning module collects, preprocesses, and trains models, storing the trained model and weights for future use.
  • The image recognition module processes image input and uses the trained model to predict categories.
  • The image recommendation module extracts features, calculates similarity with user ratings, and provides personalized recommendations.

3.2. Data Collection and Preprocessing

The apparel dataset consists of 20 categories of apparel from the 108 categories provided by kaggle.com. The sources of the data are shown in Figure 2 and are organized as follows:
  • The dataset comprises 20 categories of apparel specifically curated for this study.
  • It is sourced from the “Fashion Product Images Dataset”, specifically the 108-category dataset within it.
  • The dataset is a collaboration with Param Aggarwal (Owner).
  • The data originate from the Indian fashion e-commerce website: myntra.com.
  • Dataset source link is https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-dataset (accessed on 20 March 2024).
We collected preferences from 60 users, who viewed a clothing photo dataset of 20 categories, each containing 600 photos, as shown in Figure 3. Users selected 100 favorites per category, stored in “fashion_like_100”, then 10 favorites, stored in “fashion_like_10”, then 5, stored in “fashion_like_5”, and finally 1 favorite, stored in “fashion_like_1”. Preferences were saved in a CSV file of user IDs and liked item IDs and were used as ratings by the recommender system to generate suggestions.
In deep learning, data augmentation is key to expanding the training dataset [17]. By exposing the model to a wider range of data, it enhances the model’s ability to generalize, adapt to unseen data, and mitigate overfitting [18]. Data augmentation methods, such as scaling, shearing, and horizontal flipping, are commonly employed to achieve this. Scaling adjusts image sizes while preserving proportions, aiding the model in recognizing objects of varying sizes. Shearing slants image shapes so that the model can still recognize distorted objects. Horizontal flipping creates mirrored images, diversifying the data and enabling the model to recognize objects in different orientations.
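To make the three augmentation operations concrete, the following is a minimal sketch that applies them to a tiny image stored as a list of pixel rows. The function names, the nearest-neighbour sampling, and the zero fill value are illustrative assumptions; the paper does not state which library or parameters were used.

```python
def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def scale_nearest(image, factor):
    """Resize by the same factor in both directions (proportions preserved),
    using nearest-neighbour sampling (an assumed interpolation choice)."""
    height, width = len(image), len(image[0])
    new_h = max(1, round(height * factor))
    new_w = max(1, round(width * factor))
    return [
        [image[min(height - 1, int(y / factor))][min(width - 1, int(x / factor))]
         for x in range(new_w)]
        for y in range(new_h)
    ]


def shear_horizontal(image, shear, fill=0):
    """Shift each row sideways in proportion to its row index.
    Pixels pushed outside the frame are dropped; gaps get `fill`."""
    width = len(image[0])
    sheared = []
    for y, row in enumerate(image):
        offset = int(round(shear * y))
        new_row = [fill] * width
        for x, value in enumerate(row):
            if 0 <= x + offset < width:
                new_row[x + offset] = value
        sheared.append(new_row)
    return sheared


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)       # [[3, 2, 1], [6, 5, 4]]
enlarged = scale_nearest(image, 2)     # 4 rows x 6 columns
slanted = shear_horizontal(image, 1)   # [[1, 2, 3], [0, 4, 5]]
```

In practice these transforms would usually be applied randomly at training time by an image library, so that each epoch sees slightly different versions of the same photo.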

3.3. Model Design, Training, and Fine-Tuning

The proposed system adopts ResNet50, VGG19, YOLOv8n, and a self-built CNN as deep-learning training modules. ResNet50 and VGG19 were chosen for their established strengths, and YOLOv8n for its novelty. The self-built CNN takes 224 × 224 input images and comprises six layers. It initially splits into two branches: one with 64 3 × 3 convolutional kernels with a stride of 1, and another with 32 1 × 1 convolutional kernels. Following this, it integrates 64 3 × 3 convolutional kernels with a stride of 1, utilizing the ReLU activation function and max pooling. Subsequently, it concatenates the branches and processes the result with 128 3 × 3 convolutional kernels, the ReLU activation function, and max pooling. Finally, the 2D feature maps are flattened into 1D for output through fully connected layers. Figure 4 depicts this CNN architecture.
The implementation process of the learning procedure is as follows:
  • Dataset construction: A dataset consisting of 20 apparel categories is established, divided into a 60:20:20 ratio for training, validation, and testing, respectively.
  • ResNet50, VGG19, YOLOv8n, self-built CNN training: Each training module runs for 100 iterations, with 64 images processed per iteration, and validation is also conducted for 100 iterations with the same batch size.
  • Testing: Testing involves processing 64 images at a time, with the final output being the accuracy rate.
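The dataset split and batching described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names `split_dataset` and `batches`, the integer photo IDs, and the fixed shuffle seed are hypothetical, and the paper does not say whether the split was stratified by category.

```python
import random


def split_dataset(items, percents=(60, 20, 20), seed=0):
    """Shuffle `items` and split them into training, validation, and test lists."""
    if sum(percents) != 100:
        raise ValueError("percents must sum to 100")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * percents[0] // 100
    n_val = len(shuffled) * percents[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


def batches(items, batch_size=64):
    """Yield consecutive batches of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 20 categories x 600 photos, identified here by hypothetical integer IDs.
photo_ids = list(range(20 * 600))
train, val, test = split_dataset(photo_ids)
print(len(train), len(val), len(test))  # 7200 2400 2400
```

With 12,000 photos, the 60:20:20 ratio yields 7200 training, 2400 validation, and 2400 test images, and a batch size of 64 gives 113 training batches (the last one partially filled).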

3.4. Recommendation Engine

The recommendation module generates feature vectors using the selected model. In machine and deep learning, feature extraction is vital, transforming data into mathematical representations across multiple dimensions, each capturing distinct attributes or characteristics.
It calculates the similarity between the feature vector of the input photo and those of the photos in the database using cosine similarity, defined as
$$\text{Similarity} = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}} \times \sqrt{\sum_{i=1}^{n} B_i^{2}}}$$
where A and B are feature vectors [19,20,21].
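As an illustration of the formula, the sketch below computes cosine similarity for two small vectors in plain Python. The function name and the convention of returning 0.0 for a zero vector are assumptions made for this example; real feature vectors from a CNN would have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: a zero vector matches nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])  # close to 1
orthogonal = cosine_similarity([1, 0], [0, 1])            # 0.0
partial = cosine_similarity([3, 4], [4, 3])               # 24 / 25 = 0.96
```

Because the measure depends only on direction, two photos whose feature vectors differ only in magnitude are treated as identical.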
The average rating of each photo in the database is multiplied by its similarity to obtain a comprehensive score that reflects both similarity and rating. Because recommendations driven by this score alone favor rated items, the system can overlook items that have not yet been rated. Therefore, a mechanism must be adopted to tackle this issue. In the system, we consider item popularity. We developed the following strategies for the selected top-k recommendations.
  • For k = 1, items with high similarity scores but no ratings are prioritized at a 3% probability, with the remaining 97% of recommendations based on comprehensive scores.
  • For k = 5, items with high similarity scores but no ratings are prioritized with a 20% probability, with the remaining 80% of recommendations based on scores.
  • For k = 10, a 5% probability is allocated to include 1 item with a high similarity score but no rating in the recommendations, and a 90% probability to include 2 items with high similarity scores but no ratings. The remaining 5% of recommendations are based on comprehensive scores.
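The strategies above can be sketched as a small selection routine. This is a hedged illustration, not the authors' implementation: the `EXPLORATION` table, the `recommend` function, and the `(item_id, similarity, average_rating)` tuple layout are hypothetical, and the sketch assumes that a single unrated item is promoted for k = 1 and k = 5, which the text does not state explicitly.

```python
import random

# For each k: (probability, number of slots given to unrated items).
# Any remaining probability mass means "rank purely by comprehensive score".
EXPLORATION = {
    1: [(0.03, 1)],
    5: [(0.20, 1)],            # one unrated slot is an assumption
    10: [(0.05, 1), (0.90, 2)],
}


def recommend(candidates, k, rng=random):
    """Return up to k item IDs from (item_id, similarity, average_rating) tuples.

    average_rating is None for items nobody has rated yet.
    """
    # Comprehensive score = similarity x average rating (rated items only).
    rated = sorted((c for c in candidates if c[2] is not None),
                   key=lambda c: c[1] * c[2], reverse=True)
    # Unrated items can only be ranked by similarity.
    unrated = sorted((c for c in candidates if c[2] is None),
                     key=lambda c: c[1], reverse=True)

    draw = rng.random()
    slots, cumulative = 0, 0.0
    for probability, count in EXPLORATION.get(k, []):
        cumulative += probability
        if draw < cumulative:
            slots = count
            break

    picked = unrated[:min(slots, k)]
    picked += rated[:k - len(picked)]
    return [item_id for item_id, _, _ in picked]


candidates = [("a", 0.90, 4.0), ("b", 0.80, 5.0), ("c", 0.95, None),
              ("d", 0.70, None), ("e", 0.50, 2.0)]
top10 = recommend(candidates, 10)
```

Promoted unrated items are placed first in this sketch so that they are actually seen; where they appear in the final list is a design choice the paper leaves open.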

4. Experiments and Results

4.1. Experimental Device

The system was built using the following equipment, and the detailed development environment is presented in Table 2.

4.2. Model Training and Evaluation

In training deep learning networks, the accuracy rate varies as the number of iterations increases and ultimately saturates. Figure 5 and Figure 6 depict the training graphs for ResNet50 and the self-built CNN, respectively. The summary of training results is shown in Table 3. ResNet50, VGG19, and YOLOv8n show higher validation and test accuracy than the self-built CNN.

4.3. Recommendation Evaluation

In evaluation [19,20], precision and recall are key metrics. Precision is the ratio of true positives to the sum of true and false positives, indicating the model’s accuracy in predicting positive samples. Recall is the ratio of true positives to the sum of true positives and false negatives, reflecting the model’s ability to capture positive samples. F1-score is the harmonic mean of precision and recall, balancing both metrics. We use precision at k, recall at k, and F1-score as evaluation metrics, focusing on the top k recommended items to calculate precision and recall.
  • Precision at k is the proportion of successfully recommended items out of the total recommended items, where k is the total number of recommended items.
$$\text{Precision at } k = \frac{\#\ \text{of recommended items that have been rated}}{\#\ \text{of total recommended items}}$$
  • Recall at k is the proportion of all rated items that appear among the k recommended items.
$$\text{Recall at } k = \frac{\#\ \text{of items recommended and rated}}{\#\ \text{of all rated items}}$$
  • F1-score is the harmonic mean of precision and recall.
$$F_1\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
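The three metrics can be computed with a few lines of Python. The sketch below is illustrative only: the function names and the toy item IDs are assumptions, and "rated" items are represented as a set of item IDs that the user liked.

```python
def precision_at_k(recommended, rated, k):
    """Share of the top-k recommended items that the user has rated."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for item in top_k if item in rated)
    return hits / len(top_k)


def recall_at_k(recommended, rated, k):
    """Share of all rated items that appear in the top-k recommendations."""
    if not rated:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in rated)
    return hits / len(rated)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recommended = ["a", "b", "c", "d", "e"]   # hypothetical ranked output
rated = {"a", "c", "x", "y"}              # hypothetical liked items

p = precision_at_k(recommended, rated, 5)  # 2 hits / 5 recommended = 0.4
r = recall_at_k(recommended, rated, 5)     # 2 hits / 4 rated = 0.5
f1 = f1_score(p, r)                        # 0.4 / 0.9, about 0.444
```

As a consistency check on the reported results, `f1_score(0.96541, 0.09943)` evaluates to approximately 0.1803, which agrees with the ResNet50 Top-1 entries in Table 4, Table 5 and Table 6.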
The cosine similarity is used to compare four deep-learning models for collaborative filtering. Similarities are computed using a clothing dataset, and a comprehensive performance comparison is constructed using precision at k, recall at k, and F1-score. The summary of precision at k, recall at k, and F1-score is shown in Table 4, Table 5 and Table 6.
Top-10 recommended items present a more feasible solution. For the top-10 recommendations, the self-built CNN demonstrates the highest precision at k, recall at k, and F1-score. Excessively high image recognition accuracy may limit the variety of items for recommendation.

4.4. Computation Time

We apply four different training models individually to the recommendation system. Subsequently, we compare their operational speed. The summary of training time and recommendation time is shown in Table 7. YOLOv8n has the lowest training time, followed by the self-built CNN. YOLOv8n consistently has the lowest recommendation times across all top-K scenarios, followed by self-built CNN.

5. Conclusions and Future Work

We developed an efficient system that integrates apparel recognition with personalized recommendations using CNNs. While image feature extraction accuracy is important, excessively high accuracy may limit recommendation diversity. Among the four models tested, YOLOv8n was consistently the fastest, followed by the self-built CNN, with ResNet50 and VGG19 trailing in training speed. Self-built models train faster but with lower accuracy, highlighting the need to adjust model complexity based on specific requirements. The developed recommender system offers a foundation for exploring various deep learning frameworks and is adaptable to diverse image datasets.
It is still necessary to enhance recommendation precision by integrating contrastive language–image pretraining (CLIP), which can understand both text and images [22]. Fine-tuning CLIP on fashion descriptions and images can improve recommendation accuracy and enhance user satisfaction and engagement in fashion e-commerce.

Author Contributions

Conceptualization, C.-C.C.; Methodology, C.-C.C., C.-H.W., Y.-H.W. and C.-H.T.Y.; Writing—original draft preparation, C.-C.C.; writing—review and editing, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Kelleher, J.D. Deep Learning; The MIT Press: Cambridge, MA, USA, 2019.
  2. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  3. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
  4. Wu, T.; He, S.; Liu, J.; Sun, S.; Liu, K.; Han, Q.-L.; Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 2023, 10, 1122–1136.
  5. Mittal, S. A survey on optimized implementation of deep learning models on the NVIDIA Jetson platform. J. Syst. Archit. 2019, 97, 428–442.
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  7. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: New York, NY, USA, 2021; pp. 63–72.
  8. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
  10. Jocher, J.G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/yolov5: v3.0. Zenodo 2020.
  11. Terven, J.R.; Cordova-Esparza, D.M. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501.
  12. Jannach, D.; Zanker, M.; Felfernig, A.; Friedrich, G. Recommender Systems: An Introduction; Cambridge University Press: Cambridge, UK, 2010.
  13. Martin, F.J.; Donaldson, J.; Ashenfelter, A.; Torrens, M.; Hangartner, R. The big promise of recommender systems. AI Mag. 2011, 32, 19–27.
  14. Resnick, P.; Varian, H.R. Recommender Systems. Commun. ACM 1997, 40, 56–58.
  15. Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
  16. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 2020, 52, 1–38.
  17. Ko, B.; Ok, J. Time Matters in Using Data Augmentation for Vision-Based Deep Reinforcement Learning. arXiv 2021, arXiv:2102.08581.
  18. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 October–2 November 2019; pp. 6023–6032.
  19. Zuva, K.; Zuva, T. Evaluation of Information Retrieval Systems. Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 2012, 4, 35–43.
  20. Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining, 2nd ed.; Pearson: London, UK, 2018.
  21. Singhal, A. Modern Information Retrieval: A Brief Overview. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 2001, 24, 35–43.
  22. Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv 2021, arXiv:2110.05208.
Figure 1. System architecture.
Figure 2. Apparel data source.
Figure 3. Apparel dataset.
Figure 4. Self-built CNN model.
Figure 5. ResNet50 training graph.
Figure 6. Self-built CNN training graph.
Table 1. Comparison of CNN models.
| Model | ResNet50 | VGG19 | YOLOv8n |
|---|---|---|---|
| Architecture | 50 layers | 19 layers | 225–230 layers |
| Convolutional filters | 3 × 3 | 3 × 3 | Varying sizes |
| Max pooling | 2 × 2 | 2 × 2 | Various sizes |
| Fully connected layers | Minimal (often 1) | Large number | Minimal (might have none) |
| Top-5 accuracy on ImageNet | 97.33% | 97.07% | Not available |
| Data requirements | High (tens of thousands) | High (tens of thousands) | High (tens of thousands) |
| Computational complexity | Medium | High | Lower |
| Interpretability | Moderate | Limited | Moderate |
Table 2. System development environment.
| Category | Item | Specification |
|---|---|---|
| Hardware | Processor | Intel® Core™ i7-8086K |
| Hardware | Memory | DDR4-32G |
| Hardware | Graphics Card | NVIDIA GeForce GTX 1080Ti (11G) |
| Software | Operating System | Windows 10 |
| Software | Development Tools | Anaconda3 |
| Software | Python Version | 3.10 |
Table 3. Comparison of CNN models (Bold text highlights the highest value).
| Training Model | Training Set Accuracy | Validation Set Accuracy | Test Set Accuracy |
|---|---|---|---|
| ResNet50 | **0.9996** | 0.921 | **0.9288** |
| VGG19 | 0.9987 | **0.9297** | 0.9258 |
| Self-built CNN | 0.9897 | 0.8792 | 0.8842 |
| YOLOv8n | 0.9517 | 0.922 | 0.9233 |
Table 4. Summary of precision at k (Bold text highlights the highest value).
| Training Model | Top-1 | Top-5 | Top-10 |
|---|---|---|---|
| ResNet50 | 0.96541 | 0.95925 | 0.81454 |
| VGG19 | 0.97 | 0.96175 | 0.81567 |
| Self-built CNN | 0.96833 | **0.96308** | **0.81625** |
| YOLOv8n | **0.97083** | 0.96242 | 0.81333 |
Table 5. Summary of recall at k (Bold text highlights the highest value).
| Training Model | Top-1 | Top-5 | Top-10 |
|---|---|---|---|
| ResNet50 | 0.09943 | 0.18272 | 0.61035 |
| VGG19 | 0.11136 | 0.21986 | 0.76828 |
| Self-built CNN | 0.12432 | 0.33533 | **0.99172** |
| YOLOv8n | **0.14396** | **0.50067** | 0.98776 |
Table 6. Summary of F1-score (Bold text highlights the highest value).
| Training Model | Top-1 | Top-5 | Top-10 |
|---|---|---|---|
| ResNet50 | 0.18029 | 0.307 | 0.69782 |
| VGG19 | 0.19975 | 0.35791 | 0.79127 |
| Self-built CNN | 0.22035 | 0.49747 | **0.89547** |
| YOLOv8n | **0.25074** | **0.65869** | 0.8921 |
Table 7. Summary of training time (Bold text highlights the highest value).
| Training Model | Training Set Accuracy | Validation Set Accuracy | Test Set Accuracy |
|---|---|---|---|
| ResNet50 | 0.9996 | 0.921 | 0.9288 |
| VGG19 | 0.9987 | 0.9297 | 0.9258 |
| Self-built CNN | 0.9897 | 0.8792 | 0.8842 |
| YOLOv8n | 0.9517 | 0.922 | 0.9233 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Citation

Chang, Chin-Chih, Chi-Hung Wei, Yen-Hsiang Wang, Chyuan-Huei Thomas Yang, and Sean Hsiao. 2025. "Recommender System for Apparel Products Based on Image Recognition Using Convolutional Neural Networks" Engineering Proceedings 89, no. 1: 38. https://doi.org/10.3390/engproc2025089038