Recommender System for Apparel Products Based on Image Recognition Using Convolutional Neural Networks

Chang, Chin-Chih; Wei, Chi-Hung; Wang, Yen-Hsiang; Yang, Chyuan-Huei Thomas; Hsiao, Sean

doi:10.3390/engproc2025089038

Open Access Proceeding Paper

Recommender System for Apparel Products Based on Image Recognition Using Convolutional Neural Networks^†

by Chin-Chih Chang ^1,*, Chi-Hung Wei ^2, Yen-Hsiang Wang ^1, Chyuan-Huei Thomas Yang ^3 and Sean Hsiao ^4

^1 Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu 30012, Taiwan
^2 Ph.D. Program in Engineering Science, Chung Hua University, Hsinchu 30012, Taiwan
^3 School of Information Engineering, Shandong Vocational and Technical University of International Studies, Rizhao 276826, China
^4 Department of Computer Science and Information Engineering, Ming Chuan University, Taoyuan 48710, Taiwan
^* Author to whom correspondence should be addressed.
^† Presented at the 2024 IEEE 7th International Conference on Knowledge Innovation and Invention, Nagoya, Japan, 16–18 August 2024.

Eng. Proc. 2025, 89(1), 38; https://doi.org/10.3390/engproc2025089038

Published: 14 March 2025

(This article belongs to the Proceedings of 2024 IEEE 7th International Conference on Knowledge Innovation and Invention)

Abstract

In e-commerce and fashion, personalized recommendations are used to enhance user experience and engagement. In this study, an apparel recognition and recommender system (ARRS) using convolutional neural networks (CNNs) was employed to analyze apparel images, extract features, and provide accurate recognition and recommendations. By learning patterns and features of clothes, the system enables robust recognition and personalized suggestions. The effectiveness of ARRS in recognizing apparel and generating relevant recommendations was validated. The system enhances user satisfaction and engagement on fashion e-commerce platforms.

Keywords: convolutional neural network; image recognition; recommender system

1. Introduction

Recent advancements in e-commerce technologies have revolutionized online shopping. As online retailers grow and apparel choices multiply, consumers increasingly need personalized recommendations. To meet this demand, we applied deep learning, particularly convolutional neural networks (CNNs), to improve apparel recognition and recommendation [1].

CNNs, inspired by visual perception, excel at extracting and learning intricate patterns from images, making them ideal for recognizing and categorizing apparel. We explored the design, implementation, and performance evaluation of an apparel recognition and recommender system (ARRS) using CNNs. By analyzing apparel images and extracting key features, the system generates personalized recommendations, aiming to enhance user satisfaction and engagement in fashion e-commerce [2,3]. The proposed system improves the efficiency and effectiveness of online fashion shopping.

This paper is structured as follows: Section 1 introduces the study, Section 2 presents related work, Section 3 outlines the research methodology, Section 4 discusses experiments and analysis, and Section 5 concludes the paper with future research directions.

2. Related Work

2.1. Deep Learning and CNN

Deep learning, a transformative force in artificial intelligence (AI), has revolutionized computer vision, natural language processing, and pattern recognition. Its success underpins innovations such as ChatGPT 3.5 and NVIDIA technologies [4,5]. At its core, deep learning uses neural networks, computational models inspired by the brain, consisting of interconnected layers that extract meaningful patterns from data [1].

CNNs, specialized for image analysis, apply learnable filters to capture patterns at different scales, such as edges and textures [2,3]. CNNs excel in autonomously learning hierarchical features, adjusting parameters through extensive training to accurately recognize complex objects. Key models, including ResNet50 [6,7], VGG19 [8], and YOLOv8 [9,10,11], exemplify CNNs’ effectiveness, with their comparison shown in Table 1.

ResNet50, a deep CNN known for its depth and effectiveness in computer vision, uses residual connections to avoid vanishing gradients, excelling on datasets such as ImageNet [6,7]. VGG19, with its 19-layer architecture, leverages small filters and max-pooling layers to capture complex patterns, making it effective for image classification tasks [8]. You Only Look Once (YOLO) is used for real-time object detection, predicting bounding boxes and class probabilities in a single pass, with YOLOv8 showing strong performance in both detection and classification [9,10,11]. In apparel recognition and recommendation, CNNs are vital for precise image analysis and personalized recommendations. Section 3 details their application in system design.

2.2. Recommender Systems

Recommender systems predict user preferences or ratings, playing a key role in e-commerce, music, and video platforms [12,13,14]. These systems use collaborative filtering, content-based filtering, or hybrids of the two to analyze data such as ratings and browsing history. Collaborative filtering looks at user or item similarities, while content-based filtering matches user preferences with item attributes. Hybrid approaches combine these methods for better accuracy. Machine and deep learning enhance these techniques for complex recommendations. Platforms including Amazon, Netflix, and Spotify use them to improve user experience and boost sales. In fashion e-commerce, combining CNN-based image analysis with collaborative filtering provides tailored apparel suggestions.

2.3. Recommenders Based on Deep Learning

Deep learning has become a powerful tool in recommendation systems, enabling personalized content suggestions through its ability to learn complex patterns from raw data [15,16]. Current approaches use neural networks alone or combined with traditional methods. For a more in-depth exploration of deep learning-based recommendation mechanisms, refer to [16].

We enhanced recommender systems with content filtering, collaborative filtering, and deep learning, particularly in integrating content-based features from textual, visual, or multimodal data sources. CNNs process apparel images, while natural language processing handles product descriptions or reviews, improving accuracy. The proposed system integrates deep learning for image analysis and collaborative filtering for recommendations, offering personalized suggestions in fashion e-commerce.

3. Research Methodology

3.1. System Design

This system contains three main modules: data learning, image recognition, and recommendation modules, as shown in Figure 1.

The data learning module collects, preprocesses, and trains models, storing the trained model and weights for future use.
The image recognition module processes image input and uses the trained model to predict categories.
The image recommendation module extracts features, calculates similarity with user ratings, and provides personalized recommendations.

3.2. Data Collection and Preprocessing

The apparel dataset consists of 20 categories of apparel from the 108 categories provided by kaggle.com. The sources of the data are shown in Figure 2 and are organized as follows:

The dataset comprises 20 categories of apparel specifically curated for this study.
It is sourced from the “Fashion Product Images Dataset”, specifically the 108-category dataset within it.
The dataset is a collaboration with Param Aggarwal (Owner).
The data originate from the Indian fashion e-commerce website: myntra.com.
Dataset source link is https://www.kaggle.com/datasets/paramaggarwal/fashion-product-images-dataset (accessed on 20 March 2024).

We collected preferences from 60 users who viewed a clothing photo dataset with 20 categories, each containing 600 photos, as shown in Figure 3. Users selected 100 favorites per category, which were placed in “fashion_like_100”, then 10 favorites placed in “fashion_like_10”, 5 placed in “fashion_like_5”, and finally 1 favorite placed in “fashion_like_1”. Preferences were stored in a CSV file with user IDs and liked item IDs, and were used as ratings in the recommender system to generate suggestions.
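As a rough illustration of how such a preference file could be read, the following sketch loads user IDs and liked item IDs and counts how many users liked each item. The column names `user_id` and `item_id`, the sample rows, and the helper names are hypothetical; the actual file layout is not given in the paper.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows for illustration only; not data from the study.
SAMPLE_CSV = """user_id,item_id
u01,1163
u01,2093
u02,1163
"""


def load_likes(csv_file):
    """Map each user ID to the set of item IDs that user liked."""
    likes = defaultdict(set)
    for row in csv.DictReader(csv_file):
        likes[row["user_id"]].add(row["item_id"])
    return dict(likes)


def like_counts(likes):
    """Count how many users liked each item (a simple popularity signal)."""
    counts = defaultdict(int)
    for items in likes.values():
        for item in items:
            counts[item] += 1
    return dict(counts)


likes = load_likes(io.StringIO(SAMPLE_CSV))
popularity = like_counts(likes)
```

A real file would be opened with `open(path, newline="")` in place of the in-memory `io.StringIO` buffer.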

In deep learning, data augmentation is key to expanding the training dataset [17]. By exposing the model to a wider range of data, it enhances the model’s ability to generalize, adapt to unseen data, and mitigate overfitting [18]. Data augmentation methods, such as scaling, shearing, and horizontal flipping, are commonly employed to achieve this. Scaling adjusts image sizes while preserving proportions, aiding the model in recognizing objects of varying sizes. Shearing slants image shapes so that the model can still recognize distorted objects. Horizontal flipping creates mirrored images, diversifying the data and enabling the model to recognize objects in mirrored orientations.
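The three augmentations named above can be sketched in plain Python on a tiny grayscale image stored as nested lists. This is a minimal illustration with nearest-neighbour sampling and hypothetical function names, not the augmentation pipeline used in the study, which would normally rely on an image library.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values


def horizontal_flip(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def scale_nearest(img: Image, factor: float) -> Image:
    """Resize both axes by the same factor, preserving proportions."""
    h, w = len(img), len(img[0])
    new_h = max(1, round(h * factor))
    new_w = max(1, round(w * factor))
    return [
        [img[min(h - 1, int(y / factor))][min(w - 1, int(x / factor))]
         for x in range(new_w)]
        for y in range(new_h)
    ]


def shear_x(img: Image, shear: float, fill: int = 0) -> Image:
    """Shift each row sideways in proportion to its row index."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        offset = round(shear * y)
        for x in range(w):
            src = x - offset
            if 0 <= src < w:
                out[y][x] = img[y][src]
    return out


tiny = [[1, 2], [3, 4]]
flipped = horizontal_flip(tiny)
enlarged = scale_nearest(tiny, 2)
slanted = shear_x(tiny, 1.0)
```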

3.3. Model Design, Training, and Fine-Tuning

The proposed system adopts ResNet50, VGG19, YOLOv8n, and a self-built CNN as deep-learning training modules. ResNet50 and VGG19 were chosen for their established strengths in image classification, and YOLOv8n for its novelty. The self-built CNN takes 224 × 224 input images and comprises six layers. It initially splits into two branches: one with 64 3 × 3 convolutional kernels with a stride of 1, and another with 32 1 × 1 convolutional kernels. Following this, it integrates 64 3 × 3 convolutional kernels with a stride of 1, utilizing the ReLU activation function and max pooling. Subsequently, it concatenates the branches and processes the result with 128 3 × 3 convolutional kernels, the ReLU activation function, and max pooling. Finally, the 2D feature maps are flattened into 1D for output through fully connected layers. Figure 4 depicts this CNN architecture.

The implementation process of the learning procedure is as follows:

Dataset construction: A dataset consisting of 20 apparel categories is established, divided into a 60:20:20 ratio for training, validation, and testing, respectively.
ResNet50, VGG19, YOLOv8n, self-built CNN training: Each training module runs for 100 iterations, with 64 images processed per iteration, and validation is also conducted for 100 iterations with the same batch size.
Testing: Testing involves processing 64 images at a time, with the final output being the accuracy rate.
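The split and batching arithmetic above can be sketched as follows. The total of 12,000 images (20 categories × 600 photos) is an assumption carried over from the photo set described in Section 3.2, and the paper does not say whether the split was stratified per category; `split_indices` and `batches` are hypothetical helper names.

```python
import random


def split_indices(n_items, percents=(60, 20, 20), seed=0):
    """Shuffle item indices and split them into train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = n_items * percents[0] // 100
    n_val = n_items * percents[1] // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


def batches(items, batch_size=64):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Assumed dataset size: 20 categories x 600 photos.
train, val, test = split_indices(20 * 600)
n_train_batches = sum(1 for _ in batches(train))
```

Under these assumptions the split yields 7200, 2400, and 2400 images, and one pass over the training set takes 113 batches of up to 64 images, the last holding 32.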

3.4. Recommendation Engine

The recommendation module generates feature vectors using the selected model. In machine and deep learning, feature extraction is vital, transforming data into mathematical representations across multiple dimensions, each capturing distinct attributes or characteristics.

It then calculates the similarity between the feature vector of the input photo and those of the photos in the database using cosine similarity, defined as

\text{Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_{i} B_{i}}{\sqrt{\sum_{i=1}^{n} A_{i}^{2}} \, \sqrt{\sum_{i=1}^{n} B_{i}^{2}}}

(1)

where A and B are feature vectors [19,20,21].
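Equation (1) translates directly into a few lines of Python. The sketch below uses only the standard library; returning 0 for an all-zero vector is an assumption made here to avoid division by zero, not something stated in the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing CNN feature vectors of different magnitudes.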

The average rating of each photo in the database is multiplied by its similarity to obtain a comprehensive score that reflects both similarity and rating. Because the system constantly provides recommendations based on this score, it can overlook certain items, notably those with high similarity but no ratings. Therefore, mechanisms must be adopted to tackle this issue. In the system, we also consider item popularity. To improve performance, we developed the following strategies based on the selected top-k recommendations.

For k = 1, items with high similarity scores but no ratings are prioritized at a 3% probability, with the remaining 97% of recommendations based on comprehensive scores.
For k = 5, items with high similarity scores but no ratings are prioritized with a 20% probability, with the remaining 80% of recommendations based on scores.
For k = 10, a 5% probability is allocated to include 1 item with a high similarity score but no rating in the recommendations, and a 90% probability to include 2 items with high similarity scores but no ratings. The remaining 5% of recommendations are based on comprehensive scores.
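The three strategies can be sketched as a single selection routine. This is one possible reading under stated assumptions, not the authors' implementation: unrated items are those missing from `avg_rating`, exactly one unrated item is injected for k = 1 and k = 5 (the text does not give the count), and `EXPLORATION_PLAN` and `recommend` are hypothetical names.

```python
import random
from typing import Dict, List

# For each k: (number of unrated items to include, probability), from the text.
EXPLORATION_PLAN = {
    1: [(1, 0.03)],
    5: [(1, 0.20)],  # assumption: one unrated item
    10: [(1, 0.05), (2, 0.90)],
}


def recommend(similarity: Dict[str, float], avg_rating: Dict[str, float],
              k: int, rng: random.Random) -> List[str]:
    """Pick up to k items, occasionally promoting similar but unrated ones."""
    rated = [i for i in similarity if i in avg_rating]
    unrated = [i for i in similarity if i not in avg_rating]
    # Comprehensive score: average rating multiplied by similarity.
    rated.sort(key=lambda i: similarity[i] * avg_rating[i], reverse=True)
    unrated.sort(key=lambda i: similarity[i], reverse=True)

    # Draw once and walk the cumulative probabilities for this k.
    draw = rng.random()
    n_unrated = 0
    cumulative = 0.0
    for count, prob in EXPLORATION_PLAN.get(k, []):
        cumulative += prob
        if draw < cumulative:
            n_unrated = count
            break
    n_unrated = min(n_unrated, len(unrated), k)
    return unrated[:n_unrated] + rated[:k - n_unrated]


# Usage with made-up similarities and ratings.
sim = {"a": 0.9, "b": 0.8, "c": 0.95, "d": 0.5}
ratings = {"a": 4.0, "b": 5.0, "d": 3.0}
top_5 = recommend(sim, ratings, 5, random.Random(0))
```

Placing the unrated items first is a design choice of this sketch; the text only says they are prioritized or included.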

4. Experiments and Results

4.1. Experimental Device

The system was built using the following equipment, and the detailed development environment is presented in Table 2.

4.2. Model Training and Evaluation

In training deep learning networks, the accuracy changes as the number of iterations increases and eventually saturates. Figure 5 and Figure 6 depict the training graphs for ResNet50 and the self-built CNN, respectively. The summary of training results is shown in Table 3. ResNet50, VGG19, and YOLOv8n show higher validation and test accuracy than the self-built CNN.

4.3. Recommendation Evaluation

In evaluation [19,20], precision and recall are key metrics. Precision is the ratio of true positives to the sum of true and false positives, indicating the model’s accuracy in predicting positive samples. Recall is the ratio of true positives to the sum of true positives and false negatives, reflecting the model’s ability to capture positive samples. F1-score is the harmonic mean of precision and recall, balancing both metrics. We use precision at k, recall at k, and F1-score as evaluation metrics, focusing on the top k recommended items to calculate precision and recall.

Precision at k is the fraction of the recommended items that were successfully recommended, that is, that have been rated by the user, where k is the total number of recommended items.

\text{Precision at } k = \frac{\text{\# of recommended items that have been rated}}{\text{\# of total recommended items}}

(2)

Recall at k is the fraction of all rated items that appear among the k recommended items.

\text{Recall at } k = \frac{\text{\# of items recommended and rated}}{\text{\# of all rated items}}

(3)

F1-score is the harmonic mean of precision and recall.

F_{1}\text{-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

(4)
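Equations (2) to (4) can be sketched as three small functions. The function names and the sample item IDs are hypothetical, and returning 0 when a denominator is zero is an assumption made here for robustness.

```python
from typing import Iterable, Sequence


def precision_at_k(recommended: Sequence[str], rated: Iterable[str], k: int) -> float:
    """Share of the top-k recommended items that the user has rated."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    rated_set = set(rated)
    hits = sum(1 for item in top_k if item in rated_set)
    return hits / len(top_k)


def recall_at_k(recommended: Sequence[str], rated: Iterable[str], k: int) -> float:
    """Share of all rated items that appear in the top-k recommendations."""
    rated_set = set(rated)
    if not rated_set:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in rated_set)
    return hits / len(rated_set)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: 2 of 5 recommendations are among 4 rated items.
recs = ["a", "b", "c", "d", "e"]
rated_items = {"a", "c", "x", "y"}
p = precision_at_k(recs, rated_items, 5)
r = recall_at_k(recs, rated_items, 5)
f1 = f1_score(p, r)
```

As a consistency check, feeding the Top-10 precision and recall reported for the self-built CNN in Table 4 and Table 5 (0.81625 and 0.99172) into `f1_score` gives about 0.8955, which agrees with the Table 6 entry to within rounding.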

The cosine similarity is used to compare four deep-learning models for collaborative filtering. Similarities are computed using a clothing dataset, and a comprehensive performance comparison is constructed using precision at k, recall at k, and F1-score. The summary of precision at k, recall at k, and F1-score is shown in Table 4, Table 5 and Table 6.

Top-10 recommendations present the most feasible solution: the F1-scores at Top-10 are the highest for all four models. At Top-10, the self-built CNN achieves the highest precision at k, recall at k, and F1-score, while YOLOv8n leads on recall and F1-score at Top-1 and Top-5. Excessively high image recognition accuracy may limit the variety of items for recommendation.

4.4. Computation Time

We apply four different training models individually to the recommendation system. Subsequently, we compare their operational speed. The summary of training time and recommendation time is shown in Table 7. YOLOv8n has the lowest training time, followed by the self-built CNN. YOLOv8n consistently has the lowest recommendation times across all top-K scenarios, followed by self-built CNN.

5. Conclusions and Future Work

We developed an efficient system that integrates apparel recognition with personalized recommendations using CNNs. While image feature extraction accuracy is important, excessively high accuracy may limit recommendation diversity. Among the four models tested, YOLOv8n is consistently the fastest, followed by the self-built CNN, with ResNet50 and VGG19 trailing in training speed. The self-built model trains faster than ResNet50 and VGG19 but with lower accuracy, highlighting the need to adjust model complexity based on specific requirements. The developed recommender system offers a foundation for exploring various deep learning frameworks and is adaptable to diverse image datasets.

Recommendation precision can be further enhanced by integrating contrastive language–image pretraining (CLIP), which can understand both text and images [22]. Fine-tuning CLIP on fashion descriptions and images can improve recommendation accuracy and enhance user satisfaction and engagement in fashion e-commerce.

Author Contributions

Conceptualization, C.-C.C.; Methodology, C.-C.C., C.-H.W., Y.-H.W. and C.-H.T.Y.; Writing—original draft preparation, C.-C.C.; writing—review and editing, S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Kelleher, J.D. Deep Learning; The MIT Press: Cambridge, MA, USA, 2019.
2. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
3. Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
4. Wu, T.; He, S.; Liu, J.; Sun, S.; Liu, K.; Han, Q.-L.; Tang, Y. A brief overview of ChatGPT: The history, status quo and potential future development. IEEE/CAA J. Autom. Sin. 2023, 10, 1122–1136.
5. Mittal, S. A survey on optimized implementation of deep learning models on the NVIDIA Jetson platform. J. Syst. Archit. 2019, 97, 428–442.
6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
7. Koonce, B. ResNet 50. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: New York, NY, USA, 2021; pp. 63–72.
8. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
10. Jocher, J.G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Poznanski, J.; Yu, L.; Rai, P.; Ferriday, R.; et al. ultralytics/yolov5: v3.0. Zenodo 2020.
11. Terven, J.R.; Cordova-Esparza, D.M. A comprehensive review of YOLO: From YOLOv1 to YOLOv8 and beyond. arXiv 2023, arXiv:2304.00501.
12. Jannach, D.; Zanker, M.; Felfernig, A.; Friedrich, G. Recommender Systems: An Introduction; Cambridge University Press: Cambridge, UK, 2010.
13. Martin, F.J.; Donaldson, J.; Ashenfelter, A.; Torrens, M.; Hangartner, R. The big promise of recommender systems. AI Mag. 2011, 32, 19–27.
14. Resnick, P.; Varian, H.R. Recommender Systems. Commun. ACM 1997, 40, 56–58.
15. Cheng, H.-T.; Koc, L.; Harmsen, J.; Shaked, T.; Chandra, T.; Aradhye, H.; Anderson, G.; Corrado, G.; Chai, W.; Ispir, M.; et al. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, Boston, MA, USA, 15 September 2016; pp. 7–10.
16. Zhang, S.; Yao, L.; Sun, A.; Tay, Y. Deep Learning Based Recommender System: A Survey and New Perspectives. ACM Comput. Surv. 2020, 52, 1–38.
17. Ko, B.; Ok, J. Time Matters in Using Data Augmentation for Vision-Based Deep Reinforcement Learning. arXiv 2021, arXiv:2102.08581.
18. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, South Korea, 27 October–2 November 2019; pp. 6023–6032.
19. Zuva, K.; Zuva, T. Evaluation of Information Retrieval Systems. Int. J. Comput. Sci. Inf. Technol. (IJCSIT) 2012, 4, 35–43.
20. Tan, P.-N.; Steinbach, M.; Kumar, V. Introduction to Data Mining, 2nd ed.; Pearson: London, UK, 2018.
21. Singhal, A. Modern Information Retrieval: A Brief Overview. Bull. IEEE Comput. Soc. Tech. Comm. Data Eng. 2001, 24, 35–43.
22. Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv 2021, arXiv:2110.05208.

Figure 1. System architecture.

Figure 2. Apparel data source.

Figure 3. Apparel dataset.

Figure 4. Self-built CNN model.

Figure 5. ResNet50 training graph.

Figure 6. Self-built CNN training graph.

Table 1. Comparison of CNN models.

Model	ResNet50	VGG19	YOLOv8n
Architecture	50 layers	19 layers	225–230 layers
Convolutional filters	3 × 3	3 × 3	Varying sizes
Max pooling	2 × 2	2 × 2	Various sizes
Fully connected layers	Minimal (Often 1)	Large number	Minimal (might have none)
Top-5 accuracy on ImageNet	97.33%	97.07%	Not available
Data requirements	High (Tens of thousands)	High (Tens of thousands)	High (Tens of thousands)
Computational complexity	Medium	High	Lower
Interpretability	Moderate	Limited	Moderate

Table 2. System development environment.

Hardware
Processor	Intel® Core™ i7-8086K
Memory	DDR4-32G
Graphics Card	NVIDIA GeForce GTX 1080Ti (11G)
Software
Operating System	Windows 10
Development Tools	Anaconda3
Python Version	3.10

Table 3. Comparison of CNN models (Bold text highlights the highest value).

Training Model	Training Set Accuracy	Validation Set Accuracy	Test Set Accuracy
ResNet50	0.9996	0.921	0.9288
VGG19	0.9987	0.9297	0.9258
Self-built CNN	0.9897	0.8792	0.8842
YOLOv8n	0.9517	0.922	0.9233

Table 4. Summary of precision At K (Bold text highlights the highest value).

Training Model	Top-1	Top-5	Top-10
ResNet50	0.96541	0.95925	0.81454
VGG19	0.97	0.96175	0.81567
Self-built CNN	0.96833	0.96308	0.81625
YOLOv8n	0.97083	0.96242	0.81333

Table 5. Summary of recall At K (Bold text highlights the highest value).

Training Model	Top-1	Top-5	Top-10
ResNet50	0.09943	0.18272	0.61035
VGG19	0.11136	0.21986	0.76828
Self-built CNN	0.12432	0.33533	0.99172
YOLOv8n	0.14396	0.50067	0.98776

Table 6. Summary of F1-score (Bold text highlights the highest value).

Training Model	Top-1	Top-5	Top-10
ResNet50	0.18029	0.307	0.69782
VGG19	0.19975	0.35791	0.79127
Self-built CNN	0.22035	0.49747	0.89547
YOLOv8n	0.25074	0.65869	0.8921

Table 7. Summary of training time (Bold text highlights the highest value).


Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
