Attention-Based Personalized Compatibility Learning for Fashion Matching

Abstract: The fashion industry has a critical need for fashion compatibility modeling. Modeling compatibility is a challenging task that involves extracting (in)compatible features of item pairs, learning compatibility relationships between matching items, and applying them to personalized recommendation tasks. Compatibility is, in general, a complex and subjective concept. The complexity is reflected in the fact that relationships between fashion items are determined by multiple matching rules, such as color, shape, and material. The subjectivity arises because aesthetic style and fashion preference differ from person to person. As a result, personalized factors must be considered. Previous works mainly utilize a convolutional neural network to measure compatibility by extracting general features, but they ignore fine-grained compatibility features and only model overall compatibility. We propose a novel neural network framework called the Attention-based Personalized Compatibility Embedding Network (PCE-Net). It comprises two components: attention-based compatibility embedding modeling and attention-based personal preference modeling. In the second component, we utilize matrix factorization and content-based features to obtain user preferences. Both components are jointly trained end to end using the Bayesian Personalized Ranking (BPR) framework. Extensive experiments on the IQON3000 dataset demonstrate that PCE-Net significantly outperforms most baseline methods.


Introduction
The emergence of e-commerce has made shopping more convenient. However, the information overload caused by the vast number of products available on shopping websites has created the need for recommendation systems that help users find items more quickly and accurately. To address this demand, we focus on developing an intelligent recommendation algorithm for clothing. One of the primary challenges is generating reasonable matching suggestions for clothing styles and types. This requirement gives rise to fashion compatibility modeling, which helps determine whether fashion items meet specific matching criteria. As illustrated in Figure 1, compatible items satisfy particular rules, such as having matched colors and materials, while incompatible items violate those rules. Moreover, users have individual preferences regarding style, texture, pattern, and more. For instance, user 1 prefers pairing casual, loose tops with wide-legged pants, user 2 has a versatile fashion taste that ranges from casual sports to elegant dresses, and user 3 enjoys wearing clothes with striped patterns.
Initially, some studies only considered visual features of fashion items [1][2][3][4] when building compatibility models. Subsequently, several works [5][6][7][8] modeled compatibility by fusing both visual and textual multi-modal content. Further, researchers [5,6,9,10] distinguished overall compatibility from fine-grained compatibility. Some recent studies [11][12][13][14][15] have considered user factors in personalized recommendation tasks. Although these works have individual strengths, they fail to provide a comprehensive solution that addresses all the underlying problems. We aim to develop a personalized clothing recommendation system that takes visual and textual modalities as input, extracts fine-grained compatibility characteristics, and considers user preferences. In fashion recommendation, the primary challenge lies in accurately predicting and providing reasonable suggestions that align with a user's preferences. Researchers must concentrate on two significant issues. The first is how to enhance the accuracy of determining fashion item compatibility from multi-modal data. Fashion recommendation is founded on the principle of fashion compatibility, which implies that various types of fashion items can be combined to create an outfit. Developing compatible feature spaces is crucial for continually advancing fashion compatibility models. For instance, researchers often employ visual features [1,9], textual features [16], category-aware feature subspaces [5,6], and neighbor node features of graph models [13,17] as inputs. The second is that, since an outfit typically comprises multiple complementing fashion items, selecting items that satisfy the user's preferences while complementing each other is the crux of outfit construction. However, fashion is an inherently intricate and subjective concept, and defining fashion items frequently involves many complex intersubjective relationships between complementary fashion items. It is critical to note that the notion of compatibility between fashion items often spans categories and encompasses intricate interactions.
To address the abovementioned challenges, we propose an attention-based personalized compatibility embedding network for clothing matching, called PCE-Net (Figure 2). This network evaluates the compatibility between fashion items while capturing the user's personal preference from multi-modal features and their previous preferences. PCE-Net has two main components: attention-based compatibility embedding modeling and attention-based personal preference modeling. To overcome the first challenge, we introduce two attention branches to model fine-grained compatibility for the multi-modal data. To address the second challenge, inspired by the personal preference component of GP-BPR [11], we model attention-based personalized preference, which utilizes global latent preference factors and content-based preference factors. Specifically, we use feature extractors and the two attention branches to learn the compatibility embeddings of fashion items, and we learn user preferences by matrix factorization and inner products. Finally, based on the Bayesian Personalized Ranking (BPR) framework [18], PCE-Net integrates attention-based compatibility embedding modeling and attention-based personal preference modeling. Our model has extensive application prospects on e-commerce platforms and social networks. On e-commerce platforms, it can evaluate users' personalized preferences based on their purchase and browsing history. Then, based on the clothing the user is currently browsing, it can recommend items that match the current outfit and also cater to the user's personalized preferences. On social platforms, it can provide users with a compatibility assessment function that calculates compatibility scores for the outfits selected by users and provides adjustment suggestions.
Our main contributions are threefold:
1. We present an attention-based personalized compatibility embedding scheme for personalized clothing matching, namely PCE-Net, which jointly models attention-based (item-item) compatibility and personal (user-item) preferences.
2. We propose an innovative approach to capture the compatibility embeddings of multi-modal data using different attention branches separately, which has not been attempted before to the best of our knowledge. In addition, we demonstrate the effectiveness of the different attention branches through ablation experiments.
3. We conduct extensive experiments and use t-SNE visualization on the real-world dataset IQON3000 to validate the effectiveness of our scheme against state-of-the-art methods.
The remainder of this paper is structured as follows. We briefly review the related work in Section 2. In Section 3, we present the proposed PCE-Net in detail. The experimental results and relative analysis are provided in Section 4, followed by our concluding remarks and future work in Section 5.

Related Work
The area of fashion compatibility learning and recommendation employs two main categories of algorithms: item-item level and multi-item level approaches. The former category models the compatibility interaction between two items. For example, Rendle et al. [19] proposed the pairwise interaction tensor factorization (PITF) model, which simulates pairwise interactions between users, items, and tags. Meanwhile, the works in [20,21] utilized a single latent style space to measure compatibility solely using visual features. However, a single latent compatibility space is inadequate for the detailed modeling of complex relationships among different concepts, such as color, pattern, and category. To address this gap, Veit et al. [1] proposed the Conditional Similarity Network (CSN) model, which learns different subspaces under different similarity metrics. Vasileva et al. [5] then built on [1] by introducing a type-aware learning method to attain type-aware features in a universally shared latent space. Building on these works, Laenen et al. [9] incorporated three attention mechanisms for multi-modal features in the type-aware subspace. When predicting the compatibility between items, concatenation is adopted and implemented in a superior scheme. To automatically recognize the relative significance of different conditions, Tan et al. [6] leveraged the attention mechanism [22]. In a different approach, Cucurull et al. [23] suggested a graph neural network utilizing undirected graphs augmented with contextual information to predict associations between two items, thereby transforming the fashion compatibility issue into a graph edge prediction problem. Singhal et al. [7] proposed a holistic approach to learning visual compatibility, encompassing TC-GAE, SAE, and search techniques modeled on a graph-based network, an autoencoder, and reinforcement learning, respectively. Song et al. [16] introduced the Dual Autoencoder Network (DAE) as the first model to learn the compatibility feature space while incorporating the consistent relationship between visual-textual features and the implicit preference between items via Bayesian Personalized Ranking (BPR). Song et al. [24] presented AKD-DBPR, which employs knowledge distillation to combine fashion domain expertise with deep neural networks. Yang et al. [25] introduced TransNFCM, a translation-based neural fashion compatibility model that aims to capture complex compatibility patterns via distance functions. Lu et al. [26] developed a method for personalized outfit recommendation by training hash codes for both users and items and modeling users' preferences as the average of their preference scores for each item. Song et al. [11] proposed GP-BPR, a joint model that combines general compatibility and personal preferences. Sagar et al. [12] proposed a personalized recommendation modeling scheme named PAI-BPR, which utilizes attributes for interpretability and personalized recommendation. Finally, Taraviya et al. [27] introduced PSA-Net, which learns attribute-wise visual feature subspaces via self-attention and incorporates customer embeddings to aid in recommending item pairs in a category-based subspace.
The second category of outfit recommendation models aims to capture the interactions between multiple items in an outfit composed of three or more items. An early model developed by Han et al. [8] used a sequence approach, treating the items within an outfit as ordered, and proposed the Bi-LSTM sequence model. However, this model is limited by its sensitivity to item order. In contrast, Cui et al. [17] introduced the NGNN model, where a directed graph represents the complex relationships between multiple items in an outfit, providing a better model for data representation. For personalized outfit recommendation, Rendle et al. [18] proposed the widely used Matrix Factorization (MF) model, while He et al. [2] introduced the VBPR model, which is based on Bayesian Personalized Ranking [18] and incorporates user preferences for visual factors. Furthermore, He et al. [2] developed FashionNet, which recommends outfits (top, bottom, shoes) using a two-stage training strategy, where a general compatibility model is fine-tuned with personal preferences using encoding techniques. Li et al. [13] proposed the hierarchical fashion graph network (HFGN) model, which combines compatibility modeling and personalized outfit recommendation tasks. To generate personalized outfits based on users' historical click behaviors, Xu et al. [14] developed the personalized outfit generation (POG) model, which utilizes the Transformer [22] architecture to encode users' preferences for items and outfits. Dong et al. [15] proposed a personalized capsule closet creation framework (PCW-DC) based on the Bi-LSTM [8], which learns outfit compatibility, user preference, and body type information concurrently. In addition, Lin et al. [10] presented the neural outfit recommendation (NOR) model, a neural network framework capable of simultaneously addressing the tasks of outfit recommendation and comment generation. The framework consists of an outfit-matching framework and a comment-generation framework. For outfit complementary item retrieval, Lin et al. [28] proposed a category-based subspace attention network and an outfit ranking loss to model the item interactions within an entire outfit. Lastly, Sarkar et al. [29] proposed OutfitTransformer, a framework based on the Transformer [22], to learn an outfit-level representation.

Methodology
This section presents the problem formulation and thoroughly describes the proposed attention-based personalized compatibility embedding modeling approach.

Problem Formulation
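First, assume we have a set of users $U = \{u_1, u_2, \cdots, u_M\}$, a set of tops $T = \{t_1, t_2, \cdots, t_{N_t}\}$, and a set of bottoms $B = \{b_1, b_2, \cdots, b_{N_b}\}$, where $i$ and $j$ refer to the index of the top and bottom, respectively. Then, for each top $t_i$ (bottom $b_j$), $v_{t_i}$ ($v_{b_j}$) and $t_{t_i}$ ($t_{b_j}$) represent its visual and textual features from different ConvNet modules. Next, we use $\tilde{v}_{t_i}$ ($\tilde{v}_{b_j}$) $\in \mathbb{R}^{D_v}$ and $\tilde{t}_{t_i}$ ($\tilde{t}_{b_j}$) $\in \mathbb{R}^{D_t}$ to indicate its visual and textual embeddings obtained through the different attention branch modules. $D_v$ and $D_t$ denote the dimensions of the corresponding embeddings.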
In this study, we aim to develop fashion compatibility embeddings for outfit recommendations by considering user preferences and employing an attention mechanism. Consistent with previous research [11], we explore the challenge of determining "which bottom would be preferred by the user to match the given top?". Let $e^m_{ij}$ denote the preference of the user $u_m$ towards the bottom $b_j$ for the given top $t_i$. Based on these scores, we can generate a personalized ranking list of bottoms $b_j$ for a given top $t_i$ and hence solve the practical problem of personalized outfit matching.
To ensure accurate measurements of $e^m_{ij}$, we have designed a personalized compatibility embedding modeling network $F$ that incorporates an attention mechanism. This network can integrate users' preferences for visual and textual aspects of items into the compatibility embedding model. The mathematical expression for this model is as follows:

$$e^m_{ij} = F(u_m, t_i, b_j \mid \theta_F),$$

where $\theta_F$ refers to the model parameters to be learned.

PCE-Net
To effectively address the challenge of personalized clothing matching, it is essential to account for both item-item compatibility and user-item preference. Modeling fashion item compatibility is a fundamental problem in this context. A significant issue, therefore, is how to generate compatibility embeddings that are helpful in clothing matching. To this end, we explore user preferences towards a bottom that complements a given top by modeling compatibility embeddings between fashion items and the user's personal preferences. Formally, $e^m_{ij}$ is obtained by combining the compatibility interaction $c_{ij} = C(t_i, b_j \mid \theta_C)$ between the top $t_i$ and bottom $b_j$ with the personal preference $p_{mj} = P(u_m, b_j \mid \theta_P)$ of user $u_m$ towards the bottom $b_j$. Here, $C$ and $P$ denote the attention-based compatibility embedding modeling and the attention-based personal preference modeling, respectively, with $\theta_C$ and $\theta_P$ as their corresponding model parameters. To balance the relative importance of both components, a non-negative trade-off parameter $\mu$ is used.
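The combination of the two components can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes that $\mu$ weights the compatibility score and $(1 - \mu)$ weights the preference score, which is one common convention for such trade-offs, and the function names and toy scores are hypothetical.

```python
def preference_score(c_ij: float, p_mj: float, mu: float) -> float:
    """Combine item-item compatibility c_ij and user-item preference p_mj.

    Assumption: mu weights the compatibility term and (1 - mu) the
    preference term; the text only states that mu is a non-negative
    trade-off between the two components.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu is expected in [0, 1]")
    return mu * c_ij + (1.0 - mu) * p_mj


def rank_bottoms(compat, pref, mu):
    """Return candidate bottom ids sorted by descending combined score.

    compat maps bottom id -> c_ij for the given top t_i;
    pref maps bottom id -> p_mj for the given user u_m.
    """
    scores = {j: preference_score(compat[j], pref[j], mu) for j in compat}
    return sorted(scores, key=scores.get, reverse=True)


# Toy scores, invented for illustration only.
compat = {"b1": 0.9, "b2": 0.2, "b3": 0.5}
pref = {"b1": 0.1, "b2": 0.8, "b3": 0.6}
ranking = rank_bottoms(compat, pref, mu=0.1)
```

With a small $\mu$, the ranking is dominated by the user's preference; moving $\mu$ towards 1 makes item-item compatibility decisive.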

Attention-Based Compatibility Embedding Modeling
We propose a more effective way of measuring the compatibility between the top $t_i$ and bottom $b_j$. To accomplish this, the model learns compatibility embeddings in a latent compatibility space. In this space, complementary top-bottom pairs should be closer than incompatible pairs. Additionally, we argue that there should be a gap between the interactive features of matching top-bottom pairs and mismatched top-bottom pairs, i.e., between $c_{ij}$ and $c_{ik}$, thus turning the task of predicting compatibility into a classification problem.
To learn the preliminary features of items in the visual and textual modalities, we employ convolutional neural networks (CNNs), which have demonstrated excellent performance in learning representations [30][31][32]. Note that all fashion items have visual and textual modalities. For example, information on the colors, patterns, and shapes of a fashion item can be extracted from its image, while its textual description can provide information on the brand, material, and category. These two modalities provide complementary information crucial for understanding the fashion items at the feature level. Therefore, we integrate the information of both modalities to learn the compatibility embeddings between fashion items. We use $v_{t_i}$ ($v_{b_j}$) and $t_{t_i}$ ($t_{b_j}$) to represent the global visual and textual features of the top $t_i$ and bottom $b_j$, respectively, from the ConvNet modules. Inspired by previous works [6,9,10], we incorporate two attention branches to capture features that aid compatibility embedding modeling. These branches allow us to obtain attentive visual and textual features, which we denote as $\tilde{v}_{t_i}$ ($\tilde{v}_{b_j}$) and $\tilde{t}_{t_i}$ ($\tilde{t}_{b_j}$), respectively. Here, $D_t$ represents the dimensionality of the latent compatibility space.
Visual Attention. To enable the compatibility embedding module to automatically capture pertinent fine-grained visual characteristics such as color, pattern, and shape, we incorporate visual dot-product attention, which generates attention weights from the global visual features $v_{t_i}$ ($v_{b_j}$) $\in \mathbb{R}^{D_v}$. We take the visual attentive representation learning of the top as an illustration. The visual attention weight $\omega^v_{t_i}$ is calculated according to Equation (5), and the attentive visual features $\tilde{v}_{t_i}$ are then obtained from $v_{t_i}$ and this weight. Likewise, we can calculate the attentive visual features $\tilde{v}_{b_j}$ of the bottom. Next, we utilize the inner product to quantify the visual compatibility interaction between the attentive visual features of the top $t_i$ and the bottom $b_j$:

$$c^v_{ij} = (\tilde{v}_{t_i})^{\top} \tilde{v}_{b_j},$$

where the inner product encodes the visual interaction score between fashion items.
Text Attention. We integrate a text attention branch into the compatibility embedding network. This branch aims to capture the text features of each individual top and bottom, as well as the interactive text features of top-bottom pairs. With this branch, our model can autonomously identify the crucial text features that contribute to the compatibility interaction.
For a pair of textual features $t_{t_i}$ and $t_{b_j}$, the input feature to the text attention branch is calculated as follows:

$$t_{ij} = \mathrm{concat}\{t_{t_i}, t_{b_j}\},$$

where concat{. . .} refers to the concatenation operation. As depicted in Figure 2, the concatenated text features are passed through a sequence of fully-connected and ReLU layers. Subsequently, a softmax function is applied to the final activation values, producing a weight vector $\omega^t_{ij}$ of dimension $D_t$. This vector determines the significance of the textual compatibility embedding interaction. The attentive textual features $\tilde{t}_{t_i}$ and $\tilde{t}_{b_j}$ of the top $t_i$ and bottom $b_j$ are then obtained by weighting $t_{t_i}$ and $t_{b_j}$ with $\omega^t_{ij}$. Likewise, we use the inner product to measure the textual compatibility between the attentive textual features of the top $t_i$ and bottom $b_j$:

$$c^t_{ij} = (\tilde{t}_{t_i})^{\top} \tilde{t}_{b_j},$$

where the inner product encodes the textual interaction score between fashion items. Finally, to comprehensively measure the compatibility embeddings utilizing the aforementioned attention branches, we define the following:

$$c_{ij} = \pi \cdot c^v_{ij} + (1 - \pi) \cdot c^t_{ij},$$

where $\pi$ is a non-negative trade-off parameter that determines the relative importance of the two modalities, and $c_{ij}$ denotes the interaction of the compatibility embeddings of the top $t_i$ and bottom $b_j$.
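The two attention branches and their fusion can be illustrated with a small, dependency-free Python sketch. Several details here are assumptions made only for illustration: the visual attention weights are taken to be a softmax over a learned linear map of the global visual feature, the attention weights are applied element-wise, the same weights are shared by tops and bottoms, the text branch uses a single hidden layer, and all dimensions and names (`W_att`, `W1`, `W2`) are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def visual_attentive(v, W_att):
    """Assumed form: weights softmax(W_att v), applied element-wise to v."""
    weights = softmax(matvec(W_att, v))
    return [w * x for w, x in zip(weights, v)]


def text_attention_weights(t_top, t_bot, W1, W2):
    """concat -> fully connected -> ReLU -> fully connected -> softmax."""
    hidden = [max(0.0, h) for h in matvec(W1, t_top + t_bot)]
    return softmax(matvec(W2, hidden))


def compatibility(v_top, v_bot, t_top, t_bot, params, pi):
    """Fuse visual and textual interaction scores; pi weights the visual one."""
    c_v = dot(visual_attentive(v_top, params["W_att"]),
              visual_attentive(v_bot, params["W_att"]))
    w_t = text_attention_weights(t_top, t_bot, params["W1"], params["W2"])
    ta_top = [w * x for w, x in zip(w_t, t_top)]
    ta_bot = [w * x for w, x in zip(w_t, t_bot)]
    c_t = dot(ta_top, ta_bot)
    return pi * c_v + (1.0 - pi) * c_t


# Toy dimensions and random weights, for illustration only.
rng = random.Random(0)
D_v, D_t, H = 4, 3, 5


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


params = {"W_att": rand_matrix(D_v, D_v),
          "W1": rand_matrix(H, 2 * D_t),
          "W2": rand_matrix(D_t, H)}
v_top = [rng.random() for _ in range(D_v)]
v_bot = [rng.random() for _ in range(D_v)]
t_top = [rng.random() for _ in range(D_t)]
t_bot = [rng.random() for _ in range(D_t)]
c_ij = compatibility(v_top, v_bot, t_top, t_bot, params, pi=0.5)
```

Because the text attention weights are computed from the concatenated pair, the attentive textual feature of an item depends on the item it is paired with, whereas the visual branch in this sketch attends to each item independently.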

Attention-Based Personal Preference Modeling
Matrix factorization has proven effective in personalized recommendation tasks [33][34][35][36][37][38]. Drawing on it, we propose a model that captures users' personalized preferences for a specific type of product, namely bottoms. The underlying principle is to decompose the user-item interaction matrix into latent factors representing users and items. Additionally, building upon the work of Song et al. [11], we extend the matrix factorization approach with latent factors that capture users' content-based preferences. This is crucial because users' preferences for fashion items may stem from visual or textual features. For instance, users may prioritize visual characteristics like color and pattern, or textual features like brand and material. We therefore consider both the latent factors of users and fashion items and their content-based factors.
As in the compatibility module, we employ the inner product to encode both the latent scores and the content-based scores of user-item interactions. Taking the personal preference $p_{mj}$ of user $u_m$ for the bottom $b_j$ as an example, the latent score is the inner product of the latent factors of $u_m$ and $b_j$, and the content-based scores are the inner products of the user's content-based preference factors with the visual and textual features of $b_j$.
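A minimal sketch of this preference score follows. It only illustrates the structure described above (a latent inner product plus content-based inner products); bias terms and any weighting between the visual and textual content scores are omitted, and all argument names are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def personal_preference(gamma_user, gamma_bottom,
                        xi_visual, v_bottom,
                        xi_textual, t_bottom):
    """Preference p_mj of user u_m for bottom b_j.

    gamma_user, gamma_bottom: latent factors from matrix factorization.
    xi_visual, xi_textual: the user's content-based preference factors
    (assumed here to live in the same space as the item features).
    v_bottom, t_bottom: visual and textual features of the bottom.
    """
    latent_score = dot(gamma_user, gamma_bottom)
    content_score = dot(xi_visual, v_bottom) + dot(xi_textual, t_bottom)
    return latent_score + content_score


# Toy values, invented for illustration only.
p_mj = personal_preference(
    gamma_user=[1.0, 2.0], gamma_bottom=[0.5, -1.0],
    xi_visual=[1.0, 0.0, 1.0], v_bottom=[0.2, 0.3, 0.4],
    xi_textual=[2.0], t_bottom=[0.25],
)
```

Here the latent score is -1.5 and the content-based score is 1.1, so the content-based factors partly offset a negative latent affinity.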

Objective Function
Based on the BPR framework [18], we model the implicit interactions between users and fashion items. BPR has been shown to represent implicit preferences effectively in various studies [11,12,16,17,39,40]. To train the model, we construct a training set of quadruplets $(m, i, j, k)$, where each quadruplet denotes that the user $u_m$ prefers the bottom $b_j$ over $b_k$ for the given top $t_i$. The objective function is defined over these quadruplets using both the compatibility embeddings and the personal preference, where $c_{ik}$ indicates the compatibility interaction between the top $t_i$ and bottom $b_k$, and $p_{mk}$ denotes the personal preference of the user $u_m$ towards the bottom $b_k$, both calculated as in Sections 3.2.1 and 3.2.2. $\lambda$ is a non-negative hyperparameter, and $\Theta_F$ refers to the set of parameters of the model, including ω
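The BPR objective over quadruplets can be sketched as follows. This is the standard BPR form (negative log-sigmoid of the score difference plus L2 regularization); the factor of one half on the regularizer and the flat parameter list are assumptions of this sketch rather than details confirmed by the text.

```python
import math


def log_sigmoid(x):
    """Numerically stable ln(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bpr_loss(score_pairs, params, lam):
    """BPR loss for a batch of quadruplets (m, i, j, k).

    score_pairs: list of (e_mij, e_mik), the combined scores of the
    preferred bottom b_j and the non-preferred bottom b_k.
    params: flat list of parameter values used for L2 regularization.
    lam: non-negative regularization weight (lambda).
    """
    ranking_term = -sum(log_sigmoid(pos - neg) for pos, neg in score_pairs)
    regularizer = 0.5 * lam * sum(p * p for p in params)
    return ranking_term + regularizer


# Toy scores, invented for illustration only.
loss = bpr_loss([(1.2, 0.3), (0.1, 0.4)], params=[0.5, -0.5], lam=0.01)
```

The loss decreases as the score of the preferred bottom rises above that of the non-preferred one, which is what pushes compatible pairs ahead of incompatible pairs in the ranking.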

Experiment
To evaluate the effectiveness of the proposed method, we conduct comprehensive experiments on the large-scale real-world dataset IQON3000 [11], extracted from the social commerce website IQON.

Dataset
Our experiments were conducted on the IQON3000 real-world dataset [11], comprising 216,791 outfits created by 3568 users using 650,373 fashion items. The outfit splits provided by the authors were utilized, including 170,601 quadruplets in the training set, 23,095 quadruplets in the validation set, and 23,095 quadruplets in the test set. Each fashion item, encompassing all tops and bottoms, is associated with a visual image and a textual description. We merged all quadruplets from the training set and incorporated signals from both modalities to train the PCE-Net model.

Implementation
Visual Representation. To understand the visual attributes of fashion items, a convolutional neural network (CNN) is employed as the feature extractor. Deep CNNs have demonstrated outstanding performance in image representation learning [41][42][43]. Specifically, the ResNet-50 [44] is selected as the visual representation learning module. For each fashion item image, the final global average pooling layer's output is considered the preliminary visual characteristics. These outputs are 2048-D vectors that serve as the main visual features in the vision modality. By combining these features with the visual attention branch discussed in Section 3.2.1, we obtain the ultimate visual attributes of each fashion item.
Textual Representation. One limitation must be acknowledged here. Due to the closure of the fashion website IQON, we cannot retrieve the text descriptions from the data URLs provided by the authors of GP-BPR [11]. Thus, we rely on the text features previously extracted by those authors, whose approach we briefly summarize. The authors utilized the category metadata and title descriptions as textual information for the fashion items. These textual inputs were tokenized using the Japanese morphological analyzer Kuromoji. Each word in the text description is then represented as a 300-D vector using Nwjc2vec [45], a Japanese Word2vec method. Subsequently, the feature matrix for the overall textual description is constructed, with each word's feature vector occupying a distinct row. This textual feature matrix is input into a single-channel CNN comprising a convolutional layer, a max pooling layer, and an activation layer. The output of this CNN, a 400-D vector for each fashion item, serves as the preliminary textual features, which we adopt as the initial textual modality features of the fashion items. Subsequently, the top and bottom features are fed into the text attention branch to compute the final textual features for each fashion item.
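The single-channel text CNN described above can be sketched in plain Python. The kernel heights, the number of filters, and the use of ReLU as the activation are assumptions for illustration (for instance, four kernel heights with 100 filters each would be one configuration that yields a 400-D output); the toy dimensions below are far smaller.

```python
import random


def text_cnn_features(word_vectors, kernels):
    """Convolution over word positions, max pooling, then ReLU.

    word_vectors: list of n word vectors, each of length d (one per row).
    kernels: dict mapping kernel height h -> list of filters, where each
    filter is a list of h rows of length d.
    Returns one feature per filter, concatenated over kernel heights.
    """
    features = []
    n_words = len(word_vectors)
    for height, filters in sorted(kernels.items()):
        if n_words < height:
            raise ValueError("text shorter than kernel height; pad it first")
        for filt in filters:
            responses = []
            for start in range(n_words - height + 1):
                window = word_vectors[start:start + height]
                responses.append(sum(
                    w * x
                    for f_row, x_row in zip(filt, window)
                    for w, x in zip(f_row, x_row)
                ))
            # Max pooling over positions, followed by the activation.
            features.append(max(0.0, max(responses)))
    return features


# Toy sizes and random values, for illustration only.
rng = random.Random(0)
d, n_words, n_filters = 6, 7, 3
kernels = {
    h: [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(h)]
        for _ in range(n_filters)]
    for h in (2, 3)
}
words = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_words)]
features = text_cnn_features(words, kernels)
```

Max pooling over positions makes the output length independent of the number of words, so descriptions of different lengths map to vectors of the same size.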
Detail Settings. The trade-off parameters π and µ are explored in the interval [0.0, 1.0], with π = 0.5 and µ = 0.1 identified as the optimal values. During training, the model parameters in Equation (16) are randomly initialized from a normal distribution. Furthermore, the weights of the visual attention branch and textual attention branch in Equations (6), (9), and (13) are initialized using the Xavier method [46] and a uniform distribution, respectively. For optimization, we utilize the Adam algorithm [47] with a learning rate of 0.001, selected from {0.0005, 0.001, 0.005, 0.01}. To expedite training and promote faster convergence, a mini-batch size of 64 is employed. The proposed approach is fine-tuned for 100 epochs, and the model's performance is evaluated on the test set. Finally, the area under the ROC curve (AUC) [48] is used as the metric to assess the effectiveness of the attention-based personalized compatibility embedding network.
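For ranking over quadruplets, AUC reduces to the fraction of test cases in which the preferred bottom receives the higher score. The sketch below averages this fraction per user and then across users, which is an assumption about the exact aggregation; the record layout and names are hypothetical.

```python
from collections import defaultdict


def pairwise_auc(records):
    """AUC over test quadruplets, averaged per user and then over users.

    records: iterable of (user_id, score_positive, score_negative), where
    the scores are e_mij and e_mik for one test quadruplet (m, i, j, k).
    """
    hits_per_user = defaultdict(list)
    for user, pos, neg in records:
        hits_per_user[user].append(1.0 if pos > neg else 0.0)
    if not hits_per_user:
        raise ValueError("no test quadruplets given")
    per_user = [sum(h) / len(h) for h in hits_per_user.values()]
    return sum(per_user) / len(per_user)


# Toy scores, invented for illustration only.
records = [("u1", 0.9, 0.1), ("u1", 0.2, 0.4), ("u2", 0.7, 0.3)]
auc = pairwise_auc(records)
```

In this toy case user u1 is ranked correctly in one of two quadruplets and user u2 in one of one, giving an AUC of 0.75.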

Results and Discussion
We consider the following baselines in the top-bottom pairs recommendation experiments to evaluate the proposed model.
• POP-T: POP is frequently used as a baseline in recommender systems [49]. POP-T simply selects the most popular bottoms for each top and vice versa. Here, "popularity" is defined as the number of tops paired with the bottom, i.e., the number of top-bottom pairs in the training set.
• POP-U: For this baseline [49], the "popularity" of a bottom is determined by the number of users that used this bottom as a component of an outfit in the training set.
• GP-BPR: This model [11] combines visual and textual features of clothing with personal preferences to jointly model general (item-item) compatibility and personal (user-item) preferences, where matrix factorization of the user-item interaction matrix is performed to obtain the latent user preferences.
• PAI-BPR: This model [12] is an attribute-based interpretable personal preference modeling scheme, where personalization is achieved by taking inspiration from GP-BPR [11] and adding attribute-wise interpretable results. Since the code is not publicly available, we directly report the experimental results of Table III in the original paper [12] for quantitative comparison.
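The two popularity baselines amount to simple counting over the training outfits. The sketch below assumes each training record is a (user, top, bottom) triple and counts distinct tops and distinct users per bottom; whether repeated pairs should be counted more than once is not specified, so the use of sets is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict


def popularity_tables(outfits):
    """Build POP-T and POP-U popularity scores for bottoms.

    outfits: iterable of (user_id, top_id, bottom_id) from the training set.
    Returns (pop_t, pop_u): bottom id -> number of distinct paired tops,
    and bottom id -> number of distinct users who used the bottom.
    """
    tops_per_bottom = defaultdict(set)
    users_per_bottom = defaultdict(set)
    for user, top, bottom in outfits:
        tops_per_bottom[bottom].add(top)
        users_per_bottom[bottom].add(user)
    pop_t = {b: len(tops) for b, tops in tops_per_bottom.items()}
    pop_u = {b: len(users) for b, users in users_per_bottom.items()}
    return pop_t, pop_u


def prefers(popularity, b_j, b_k):
    """True if the baseline ranks bottom b_j above bottom b_k."""
    return popularity.get(b_j, 0) > popularity.get(b_k, 0)


# Toy training records, invented for illustration only.
outfits = [("u1", "t1", "b1"), ("u1", "t2", "b1"),
           ("u2", "t1", "b2"), ("u2", "t3", "b1")]
pop_t, pop_u = popularity_tables(outfits)
```

Neither baseline looks at item content or at the specific user being served, which is why they act as a lower reference point for the personalized models.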
The performance comparison of the various methods is shown in Table 1. These quantitative results allow us to draw the following conclusions:
• BPR-DAE outperforms Bi-LSTM, demonstrating that the content-based model, which captures the compatibility relationship between items by directly extracting features from multimodal data, is superior to the sequential model (which predicts the following item from the previous one).
• VTBPR performs better than VBPR, TBPR, and BPR-MF, indicating the value of multimodal data in enhancing model performance.
• To solve the problem of personalized clothing matching, GP-BPR and PAI-BPR combine generalized item-item compatibility and user-item preferences using multi-modal characteristics. PAI-BPR additionally uses an attribute classification network to address the interpretability of the model, and its performance is slightly improved.
• PCE-Net obtains the best performance among the compared methods, even though it has no attribute classification module, since PCE-Net does not focus on interpretability. Our model can automatically capture the compatibility features of multi-modal data using two attention branches separately, which indicates that further development and exploitation of multi-modal data is necessary for embedding learning tasks.
Qualitative Results. To visually demonstrate the superiority of our model compared with GP-BPR [11] in personalized clothing matching and to analyze the impact of each component in our model, we present multiple sets of test samples along with the model's evaluation of positives and negatives in Figure 3. As previously mentioned, the testing quadruplet $(m, i, j, k)$ indicates that the user $u_m$ prefers the bottom $b_j$ over $b_k$ for the given top $t_i$. In the first example involving user 1, our model determines that the bottom $b_k$ is more compatible with $t_i$ than $b_j$. Further analysis reveals that our model captures the user's historical preference for brown skirts, which stems from the contribution of the visual attention branch. In contrast, GP-BPR [11] captures the user's preference for short skirts. Even though our model predicts incorrectly, there is a visually compatible relationship between $t_i$ and $b_k$, which indicates that our model is capable of generating some convincing matching suggestions. In the second example, our model learns the visual features of fashion items better than GP-BPR.
Figure 3. Illustration of the method comparison and the influence of the attention-based compatibility embedding and personal preference modeling. All the quadruplets satisfy the ground truth that $\{u_m, t_i\} : b_j > b_k$. PCE-Net-C and PCE-Net-P denote the above two components, respectively. "✓" and "×" indicate the correct and wrong judgements of the model, respectively.
In the first example involving user 3, the bottoms $b_j$ and $b_k$ both resemble clothing styles in the user's historical preferences, which may lead PCE-Net-P to predict incorrectly. However, PCE-Net-C considers the fine-grained compatibility relationships between fashion items: black and white pairing is more common in the matching rules, and white clothing is more versatile in the matching results, i.e., the matching degree is high, so PCE-Net finally obtains the correct evaluation result. In the second example, the top $t_i$ and the bottom $b_k$ are indistinguishable in terms of visual features such as color, which leads to an incorrect prediction by PCE-Net-C. At this point, component P (i.e., Personal Preference Modeling) enables PCE-Net to obtain the correct prediction result by capturing the user's historical preferences.
Above, we have demonstrated that our model outperforms other baseline models to a certain extent. Additionally, both components of our model are indispensable and serve as complementary sources of information to enhance its overall performance.
To assess the learning ability of the proposed PCE-Net, we visually analyze the compatibility relation for positive and negative pairs, as well as the resulting compatibility embedding space.
Compatible Relations Visualization. Figure 4 illustrates the application of t-SNE [50] to represent the learned compatible relation space. Each dot in the plot represents the multimodal fusion feature of a top-bottom pair. The red dots represent compatible top-bottom pairs, while the blue dots represent incompatible pairs. The separation of the two relations learned by PCE-Net indicates that our model effectively distinguishes whether a compatible relationship exists between the top-bottom pairs and produces convincing matching results. We also observe some crossover points between the red and blue regions. This can be attributed to two factors: (1) The attention mechanism renders compatibility interaction modeling more intricate and implicit than general compatibility modeling. (2) As mentioned earlier, textual attention relies on a top-bottom feature connection, which means that the final features of an item may contain some characteristics of the other item, resulting in erroneous relationship predictions.
Compatibility Embeddings Visualization. In this part, we utilize t-SNE [4] to visualize the distribution of a subset of, and then all, test triplets (i, j, k) from IQON3000 [11] in two-dimensional space. The results are shown in Figures 5 and 6. Figure 5 represents the compatibility embedding space with ten triplets. On the left, dotted lines mark the distances for two example triplets. On the right, the item enclosed in an orange box is the given top, while the items in green and red boxes are the compatible and incompatible bottoms, respectively. The length of each dotted line corresponds to the distance between the top and the bottom. In the latent compatibility space, compatible items should be closer. Both example triplets satisfy this criterion, illustrating that our model effectively learns compatible embeddings.
Figure 6 visualizes the distribution of all features obtained by combining the multimodal features of each test triplet (i, j, k). The blue dots represent the tops, while the orange and green dots denote the positive and negative bottoms, respectively. Notably, the blue region is distinct from the areas occupied by the orange and green dots. By contrast, the overlap between the orange and green dots is reasonable, since both belong to the same category. We attribute this distribution to the two feature extractors and two attention branches that our model employs to learn the embeddings of fashion items.
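The criterion that compatible items should lie closer in the latent space can be checked numerically as well as visually. The following is a minimal sketch, assuming Euclidean distance and hand-made two-dimensional embeddings; the function names and the toy values are hypothetical and are not taken from PCE-Net.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_satisfied(top, pos_bottom, neg_bottom):
    """True if the compatible bottom is closer to the top than the incompatible one."""
    return euclidean(top, pos_bottom) < euclidean(top, neg_bottom)


def satisfaction_rate(triplets):
    """Fraction of (top, positive bottom, negative bottom) triplets meeting the criterion."""
    if not triplets:
        raise ValueError("at least one triplet is required")
    hits = sum(triplet_satisfied(t, p, n) for t, p, n in triplets)
    return hits / len(triplets)


# Toy embeddings (assumed values, for illustration only).
toy_triplets = [
    ((0.0, 0.0), (1.0, 0.0), (3.0, 4.0)),  # distances 1.0 vs 5.0: satisfied
    ((1.0, 1.0), (1.0, 2.0), (4.0, 5.0)),  # distances 1.0 vs 5.0: satisfied
    ((0.0, 0.0), (2.0, 2.0), (1.0, 0.0)),  # distances ~2.83 vs 1.0: violated
]
rate = satisfaction_rate(toy_triplets)
```

Such a check is best applied to the learned embeddings themselves rather than to their t-SNE projections, since t-SNE does not preserve distances exactly.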

Ablation Study
Different Modalities. We further assess the contribution of each input modality using two variants of PCE-Net: PCE-Net-V and PCE-Net-T. PCE-Net-V uses only the visual modality to extract compatibility features, while PCE-Net-T uses only the textual modality. Table 2 presents the performance of PCE-Net with each of these inputs. For a more precise comparison, we also report the results of the strongest baselines (GP-BPR [11] and PAI-BPR [12]) restricted to a single modality. From Table 2, we make the following observations: (1) PCE-Net-V outperforms GP-BPR-V and PAI-BPR-V, and PCE-Net-T outperforms GP-BPR-T and PAI-BPR-T. This validates the effectiveness of the attention branches we introduced to enhance model performance for different modalities. (2) PCE-Net performs better than both PCE-Net-V and PCE-Net-T, indicating that utilizing both modalities as complementary information improves the learning of compatibility embedding and enhances personalized preference modeling. (3) Interestingly, the textual variants (-T) outperform the visual variants (-V) for the GP-BPR [11] and PAI-BPR [12] baselines. The authors of those works argue that critical features such as pattern, style, and brand are better summarized in the textual information. For instance, fashion items are more likely to be compatible if they share the same brand. However, PCE-Net-V performs on par with PCE-Net-T and even slightly outperforms it, suggesting that the visual attention branch effectively captures compatibility features automatically and enhances personalized preference modeling.
In Equation (13), the non-negative parameter π denotes the weight assigned to the visual modality. Given the observations above, it is important to incorporate both modalities into the model. Figure 7 therefore presents a line graph of the model's performance at various values of π.
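The role of π can be sketched as a weighted fusion of per-modality scores. Equation (13) is not reproduced in this section, so the convex form below (π on the visual score, 1 − π on the textual score) is an assumption, as are the function names and the toy scores. The toy triplets are constructed so that the two modalities disagree; they illustrate the mechanism and are not evidence for any particular value of π.

```python
def fuse(visual_score, textual_score, pi):
    """Assumed fusion: pi weights the visual score, (1 - pi) the textual score."""
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi is expected to lie in [0, 1]")
    return pi * visual_score + (1.0 - pi) * textual_score


def triplet_auc(triplets, pi):
    """Fraction of triplets whose positive bottom outscores the negative bottom."""
    hits = 0
    for (pos_v, pos_t), (neg_v, neg_t) in triplets:
        if fuse(pos_v, pos_t, pi) > fuse(neg_v, neg_t, pi):
            hits += 1
    return hits / len(triplets)


# Each entry: ((visual, textual) scores of the positive bottom,
#              (visual, textual) scores of the negative bottom).
# Hypothetical values chosen so that each modality alone fails on one triplet.
toy_triplets = [
    ((0.9, 0.2), (0.3, 0.6)),  # visual favors the positive, text the negative
    ((0.2, 0.9), (0.6, 0.3)),  # text favors the positive, visual the negative
    ((0.7, 0.7), (0.4, 0.4)),  # both modalities favor the positive
]

sweep = {pi: triplet_auc(toy_triplets, pi) for pi in (0.0, 0.5, 1.0)}
```

On these toy scores, either modality alone ranks two of the three triplets correctly, while the equal-weight fusion ranks all three correctly.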
Figure 7 shows that our model achieves optimal performance when π = 0.5, indicating that both modalities hold equal importance.
Figure 7. Performance of PCE-Net with respect to the trade-off parameter π; π = 0.5 is the best.
Attention Branch. The attention branch plays a crucial role in our study. To evaluate its impact on the model across different modalities, we present quantitative data in Table 3. The findings show that employing two separate attention branches to encode the preprocessed features of the visual and textual modalities effectively captures compatible interaction features. Furthermore, these features serve as complementary information, ultimately enhancing the model's overall performance.
Component-level results are reported in Table 4, specifically focusing on two components: compatibility embedding modeling and personal preference modeling within PCE-Net. Our observations are as follows: (1) Our complete model surpasses the two derived models that contain only one component. This substantiates the vital role each component plays in our model. (2) PCE-Net-P outperforms PCE-Net-C, signifying that users' historical preferences effectively capture their personalized preferences, thereby influencing the outcome of the personalized clothing matching task.
Additionally, we evaluate our model's sensitivity to the trade-off parameter µ in Equation (2), plotting its performance over the range [0.0, 1.0] as a line graph in Figure 8. To allow a clear comparison with the baseline GP-BPR [11], we include the results of both models in the same graph as µ varies. We cannot compare our model with PAI-BPR [12], because its original paper provides neither publicly available code nor the relevant experimental results. The figure demonstrates that, for most parameter values, our model outperforms GP-BPR, thereby affirming the validity of our approach.
It is worth noting that PCE-Net exhibits lower performance than GP-BPR when µ = 1.0, likely because our text attention interaction branch constructs interactions between the top-bottom pairs, resulting in the fusion of their respective features. Consequently, without the personalized preference component's guidance, our model's performance in item recommendation is compromised.
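The effect of µ on a recommendation can be sketched as follows. Equation (2) is not reproduced in this section, so the form below is an assumption: µ weights the general compatibility score and 1 − µ weights the personal preference score, which is consistent with the remark that µ = 1.0 removes the guidance of the preference component. The candidate names and scores are hypothetical.

```python
def personalized_score(compatibility, preference, mu):
    """Assumed blend: mu weights compatibility, (1 - mu) weights user preference."""
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu is expected to lie in [0, 1]")
    return mu * compatibility + (1.0 - mu) * preference


def rank_bottoms(candidates, mu):
    """Return candidate names sorted from highest to lowest blended score."""
    return sorted(
        candidates,
        key=lambda name: personalized_score(*candidates[name], mu),
        reverse=True,
    )


# Hypothetical candidates for one (user, top) query:
# name -> (compatibility with the top, match to the user's history).
candidates = {
    "black_jeans": (0.60, 0.90),
    "white_skirt": (0.80, 0.30),
    "striped_shorts": (0.40, 0.50),
}

compat_only = rank_bottoms(candidates, mu=1.0)  # preference component ignored
pref_only = rank_bottoms(candidates, mu=0.0)    # compatibility component ignored
balanced = rank_bottoms(candidates, mu=0.5)
```

With these toy scores, the top-ranked bottom changes from the white skirt at µ = 1.0 to the black jeans at µ = 0.5 and µ = 0.0, which shows how the preference term can override a purely compatibility-based ranking.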

User Study
An additional user study is conducted to further validate our model's effectiveness. A total of 100 participants are selected for this study and presented with seven questions, shown in Figure 9 (only six are displayed). For the first question, for example, participants are given five pairs representing the user's historical style preferences from top to bottom. They are then shown a triplet from the test set, consisting of a top, a positive bottom, and a negative bottom. The top serves as the condition, while the positive and negative bottoms are given as two options. Note that the order in which the options are presented is unrelated to which bottom is positive or negative. All participants are then asked to choose the bottom that is compatible with the given top and aligned with the user's historical preferences. More than half of the participants choose the positive bottom for every question except the last one.
In Q1, the user exhibits a preference for black bottoms and primarily favors black-black matching rules. Notably, 86% of the participants chose the positive bottom A, indicating that our model successfully simulates user preferences. In Q3, bottom A and bottom B have similar visual styles and both align with the user's historical preferences, which led 43% of the participants to choose the negative bottom B. Upon consultation, it became apparent that they had largely overlooked the compatibility features shared by bottom A and the top (please refer to the enlarged pink patterns for color reference). In contrast, our model took these compatibility features into account and predicted correctly. This further illustrates the capability of our model to learn fine-grained compatibility embeddings and enhance the performance of downstream tasks.
As for Q6, our model predicted incorrectly; we nevertheless included this question in the user study to determine whether the model could still generate convincing recommendations. Surprisingly, 70% of the participants also selected option B, the negative bottom. This suggests that our model's incorrect choice can be attributed to its inference that the user prefers black bottoms to complement the given top. Despite the incorrect prediction in this case, the fact that 70% (>50%) of the participants selected the same option as our model supports the view that the model can provide persuasive bottom recommendations that are compatible with the given top.

Conclusions and Future Work
Our research addresses the task of modeling compatibility embeddings in the context of fashion. To achieve this, we propose an attention-based personalized compatibility embedding network called PCE-Net, which consists of two components: attention-based compatibility embedding modeling and attention-based personalized preference modeling. By incorporating multiple attention branches for visual and textual modalities of fashion items, our model automatically captures features relevant to compatibility embedding, thereby benefiting downstream tasks such as top-bottom matching. To evaluate the effectiveness of our model, we conducted various experiments on the IQON3000 dataset. These experiments encompassed quantitative and qualitative comparisons, ablation studies for each modality/component/attention branch, t-SNE visualization, and user studies. The results of these experiments corroborated the model's ability to learn compatibility embeddings and generate convincing matching results.
However, our work has a limitation: we model users' personalized preferences using only latent preference factors for bottoms and content-based preference factors. In the future, we intend to address this limitation by incorporating a component that searches for visually or textually similar tops in the candidate pool, enabling the identification of compatible bottoms for personalized recommendations and ultimately enhancing the performance of the recommendation task for fashion collections.