1. Introduction
The growth of online fashion communities has created a fertile ground for research in outfit compatibility modeling (see
Figure 1). Researchers can leverage large-scale datasets from platforms like Pinterest to train models that assess both visual harmony between individual items and the overall stylistic coherence of an outfit. The central challenge of this task lies in developing a comprehensive understanding of what makes a group of items compatible.
Unimodal approaches, which rely exclusively on either visual or textual information, are inherently limited in capturing the diverse and complementary signals required for accurate outfit compatibility assessment. Fashion items are intrinsically multimodal—each possesses a visual appearance, a textual description, and categorical attributes. Therefore, key aspects such as the alignment between a visual pattern (e.g., stripes) and its textual theme (e.g., “striped design”), or the categorical coherence among items (e.g., matching a formal shirt with tailored pants), are crucial for robust compatibility evaluation. Thus, multimodal fusion is not just a performance enhancer but an essential foundation for understanding the complex, interacting factors that determine fashion compatibility. This approach paves the way toward more intelligent, reliable, and context-aware fashion recommendation systems [1,2].
Although prior research has endeavored to obtain complete item representations by learning the multimodal information of individual fashion items, and these endeavors have yielded favorable outcomes in tackling fashion compatibility [3,4], three principal challenges remain.
1. Long-distance correlated features in images. As shown in
Figure 2, long-distance correlated features in images, such as color and texture correlations between a red embroidered sweater and a green A-line skirt, play a critical role in fashion compatibility modeling. Effectively mining these long-range dependencies and enhancing the quality of visual representations are therefore essential for improving fashion compatibility evaluation.
2. Interactive relationships between multimodal information. Despite the significant heterogeneity between images and text, their interaction reveals critical intermodal relationships such as complementarity and consistency, as shown in
Figure 3. A key challenge, therefore, lies in achieving deep fusion of these modalities to fully leverage multimodal information.
3. Complex correlations among items. As shown in
Figure 4, the modeling of correlations among items presents a multifaceted challenge. Beyond negative correlations, such as stylistic conflicts and color disharmony, multi-scale correlations—from fine-grained item-to-item interactions to holistic outfit-level compatibility—must also be considered. Comprehensively integrating these complex relationships poses a significant challenge in fashion compatibility evaluation.
To address the aforementioned challenges, we propose a Correlation-aware Multimodal Fusion Network (CMFN) model to accomplish the task of fashion compatibility evaluation. First, ResNet101 and word embedding are employed to extract visual and textual features, respectively. The visual/textual embedding is utilized to align multi-scale visual features with textual features, while CM-Mamba is leveraged to enhance long-distance correlated visual features. Second, consistent and complementary characteristics within multimodal features are deeply integrated. Finally, correlations across different items are mined based on negative and multi-scale correlations, followed by using a multilayer perceptron (MLP) to achieve fashion compatibility evaluation. The main contributions can be outlined as follows:
(1) We propose a novel framework that integrates a state-space model with a dynamic weighting mechanism. This design effectively captures long-distance correlated features and adaptively enhances the most discriminative ones, thereby significantly improving the quality of visual representations for fashion compatibility modeling.
(2) We introduce an improved dual-interaction mechanism to capture complex intermodal relationships. This can promote deeper multimodal fusion and lead to more effective multimodal learning.
(3) Our method comprehensively measures complex correlations among items by fusing negative and multi-scale correlations into a unified metric. This integrated approach establishes a more holistic basis for compatibility assessment.
(4) Extensive experiments are conducted to evaluate our proposed CMFN on publicly available datasets. The results demonstrate that our model achieves superior performance compared to state-of-the-art methods. Ablation studies and case analyses verify the individual contribution of each module within the model architecture.
The rest of this article is organized as follows.
Section 2 briefly reviews related work. In
Section 3, the details of our proposed CMFN scheme are presented. The experimental results and comprehensive analyses are reported in
Section 4, and the conclusions are set out in
Section 5.
2. Related Work
This section reviews two major areas related to our research: modality processing methods and fashion compatibility evaluation models.
2.1. Modality Processing
Fashion compatibility depends not only on visual information but also on the textual descriptions of items. Thus, the field has gradually evolved from exploring single features to utilizing richer multimodal features such as images and text. According to the input information of fashion items, existing modality processing research can be broadly divided into two categories: single-modal methods and multimodal methods.
Single-modal methods utilize only the visual or textual modality of fashion items. In the early stages, Shen et al., Zhao et al., and Lin et al. [5,6,7] conducted fashion compatibility evaluations using only textual information, including brands, types, and colors. However, fashion items are highly visually dependent, and their characteristics cannot be fully described by text. This led researchers to study fashion compatibility using both shallow image features and more expressive deep image features [8,9,10]. However, these studies relied on image features alone and lacked multimodal fusion, which limited model performance. Consequently, researchers began to incorporate both images and text.
Multimodal methods involve more than one modality of fashion items. For example, Veit et al. [11] added textual features of individual items on top of image features and conducted compatibility modeling using a Siamese CNN. Song et al. [12] extracted image features with a CNN and textual features from titles with a Text CNN to model fashion compatibility. However, these methods overlook the long-distance correlations of visual features, which is precisely an innovative aspect of this study in modality extraction. Since image features and textual features are heterogeneous data with relationships of consistency and complementarity, several scholars have attempted to use fusion strategies (i.e., early fusion [13], intermediate fusion [14], and late fusion [15,16]) to integrate multimodal information.
Early fusion methods: The different modal features are fused into a single representation before compatibility evaluation. Existing early fusion methods mainly adopt strategies such as concatenation, summation, and bilinear pooling. For example, Yang et al. [17] and Zhan et al. [18] employ concatenation operations for multimodal features. Dosovitskiy et al. [19] arranged vectors of fixed dimensionality generated by pre-trained models in a specific order for summation fusion. Devlin et al. [20] and Tsai et al. [21] fused different modalities into a joint representation space by computing the outer product of visual and textual feature vectors. However, these methods cannot account for modal interactions and impose high requirements on the spatial alignment of multimodal data, which has motivated more interactive fusion approaches.
Intermediate fusion methods: These methods typically perform fusion after independently extracting features from each modality to form a unified representation for compatibility evaluation. Yang et al. [22] and Lee et al. [23] proposed Stacked Attention Networks (SANs) that use multilayer attention models to achieve the fusion of images and texts. Wang et al. [2] and Laenen et al. [24] used multimodal contrastive learning and an attention mechanism, respectively, to fuse visual and textual features. Tang et al. [25] achieved efficient fusion of RGB and thermal infrared modalities by combining a unified encoder and a triple-stream architecture with the MFM, RASPM, and MDAM modules. Zhao et al. [26] adopted a hybrid encoder with parallel Transformer and CNN branches, together with a composite attention fusion strategy consisting of axial attention, channel attention, and auxiliary connection modules, achieving high-precision and general-purpose fusion of multimodal images.
Late fusion methods: These methods directly model the compatibility of features from each modality and finally combine the per-modality clothing compatibility scores linearly. Cui et al. [15] proposed a Node-wise Graph Neural Network (NGNN) for fashion compatibility modeling, where the overall compatibility score is derived as the weighted sum of the scores from the visual and textual modalities. Li et al. [27] developed the Hierarchical Fashion Graph Network (HFGN) to model fashion compatibility, which calculates the compatibility score of items from their final representations. Although these three classes of methods have achieved significant results, the heterogeneity of multimodal data and the complexity of their interactions make multimodal fusion a persistent challenge in current research.
2.2. Compatibility Evaluation
Fashion compatibility modeling necessitates that the complex correlations among diverse fashion items be explored to enable more precise compatibility evaluation. To effectively capture these intricate correlations, researchers have employed various advanced methodologies.
Some scholars have explored the correlations of fashion items using graph network methods. For instance, Song et al. [12] utilized Graph Convolutional Networks (GCNs) to investigate the intramodal and intermodal correlations among items. Li et al.’s [27] aforementioned Hierarchical Fashion Graph Network (HFGN) models correlations among users, items, and outfits. Subsequently, some researchers have explored item correlations by combining attention mechanisms with constructed graph structures. Cui et al. [28] adopted Graph Attention Networks to aggregate information on the correlations among items and then utilized a self-attention compatibility predictor for evaluation. Zhuo et al. [29] leveraged Hypergraph Neural Networks and Graph Convolutional Networks to capture complex item correlations, incorporating attention mechanisms to achieve compatibility evaluation. Daehee et al. [30] integrated user/item correlations by leveraging subgraph-based graph neural networks and used node-aware attention pooling for compatibility evaluation. However, GNNs capture only relatively simple correlations and therefore cannot fully exploit the complex correlations among items.
Other scholars have investigated these correlations using distance metrics. Wang et al. [31] investigated the correlations between items through the distances of features at various levels, combined with a multilayer perceptron for compatibility evaluation. Vasileva et al. [32] and Li et al. [33] explored the correlations among item categories by learning latent spaces of category embeddings, subsequently employing an MLP for compatibility evaluation. While learning category embeddings helps the model grasp various similarity concepts, the requirement for explicit labels during testing limits the model’s ability to generalize to unknown categories. In response, Tan et al. [34] proposed jointly learning conditional similarity representations and their contributions without explicit supervision, utilizing the inherent characteristics of items to learn condition-aware embeddings combined with a triplet loss for compatibility evaluation. However, comprehensively considering negative correlations and multi-scale correlations remains an open challenge.
In addition, optimization algorithms play a core role in the training phase of deep learning models. In fashion compatibility modeling, the Adam method is often chosen for parameter optimization, while meta-heuristic algorithms also contribute significantly to hyperparameter tuning. Erden [35] proposed a hyperparameter optimization method based on genetic algorithms to find the best parameter combinations and validated its effectiveness for deep neural network models through comparisons. Guo et al. [36] introduced a structural parameter optimization scheme that combines transfer learning optimization networks with the gray wolf optimization algorithm, achieving remarkable results in the field of metasurface technology. Maria et al. [37] used artificial neural networks (ANNs) tuned with particle swarm optimization (PSO) to address challenges in financial prediction, demonstrating higher reliability. Assiri et al. [38] adopted the Piranha Fish Optimization Algorithm (PFOA) to tune the hyperparameters of a stacked sparse autoencoder to improve the fault recognition rate, and verified its effectiveness through extensive simulations.
2.3. Summary
Research on modality processing and fashion compatibility evaluation has gained increasing significance, and various approaches have been explored. However, challenges remain in handling the complex correlations among multimodal data and among items. This research therefore focuses on resolving the following critical research questions:
1. Mining long-distance correlated visual features: How to effectively extract visual features that encompass long-distance correlations is critical for enhancing the quality of visual features to achieve a more comprehensive understanding of the correlations among fashion items.
2. Insufficient multimodal fusion: Given the significant differences between image data and text data, as well as the complementary and consistent interactive relationships between them, how to deeply integrate these modalities to fully leverage information still merits further research.
3. Exploring complex correlations among items: This research aims to conduct an in-depth investigation into the challenges associated with various correlations, including the negative and multi-scale ones, facilitating a thorough evaluation of fashion compatibility.
3. Proposed Methodology
The framework of CMFN, as shown in
Figure 5, comprises five parts: (1) multimodal feature extraction, which extracts visual features and textual features; (2) multimodal feature alignment, which aligns the visual and textual features; (3) visual feature enhancement, which captures long-distance correlated visual features to boost the quality of visual representations; (4) multimodal feature fusion, which captures consistency and complementarity information to complete multimodal fusion via the dual-interaction mechanism; and (5) compatibility evaluation, which calculates the correlations within fashion outfits and performs compatibility scoring. In this section, we first define the problem and notations and then provide details of our proposed CMFN.
3.1. Problem Formulation and Notations
In general, fashion compatibility modeling aims to predict whether a given fashion outfit is compatible or not, and it is a type of binary classification problem. Suppose there is a fashion outfit that contains a set of fashion items belonging to N different categories, O = {o_1, o_2, ..., o_N}, where o_i is the i-th category item in the outfit. There is also a training set of M fashion outfits, S = {(O^m, y^m)}_{m=1}^{M}, where O^m is the m-th fashion outfit, and y^m is the ground-truth compatibility label of the m-th fashion outfit. Specifically, y^m = 1 means that the m-th fashion outfit is compatible, and y^m = 0 means the opposite. In this research, we use the visual image v_i^m of each fashion item o_i and a textual description t_i^m as our multimodal input information. The objective is to evaluate the compatibility score for each outfit using our multimodal fashion compatibility model G. The model is defined as Equation (1):

    ŷ^m = G({(v_i^m, t_i^m)}_{i=1}^{N}; Θ),    (1)

where G denotes the CMFN model, ŷ^m denotes the predicted score for a given fashion outfit O^m, Θ refers to the parameters that can be learned in the model, and v_i^m and t_i^m represent the visual and textual information of the i-th category item in the m-th fashion outfit, respectively. Notably, we omit the superscript m of each outfit in the rest of this article for brevity.
Table 1 summarizes the main notations.
3.2. Multimodal Feature Extraction
Multimodal feature extraction aims to extract useful information from the images and textual descriptions of fashion items. In this part, the following modules are used to learn the visual and textual representation of each fashion item, respectively.
Multi-scale Visual Feature Extraction: Clothing images exhibit visual information at multiple scales, including macro-level attributes such as style, meso-level components such as shape and structure, and micro-level details such as texture and fasteners. Effectively understanding and integrating these hierarchical features is critical for accurate fashion matching. We employ ResNet101 [39] for multi-scale feature extraction, leveraging its deep architecture and skip connections, which naturally preserve both low-level textures and high-level semantic patterns. As shown in
Figure 5a, the intermediate layer features from conv2 to conv5 of ResNet101 are processed with global average pooling (GAP) [40] to obtain the multi-scale visual features f_i^k. Then, the final visual features v_i can be obtained with the fully connected layer. The specific equations are as follows:

    f_i^k = GAP(F_k),    (2)
    v_i = FC([f_i^2; f_i^3; f_i^4; f_i^5]),    (3)

Here, k indexes the intermediate layers, F_k denotes the features of the k-th layer, GAP(·) refers to the global average pooling operation, f_i^k is the middle-layer feature of the k-th layer for the i-th item, and v_i is the visual feature of the i-th item.
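To make this step concrete, the following PyTorch sketch mimics Equations (2)–(3) with a toy strided-convolution backbone standing in for ResNet101's conv2–conv5 stages; the stage widths, input resolution, and 512-d output dimension are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

class MultiScaleGAP(nn.Module):
    """Minimal sketch of multi-scale extraction: strided conv stages stand in
    for ResNet101's conv2-conv5; each stage's output is globally average-pooled
    (f_i^k) and the pooled features are projected by a fully connected layer (v_i)."""

    def __init__(self, widths=(256, 512, 1024, 2048), out_dim=512):
        super().__init__()
        chans = (3,) + tuple(widths)
        self.stages = nn.ModuleList(
            nn.Conv2d(chans[k], chans[k + 1], kernel_size=3, stride=2, padding=1)
            for k in range(len(widths))
        )
        self.gap = nn.AdaptiveAvgPool2d(1)          # GAP(.) of Eq. (2)
        self.fc = nn.Linear(sum(widths), out_dim)   # fully connected layer of Eq. (3)

    def forward(self, x):
        multi_scale = []
        for stage in self.stages:
            x = torch.relu(stage(x))
            multi_scale.append(self.gap(x).flatten(1))  # per-scale feature f_i^k
        v = self.fc(torch.cat(multi_scale, dim=1))      # final visual feature v_i
        return multi_scale, v

feats, v = MultiScaleGAP()(torch.randn(2, 3, 64, 64))
```

In the real model the per-scale features are also kept for the multi-scale correlation computation later, which is why the sketch returns both the pooled list and the fused vector.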
Textual Feature Extraction: The textual description of a fashion item consists of different words, and word2vec [41] is a word-based text encoding model that can differentiate semantics during the encoding phase and capture the semantic relationships between texts. We chose Word2Vec as our text encoder because its simplicity and efficiency allow us to clearly isolate the contribution of our core multimodal fusion framework rather than that of the encoder itself. To explore the textual description of each fashion item, word embedding is adopted to extract textual features. Given that each item contains a varying number of words, the maximum word-sequence length is calculated and shorter sequences are padded to this length with zeros. Then, we transform the words into a continuous vector space, and the textual relationships are captured through weighted averaging to obtain the textual features [29].
Formally, the text information for item o_i is defined as T = {w_1, w_2, ..., w_R} with R independent words. The word vector and word embedding are defined as e_r and W_e, respectively. Given the textual description T, we can obtain the textual feature t_i as follows:

    e_r = W_e · Word2Vec(w_r),    (4)
    t_i = Σ_{r=1}^{R} α_r e_r,    (5)

Here, Word2Vec(·) converts words into vectors; W_e is the weight matrix of the word embedding model and is a trainable parameter; α_r denotes the weight of the r-th word vector. The textual feature t_i is calculated as the weighted average in Equation (5).
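The textual branch can be sketched as follows in PyTorch; here a trainable `nn.Embedding` stands in for Word2Vec vectors combined with the trainable matrix W_e, word weights are produced by a small scoring layer, and zero padding is masked out. Vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextFeature(nn.Module):
    """Sketch of Eqs. (4)-(5): embed padded token ids, score each word,
    and take a masked weighted average to get the item's textual feature."""

    def __init__(self, vocab=5000, dim=300, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.emb = nn.Embedding(vocab, dim, padding_idx=pad_id)  # word vectors e_r
        self.score = nn.Linear(dim, 1)                           # word weights a_r

    def forward(self, ids):                        # ids: (batch, R), zero-padded
        mask = (ids != self.pad_id).unsqueeze(-1)  # True at real word positions
        e = self.emb(ids)                          # (batch, R, dim)
        # softmax over words, with padding positions pushed to ~zero weight
        a = self.score(e).masked_fill(~mask, -1e9).softmax(dim=1)
        return (a * e).sum(dim=1)                  # weighted average t_i

ids = torch.tensor([[4, 7, 9, 0, 0], [3, 0, 0, 0, 0]])  # two padded descriptions
t = TextFeature()(ids)
```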
3.3. Multimodal Alignment
Multimodal alignment aims to achieve high-quality alignment between textual and visual features to enhance the model’s ability to understand characteristics across different modalities. Visual/textual embeddings map images and text into the same feature space, allowing feature comparison and alignment between modalities, which helps the model better understand information from each of them. Here, visual textual embedding [42] is adopted to provide a unified representation. Similarly to Equation (5), the visual feature v_i is projected into the embedding space as v̂_i = W_v v_i, where W_v is the trainable parameter.
As shown in
Figure 5b, the visual semantic embedding aims to bring the corresponding textual feature t_i and projected visual feature v̂_i of the same item closer together in the joint space. This objective can be achieved by minimizing the following contrastive loss function, defined in Equation (6):

    L_vse = Σ_n max(0, u + d(t_i, v̂_i) − d(t_i, v̂_n)) + Σ_n max(0, u + d(t_i, v̂_i) − d(t_n, v̂_i)),    (6)

where d(·,·) is the function for computing the distance between a textual feature vector and a visual feature vector. For a given item, t_i and v̂_i represent the matching textual and visual feature vectors, respectively; t_n denotes the textual feature vectors of all possible non-matching items; v̂_n denotes the visual feature vectors of all possible non-matching items. Following reference [31], u is a margin, which separates the distances of matching pairs from those of non-matching pairs.
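A batch-wise version of this hinge-style alignment objective can be sketched as follows; Euclidean distance, in-batch negatives, and the margin value are illustrative assumptions of this sketch, not necessarily the paper's exact choices.

```python
import torch

def alignment_loss(t, v, margin=0.2):
    """Sketch of a margin-based contrastive loss in the spirit of Eq. (6):
    matching text/visual pairs (the diagonal) should be closer than any
    non-matching pair in the batch by at least `margin`."""
    d = torch.cdist(t, v)                       # pairwise distances d(t_i, v_j)
    pos = d.diag().unsqueeze(1)                 # matching distances d(t_i, v_i)
    off = ~torch.eye(len(t), dtype=torch.bool)  # mask selecting non-matching pairs
    # hinge over non-matching visuals per text, and non-matching texts per visual
    cost_v = (margin + pos - d).clamp(min=0)[off].mean()
    cost_t = (margin + pos.t() - d).clamp(min=0)[off].mean()
    return cost_v + cost_t

t = torch.randn(4, 8)
loss = alignment_loss(t, t.clone())  # perfectly aligned features as a sanity check
```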
3.4. Visual Enhancement
Capturing long-distance correlation relationships between items, such as color coordination between a red dress and black heels or style harmony among a graphic tee, black joggers, and sneakers, is crucial for the fashion compatibility task. Understanding the long-distance correlation of visual information allows systems to move beyond simple co-occurrence, enabling more sophisticated outfit matching. Additionally, human perception of clothing is ordered, such as from top to bottom and from left to right; CM-Mamba, as a one-dimensional sequence modeling method, can better simulate this cognitive logic. CM-Mamba is also well suited to fashion modeling tasks: its internal selective state space allows the selective updating of local key features, and because it assigns a unique position encoding to each block’s subsequence during sequence flattening, it better preserves the spatial structure of the image. The state-space model (SSM) in CM-Mamba can therefore focus on long-distance correlation features [43], combined with a dynamic weighting mechanism that adjusts their importance. Thereby, as shown in
Figure 5c, we use CM-Mamba to extract long-distance correlation features for visual feature enhancement. Firstly, the visual features v_i are normalized using Equation (7):

    v̄_i = Norm(v_i),    (7)

Secondly, one-dimensional convolution and state-space processing are employed to model the dynamic temporal relationships between the features. The Rectified Linear Unit (ReLU) function is used to dynamically adjust their importance with the following equations:

    c_i = Conv1d(v̄_i),    (8)
    s_i = SSM(c_i),    (9)
    g_i = ReLU(MLP(s_i; W_g)),    (10)

where Conv1d represents a convolution with a 1 × 4 kernel, MLP is a linear transformation, ReLU denotes the activation function, and SSM is the selective state-space model.

Finally, we obtain the enhanced features v'_i using Equation (11):

    v'_i = g_i ⊙ s_i,    (11)

where g_i denotes the weight of the processed feature s_i, W_g is the learnable parameter of the MLP, and v'_i is the enhanced visual features.
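The shape of this enhancement step can be sketched in PyTorch as below. A GRU layer stands in for CM-Mamba's selective state-space model, which we do not reimplement here, and the residual connection and all dimensions are assumptions of this sketch rather than the paper's specification.

```python
import torch
import torch.nn as nn

class VisualEnhancer(nn.Module):
    """Sketch of the enhancement pipeline: normalize, run a 1x4 Conv1d over the
    feature treated as a 1-D sequence, apply a sequential state scan (GRU as a
    stand-in for the SSM), then gate the result with ReLU-activated weights."""

    def __init__(self, dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                          # normalization step
        self.conv = nn.Conv1d(1, 1, kernel_size=4, padding=2)  # 1x4 convolution
        self.ssm = nn.GRU(1, 1, batch_first=True)              # SSM stand-in
        self.mlp = nn.Linear(dim, dim)                         # dynamic weighting

    def forward(self, v):                       # v: (batch, dim)
        x = self.norm(v).unsqueeze(1)           # view the feature as a 1-D sequence
        x = self.conv(x)[..., : v.size(1)]      # crop padding back to `dim` steps
        s, _ = self.ssm(x.transpose(1, 2))      # sequential scan over positions
        s = s.squeeze(-1)                       # (batch, dim)
        gate = torch.relu(self.mlp(s))          # importance weights
        return gate * s + v                     # gated features; residual is an assumption

out = VisualEnhancer()(torch.randn(2, 512))
```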
3.5. Multimodal Fusion
Fashion compatibility modeling faces challenges in multimodal fusion. First, features extracted from diverse modalities exhibit significant heterogeneity in both dimensionality and statistical distribution. Second, intermodal relationships, such as complementarity and consistency, exhibit considerable complexity that is hard to capture adequately. Consequently, we have designed a multimodal fusion module by referring to the dual-interaction mechanism [44]. The parallel processing mechanism of up-sampling and down-sampling enables the simultaneous capture of global high-level semantic features and local low-level detailed features. Combined with the advantages of residual connections, this mechanism can accurately capture the consistent correlations and complementary differences between the image and text modalities, thereby significantly enhancing the representational richness of multimodal features. On this basis, the consistency and complementarity information is efficiently integrated through feature concatenation and non-linear transformation operations, ultimately achieving deep fusion of multimodal features. As shown in
Figure 6, up- and down-sampling [45] are employed to project the multimodal features into a unified feature space, which addresses data heterogeneity. Meanwhile, the consistency information can be captured using down-sampling, which compresses and abstracts the features. The complementarity information is obtained through up-sampling, which magnifies and enriches the features. Additionally, the MLP layer effectively integrates the concatenated consistency and complementarity information using multiple non-linear transformations. As shown in
Figure 5d, the visual features and textual features are unified to different feature dimensions through up- and down-sampling. Then, the features F_up and F_down containing complementarity and consistency information are output. The equations are as follows:

    F_down = GELU(W_2 · GELU(W_1 [v'_i; t_i] + b_1) + b_2),    (12)
    F_up = GELU(W_4 · GELU(W_3 [v'_i; t_i] + b_3) + b_4),    (13)

Here, v'_i is the enhanced visual feature calculated in Equation (11), and t_i is the textual feature; GELU is the activation function; W_1, W_2, W_3, and W_4 denote the weights of the linear layers; b_1, b_2, b_3, and b_4 denote the learnable parameters.

Finally, we employ an MLP to fully integrate the consistency and complementarity information and obtain the fused features h_i. The formula is defined as Equation (14):

    h_i = σ(W_f [F_down; F_up] + b_f),    (14)

where W_f and b_f are the weight and learnable parameter of the MLP, respectively; σ denotes the activation function; h_i represents the fused features.
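A minimal PyTorch sketch of this dual-branch fusion is shown below; the concatenated input, the specific branch widths, and the output dimension are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualInteractionFusion(nn.Module):
    """Sketch of the fusion module: the joint visual/textual feature is
    down-sampled (compressed) to capture consistency and up-sampled
    (expanded) to capture complementarity; the two branch outputs are
    concatenated and merged by an MLP into the fused feature h_i."""

    def __init__(self, dim=512, out_dim=512):
        super().__init__()
        gelu = nn.GELU()
        self.down = nn.Sequential(nn.Linear(2 * dim, dim), gelu,
                                  nn.Linear(dim, dim // 2), gelu)   # consistency branch
        self.up = nn.Sequential(nn.Linear(2 * dim, 3 * dim), gelu,
                                nn.Linear(3 * dim, 2 * dim), gelu)  # complementarity branch
        self.mlp = nn.Sequential(nn.Linear(dim // 2 + 2 * dim, out_dim), nn.GELU())

    def forward(self, v_enh, t):
        z = torch.cat([v_enh, t], dim=-1)                  # joint multimodal input
        return self.mlp(torch.cat([self.down(z), self.up(z)], dim=-1))  # fused h_i

h = DualInteractionFusion()(torch.randn(2, 512), torch.randn(2, 512))
```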
3.6. Compatibility Evaluation
A typical complete fashion outfit consists of items from different categories, such as tops, bottoms, and footwear. The key challenge in constructing a harmonious outfit lies in combining items from different categories that exhibit strong correlations. The common approach to calculating the correlation between items is to measure the distance between item features in a shared embedding space. However, this method has some limitations.
1. The importance of negative correlations between items should be considered. The negative correlations, such as seasonal dissonance between short sleeves and cotton boots and material opposition between glossy leather trousers and a linen shirt, can effectively identify potential conflicts. Therefore, considering the negative correlations is crucial for optimizing the overall fashion compatibility.
2. The issue of partial compatibility, or even global incompatibility, may arise; for example, in an outfit consisting of a plaid shirt, a red mini skirt, and a pair of green boots, an individual item may exhibit high visual similarity to each of the two other items, but it may result in incompatibility as a complete outfit.
To tackle these issues, we employ projection embedding [31,46] to map item combinations from different categories into distinct subspaces. Within these subspaces, we compute both negative correlations and multi-scale correlations for each item combination. These complementary measures are then integrated via weighted fusion, ultimately enabling a comprehensive compatibility evaluation for an outfit.
(1) Negative Correlation Calculation based on Projection Embedding: To capture the negative correlations between individual items, we first project item combinations from different categories into subspace embeddings. The negative correlations are then computed between the fused feature representations h_i and h_j using Pearson’s correlation coefficient, which quantifies the degree of negative correlation between different fashion items to identify conflicting ones, thereby improving the performance of fashion modeling. The specific formula is defined in Equation (15):

    ρ(h_i, h_j) = cov(h_i, h_j) / (σ_{h_i} σ_{h_j}),    (15)

where ρ represents Pearson’s correlation coefficient, and h_i and h_j are the features of the i-th and j-th item after fusion, where i ≠ j.
(2) Multi-scale Correlation Calculation based on Projection Embedding: Similarly, we project paired item combinations from different categories into distinct subspaces via embedding, and then compute multi-scale correlations between the multi-scale features f_i^k and f_j^k within each subspace using cosine similarity. The specific formula is as follows:

    c_{ij}^k = d(p_i^{(i,j),k}, p_j^{(i,j),k}),    (16)

Here, p_i^{(i,j),k} is the projection of the i-th item conditional on the combination (i, j), d(·,·) refers to the cosine similarity computation function, and k indexes the features of the k-th layer. The projection is obtained as

    p_i^{(i,j),k} = f_i^k ⊙ m_{(i,j)},    (17)

where m_{(i,j)} is a vector of learnable masks with the same dimensions as the features f_i^k, and ⊙ represents an element-level product operation. The mask m_{(i,j)} acts as an element-level gating function that selects relevant elements in the feature vector under different compatibility conditions [46].
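The two correlation measures can be sketched as follows; sharing a single mask dimension across scales is a simplifying assumption of this sketch (in practice each scale's feature, and hence its mask, may have its own dimensionality).

```python
import torch

def pearson(a, b):
    """Pearson's correlation coefficient between two fused item features,
    as in Eq. (15); values near -1 indicate strongly conflicting items."""
    a, b = a - a.mean(), b - b.mean()
    return (a * b).sum() / (a.norm() * b.norm() + 1e-8)

def multi_scale_corr(feats_i, feats_j, mask):
    """Masked cosine similarity per scale, as in Eqs. (16)-(17): each item's
    k-th scale feature is gated by the learnable mask for this category pair
    before comparison. `feats_i`/`feats_j` are lists of per-scale vectors."""
    scores = []
    for f_i, f_j in zip(feats_i, feats_j):
        p_i, p_j = f_i * mask, f_j * mask                        # projection
        scores.append(torch.cosine_similarity(p_i, p_j, dim=0))  # per-scale score
    return torch.stack(scores)

f = [torch.randn(64) for _ in range(4)]   # four scales, identical items
c = multi_scale_corr(f, f, torch.ones(64))
r = pearson(torch.arange(5.0), torch.arange(5.0))
```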
(3) Compatibility Evaluation: We perform the weighted fusion of both negative correlations and multi-scale correlations. This not only allows the model to better understand the relationships between items and improve the rationality of the matching but also helps identify fashion items that are unsuitable for matching, thereby improving the performance and accuracy of fashion modeling. Ultimately, for a given fashion outfit, the resulting representation is passed into an MLP comprising two linear layers for compatibility evaluation. The overall process is formally defined by Equation (18):

    ŷ^m = σ(MLP(ω_1 R_neg + ω_2 R_ms + ω_3; W)),    (18)

where ŷ^m denotes the compatibility score of the m-th outfit; ω_1, ω_2, and ω_3 represent the weights; σ is the activation function; W is the learnable parameter of the MLP; R_neg refers to the negative correlations; R_ms refers to the multi-scale correlations.
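The scoring head can be sketched as follows; the hidden width, the particular way the two aggregated correlation scores are combined, and the sigmoid output are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CompatibilityHead(nn.Module):
    """Sketch of the final scoring step: aggregated negative and multi-scale
    correlation scores are weighted, then a two-layer MLP with a sigmoid
    produces the outfit compatibility score in (0, 1)."""

    def __init__(self, hidden=32):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # learnable fusion weights
        self.mlp = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))  # two linear layers

    def forward(self, r_neg, r_ms):
        # r_neg / r_ms: per-outfit aggregated correlation scores, shape (batch,)
        z = torch.stack([self.w[0] * r_neg, self.w[1] * r_ms], dim=-1)
        return torch.sigmoid(self.mlp(z)).squeeze(-1)   # compatibility score

score = CompatibilityHead()(torch.tensor([0.3]), torch.tensor([0.8]))
```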
3.7. Loss Function
To optimize our model, the binary cross-entropy (BCE) [47,48] loss is used to construct our classification loss as Equation (19):

    L_cls = −(1/M) Σ_{m=1}^{M} [ y^m log(ŷ^m) + (1 − y^m) log(1 − ŷ^m) ],    (19)

where y^m and ŷ^m represent the true and predicted results for the m-th outfit, respectively.
Additionally, similar to [31], we employ L_mask to encourage sparsity in the masking process and utilize L_emb to promote normalized encoding by the CNN in the latent space:

    L_mask = Σ_{(i,j)} ||m_{(i,j)}||_1,   L_emb = ||E||_2,    (20)

where ||·||_1 refers to the ℓ1-norm, and ||·||_2 represents the ℓ2-norm.
The ultimate loss function is defined in Equation (21):

    L = L_cls + λ_1 L_mask + λ_2 L_emb + λ_3 L_vse,    (21)

where L is the total loss, λ_1 denotes the weight of the type mask loss, λ_2 is the weight of the feature vector loss, and λ_3 represents the weight of the visual textual embedding loss.
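The composite objective can be sketched as below; the lambda values are illustrative assumptions, not the paper's tuned settings.

```python
import torch
import torch.nn.functional as F

def total_loss(y_pred, y_true, mask, emb, loss_vse,
               lam1=5e-4, lam2=5e-4, lam3=0.1):
    """Sketch of the overall objective: BCE classification loss plus an l1
    mask-sparsity term, an l2 embedding-norm term, and the visual-textual
    embedding loss, combined with weights lam1/lam2/lam3."""
    l_cls = F.binary_cross_entropy(y_pred, y_true)  # classification loss
    l_mask = mask.abs().sum()                       # l1 sparsity on the masks
    l_emb = emb.norm(p=2)                           # l2 norm on the embedding
    return l_cls + lam1 * l_mask + lam2 * l_emb + lam3 * loss_vse

loss = total_loss(torch.tensor([0.9, 0.2]), torch.tensor([1.0, 0.0]),
                  torch.randn(8), torch.randn(8), torch.tensor(0.5))
```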
The model training flow of our proposed CMFN scheme is shown in Algorithm 1.
Algorithm 1: The training procedure of CMFN
Input: O: a given fashion outfit; {v_i, t_i}: multimodal information of each fashion item; y: the ground-truth fashion compatibility label.
Output: Θ: parameters of the proposed CMFN model.
1  Initialization: randomly initialize all network parameters Θ;
2  Repeat
3  for epoch = 1, 2, ..., 50 do
4    Extract multimodal features of fashion items:
5      Extract the multi-scale visual features with Equations (2)–(3);
6      Extract the textual features with Equations (4)–(5);
7    Learn the multimodal alignment representation of fashion items:
8      Perform multimodal alignment with Equation (6);
9    Visual feature enhancement, learning high-quality visual features:
10     Enhance the visual feature representation with Equations (7)–(11);
11   Learn the multimodal fusion representation of fashion items:
12     Convert the visual and textual features into a unified expression space with Equations (12)–(13);
13     Obtain the fused feature representation with Equation (14);
14   Explore the correlations between items:
15     for i = 1, 2, ..., n do
16       Update the negative correlations with Equation (15);
17       Update the multi-scale correlations with Equations (16)–(17);
18     end for
19   Predict fashion compatibility scores:
20     Update the predicted compatibility score with Equation (18);
21   Compute the overall loss with Equation (21) and update all network parameters Θ;
22 end for
23 Until convergence;
24 Return the model parameters Θ.
4. Experiments
To verify the effectiveness of the model improvements, numerical experiments are conducted on typical datasets. We first introduce the datasets, evaluation metrics, and experimental settings, followed by presenting the research results through model comparisons, ablation experiments, and case studies.
4.1. Datasets
To evaluate our proposed method, we adopted the Polyvore Outfits dataset [
32], which is widely used in previous fashion compatibility studies. Based on whether fashion items overlap across the training, validation, and testing sets, the dataset is split into two versions, the non-disjoint and disjoint versions, termed Polyvore Outfits-ND and Polyvore Outfits-D, respectively. Statistics of the two datasets are shown in
Table 2.
The Polyvore Outfits-ND and -D datasets comprise outfits with 2 to 19 and 2 to 16 items, respectively. Each fashion item across both datasets is associated with multiple modalities, including a visual image, a textual description, a popularity score, and category information. Regarding categories, the dataset provides two levels of annotation: 11 coarse-grained and 154 fine-grained categories. In this work, we utilize the visual images, textual descriptions, and category information of the items.
To construct the negative outfits, we randomly chose a fashion item from the same category in the dataset to replace the corresponding item in the positive samples. Since fashion coordination generally follows esthetic rules, outfits that were randomly swapped are very likely to be incompatible.
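This same-category swap can be sketched as follows; the (item_id, category) pair layout and the mapping from category to item ids are assumptions about the data format, not the paper's actual data structures.

```python
import random

def make_negative(outfit, items_by_category, rng=None):
    """Build a negative outfit by replacing one randomly chosen item with a
    random item of the same category (hypothetical layout: an item is an
    (item_id, category) pair; items_by_category maps category -> item ids)."""
    rng = rng or random.Random(0)
    idx = rng.randrange(len(outfit))
    item_id, category = outfit[idx]
    candidates = [i for i in items_by_category[category] if i != item_id]
    negative = list(outfit)
    negative[idx] = (rng.choice(candidates), category)
    return negative
```

Because the replacement is drawn from the same category, the negative outfit stays structurally valid (e.g., it still contains one top and one skirt) while very likely breaking the esthetic coherence of the original.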
4.2. Design of the Experiments
We aim to address the following five questions through numerical experiments:
Q1: Can the CMFN model outperform other methods in fashion compatibility?
Q2: What is the role of the different modules in the CMFN model?
Q3: What is the sensitivity of various hyperparameters of the CMFN model?
Q4: What are the qualitative evaluation results of our CMFN for specific tasks?
Q5: How well does the CMFN generalize?
4.2.1. Evaluation Tasks and Metrics
Our model is evaluated with two tasks, namely, fashion compatibility prediction (FCP) and fill-in-the-blank (FITB). The evaluation metrics are AUC and ACC, respectively.
FCP and AUC: The task of FCP is to predict the fashion compatibility score for a given outfit, as shown in
Figure 7a. For FCP, we exploit the AUC (area under the curve) as the corresponding evaluation metric, which is the area under the ROC (receiver operating characteristic) curve that represents the performance of the classifier. The closer the AUC is to 1.0, the better the model performs in the fashion compatibility prediction task.
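The AUC can be computed directly from its rank interpretation: the probability that a randomly chosen compatible outfit is scored higher than a randomly chosen incompatible one, with ties counting half. A dependency-free sketch (the function name is ours):

```python
def auc_score(labels, scores):
    """AUC as the probability that a positive outfit outscores a negative
    one (ties count 0.5); equivalent to the area under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice, `sklearn.metrics.roc_auc_score` computes the same quantity.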
FITB and ACC: The task of FITB is to select the most compatible fashion item from a candidate item set to fill in the blank for obtaining a compatible and complete outfit, as shown in
Figure 7b. In our experiments, each candidate set has four options. The task is completed by inserting each of the four options into the blank, scoring the resulting outfits, and selecting the highest-scoring option as the answer. Under these circumstances, the metric of ACC (accuracy) is used for evaluation, which measures the compatibility between the predicted candidate item and the existing items. Obviously, a higher ACC is better.
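The FITB evaluation described above reduces to an argmax over candidate scores; a minimal sketch, where `score_fn` stands in for the trained compatibility model and all names are ours:

```python
def fill_in_the_blank(partial_outfit, candidates, score_fn):
    # Score each completed outfit (blank filled by one candidate) and
    # return the index of the highest-scoring option (4 in our setup).
    scores = [score_fn(partial_outfit + [c]) for c in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

def fitb_accuracy(questions, score_fn):
    # Each question: (partial_outfit, candidates, ground_truth_index).
    correct = sum(fill_in_the_blank(p, c, score_fn) == gt
                  for p, c, gt in questions)
    return correct / len(questions)
```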
4.2.2. Experimental Parameters
In the visual feature extraction part of the experiment, ResNet101 was used to output the multi-scale features and the visual features, each with dimension 2048. The spatial size of the input images is 224 × 224 pixels. For textual feature extraction, we employed a pre-trained word2vec model to extract textual features, also with dimension 2048. We implemented our proposed CMFN with the PyTorch v2.1.0 framework and conducted the experiments on an NVIDIA vGPU with 32 GB of video memory. For optimization, we employed the adaptive moment estimation (Adam) method as the training optimizer. Specifically, we adopted a grid search strategy to determine the optimal hyperparameter values among {1, 2, 3, 4, 5}. In addition, the learning rate and the hidden state dimension for all methods were searched in {1 × 10⁻¹, 1 × 10⁻², 1 × 10⁻³, 1 × 10⁻⁴} and {512, 1024, 2048, 4096}, respectively. The model was fine-tuned for 50 epochs on the training and validation sets, and we report the performance on the testing set. Polyvore Outfits-ND takes about 9 h to train for 50 epochs, while Polyvore Outfits-D takes about 3 h; our model's runtime efficiency is virtually identical to that of the baseline methods. The experiments showed that the model achieved optimal performance with an initial learning rate of 1 × 10⁻² (halved every 10 epochs), a batch size of 44, and a hidden state dimension of 2048. The hyperparameters α and β in the loss function were set to 5 × 10⁻⁴ and 5 × 10⁻³, respectively. The specific parameter settings are shown in
Table 3.
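The step-decay schedule described above (initial rate 1 × 10⁻², halved every 10 epochs over 50 epochs) can be written out explicitly; in PyTorch this corresponds to pairing `torch.optim.Adam` with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)`.

```python
def learning_rate(epoch, base_lr=1e-2, step=10, gamma=0.5):
    # Step decay: start at base_lr and multiply by gamma every `step` epochs.
    return base_lr * (gamma ** (epoch // step))

# Learning rate over the full 50-epoch schedule.
schedule = [learning_rate(e) for e in range(50)]
```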
4.3. Model Comparisons (Q1)
To demonstrate the effectiveness of our model, the following state-of-the-art methods are employed as baselines.
The benchmark models we selected are mainstream models with significant influence and representativeness. They not only perform excellently but also have clear advantages in combining information from different modalities, such as images, text, and category information, which aligns closely with our multimodal focus. Comparing against these benchmarks helps to better validate the effectiveness of our approach. In addition, none of these baseline models directly adopt Transformers or CLIP, possibly because such architectures cannot meet the correlation-oriented objectives of fashion compatibility tasks and may face scalability challenges with high-dimensional structured inputs. As shown in
Table 4, these models can be roughly divided into three categories: (1) sequence-based methods, which learn the latent relationships of visual features among fashion items; (2) graph network-based methods, which construct a fashion item relationship graph; and (3) pair-based methods, which measure the compatibility between two fashion items with a distance metric. Our model is most similar to MCN [
27], which is also a pair-based model. In summary, the above models cover the main families of state-of-the-art methods for fashion compatibility prediction.
As illustrated in
Table 5, our CMFN outperforms the state-of-the-art methods across the evaluation metrics. Specifically, the ACC exhibits a 5–10% improvement. Both Anet and HAM use cross-modal attention and self-attention for multimodal fusion; however, HAM performs worse than Anet, possibly because Anet additionally employs graph self-attention to mine relational information. Moreover, our model is superior to both Anet [28] and HAM [33]. This improvement can be attributed to the use of Mamba to capture key long-distance visual features, which avoids the loss of detailed features caused by the global averaging of self-attention, and to the enhanced modeling of implicit negative correlations between fashion items. We can draw the following conclusions.
(1) The sequence-based methods perform the worst, possibly because they process the items in an outfit in a fixed order, whereas the items within an outfit are unordered. Such methods cannot effectively capture the complex correlations between items, leading to poorer performance.
(2) Our method outperforms the graph-based methods. This may be because we consider the long-distance correlated visual features in the image, which enables the model to better capture the global correlations between items and enhances the overall semantic quality of the visual representation.
(3) Among the pair-based methods, our model achieves the best results. This indicates that simultaneously considering the long-distance correlated visual features and the negative correlations between items is crucial, as it helps to evaluate fashion combinations more accurately.
Furthermore, although the proposed CMFN shows only a limited improvement in the AUC metric, it still outperforms all baseline models. Meanwhile, the model achieves favorable accuracy (ACC); as an intuitive evaluation criterion, this metric directly demonstrates the model's strong capability in sample category recognition, compensating for the limited AUC gain and further verifying the effectiveness of the model design. The modest AUC improvement can be attributed to two main factors. First, the baseline AUC values are already high, making further substantial gains inherently challenging; this trend is consistently reflected in recent cutting-edge studies, including Anet [28], FCM-CMAN [52], and PS-OCM [51], where reported AUC improvements have remained below 0.1%. Second, fashion compatibility prediction is inherently subjective and complex, often requiring more extensive datasets and domain-specific knowledge, so incremental model enhancements alone may have a limited effect on overall performance.
4.4. Ablation Experiments (Q2)
The ablation experiments are conducted to test the effects of the three improvements in CMFN. The first is the visual enhancement module that extracts long-distance visual features. The second is the multimodal fusion module. The last is the exploration of negative correlation relationships based on Pearson's correlation coefficient. The specific experimental designs are as follows.
W/O-VE: Only the visual features are used as the image features to eliminate the influence of the visual enhancement module.
W/O-MF: The multimodal fusion module of capturing consistency and complementarity information has been removed, leaving only the multimodal feature alignment to represent the multimodal information.
W/O-CE: The correlation relationships are computed based on multi-scale features, thereby eliminating the impact of the hidden negative correlation associations between different items.
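For reference, the correlation measures that the CE module builds on can be sketched as follows. The convex-combination fusion and the weight `w` are illustrative assumptions of ours, not the paper's exact Equations (15)–(17); the point is that Pearson's coefficient can expose negative correlations that cosine similarity alone may miss.

```python
import math

def pearson(u, v):
    # Pearson's correlation coefficient between two feature vectors;
    # ranges over [-1, 1] and can be negative for conflicting features.
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    du = [x - mu for x in u]
    dv = [y - mv for y in v]
    num = sum(a * b for a, b in zip(du, dv))
    den = math.sqrt(sum(a * a for a in du)) * math.sqrt(sum(b * b for b in dv))
    return num / den

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def unified_correlation(u, v, w=0.5):
    # Hypothetical fusion of the two measurements as a convex combination.
    return w * pearson(u, v) + (1 - w) * cosine(u, v)
```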
The results shown in
Table 6 indicate that our model outperforms the variant models, which verifies the effectiveness of the three modules. In particular, without the VE module, the ACC metric significantly decreases, indicating that the long-distance correlation features extracted by the visual enhancement module effectively improve the model's accuracy in selecting the correct answer in the FITB task. Similarly, without the CE module, the AUC metric drops significantly, which confirms that the model makes more accurate fashion compatibility predictions after exploring negative correlations. More importantly, without the MF module, the model's performance is lower than in any other case, which indicates that the dual-interaction design can effectively exploit the consistency and complementarity of multimodal information to achieve deep fusion of multimodal features.
4.5. Parameter Sensitivity Analysis (Q3)
In this section, we investigate the impact of key hyperparameters in our proposed CMFN, including the number of multimodal fusion layers μ, embedding dimension d, and learning rate lr.
To explore the impact of the number of multimodal fusion layers
μ, we evaluate our model's performance on the two tasks and two datasets by varying μ from 1 to 6 in increments of 1. As shown in
Figure 8a, the model achieves optimal performance when μ is 5, indicating that the model can deeply integrate the consistency and complementarity information of images and text. When μ is less than 5, performance improves as layers are added, perhaps because the initial fusion layers can quickly capture effective features across modalities and integrate them. However, when μ exceeds 5, performance declines; one possible reason is that too many layers introduce redundant information, causing the model to become overly reliant on certain features and thereby reducing overall performance.
At the same time, we also studied the impact of the embedding dimension, regularization rate, and learning rate on model performance. In particular, we conducted experiments with embedding dimensions of {512, 1024, 2048, 4096} and learning rates of {1 × 10⁻¹, 1 × 10⁻², 1 × 10⁻³, 1 × 10⁻⁴}. Figure 8b,c show the effects of these parameters on the performance on the Polyvore Outfits-ND and Polyvore Outfits-D datasets. The model achieves optimal performance when the embedding dimension is set to 2048 and the learning rate to 1 × 10⁻². However, as seen in Figure 8c, the model is more sensitive to the learning rate, which indicates that the learning rate has a significant impact on model performance. Additionally, as the embedding dimension increases, performance first rises and then drops, which may be because a low embedding dimension provides insufficient features, while an excessively high embedding dimension can lead to feature overload.
4.6. Case Study (Q4)
For better understanding of our CMFN, we conduct case studies on two evaluation tasks as well.
Table 7 presents the results of four test samples for the FCP task. We can see that the first and second groups obtained low scores, which means they are incompatible; it is obvious that the color and style of these two sets do not match. The third and fourth groups have a more consistent style and color and thus receive higher scores. Through the ablation study on each outfit, it can be seen that the first and second groups contain obvious negative correlations, such as the color conflict between the blue short-sleeved shirt and the bright red pants, and the style conflict between the dress and the shoes. Therefore, when the CE module is removed in these two groups, the model's prediction score is actually higher, because it fails to recognize these conflicts. For the third and fourth groups, the overall clothing details are more closely related. When the VE module is absent, the model lacks the ability to capture long-range association features, resulting in a compatibility score significantly lower than that of the complete model. This reveals that our method can provide a helpful compatibility assessment for fashion outfits.
In
Table 8, the left column poses the fill-in-the-blank (FITB) questions, while the middle column presents the corresponding four answer options, with the ground-truth items marked by a black box. For example, in the first group, the first item in the middle column is the most suitable option, since the whole outfit belongs to summer and has a more minimalist style. The second group appears more formal in terms of occasion, making the second item more appropriate. In the last group, considering the color matching, the first option is a better fit. Moreover, the cases show that only the CMFN and W/O-CE gave the correct choice in the first group. This indicates that W/O-CE plays a more obvious role when there is a conflict in matching. In the last two groups, only the CMFN gave the correct option, which highlights the importance of each component module. Overall, the CMFN can accurately select the missing item, demonstrating the effectiveness of the model. Furthermore, the FITB questions are randomly selected from the original fashion dataset; our CMFN selects accurate options and is generally independent of the input order of fashion item sequences.
4.7. Generalization Ability Analysis (Q5)
To verify the generalization ability of the model, we conducted experiments on the training and test sets of the Polyvore Outfits-ND and Polyvore Outfits-D datasets. First, the training and test sets of both datasets were obtained via random division, which ensures data independence and facilitates the evaluation of generalization. Second, the two datasets correspond to the non-disjoint and disjoint versions, respectively, depending on whether the data in the training and test sets overlap. In the Polyvore Outfits-D dataset, there is no overlap between the training and test sets, meaning that the model cannot access any test samples during training. Such independence enables the test set to serve as a genuine evaluation benchmark, ensuring that model performance does not rely on memorizing training data and thus verifying the model's generalization ability. As shown in
Table 9, the performance discrepancies in AUC and ACC between the test set and the training set of the same dataset are relatively small. For example, on the Polyvore Outfits-ND and Polyvore Outfits-D datasets, the differences in AUC and ACC metrics between the training set and the test set are all less than 0.01. This phenomenon indicates that our proposed CMFN model achieves consistent performance across different datasets. Such consistency reflects the model’s stability when confronted with unseen data, proving that the model possesses a certain degree of generalization ability.
5. Conclusions
Fashion compatibility assessment has emerged as a significant research focus in AI and computer vision, largely driven by its substantial commercial value for recommendation systems and styling services. This study addresses the challenge of fashion compatibility modeling using multimodal fusion, particularly concerning the complex correlations and relationships that exist in the multimodal data and items. We proposed a correlation-aware multimodal fusion network (CMFN) framework to achieve the task of fashion compatibility modeling. The proposed framework operates in three cohesive stages:
Feature Extraction and Alignment: Visual features are extracted via ResNet101 and textual features via word embedding. A visual/textual embedding module aligns these multi-scale features, while a CM-Mamba module enhances long-distance correlated feature representations.
Multimodal Fusion: The aligned features are processed to deeply integrate the consistent and complementary interactive relationships between the modalities into a unified representation.
Compatibility Prediction: Compatibility is assessed by mining correlations across items, focusing on negative and multi-scale interactions, followed by a final prediction from an MLP.
The key contributions of this research are summarized as follows:
The first lies in effectively capturing long-distance correlated features by adopting a state-space model to extract these features and employing a dynamic weighting mechanism for critical correlations. The results reveal that it can ultimately enhance the quality of visual representations.
The second contribution addresses the challenge of deep multimodal fusion between visually and textually heterogeneous data (features). We propose a dual-interaction mechanism that effectively captures critical intermodal relationships, such as complementarity and consistency, thereby facilitating deep integration and enabling more effective multimodal learning.
To address the challenge of modeling multifaceted item correlations, including negative and multi-scale relationships, a unified metric is proposed, which fuses measurements from Pearson’s correlation coefficient and cosine similarity to provide a comprehensive basis for assessment.
Extensive experiments on the Polyvore Outfits-ND and Polyvore Outfits-D datasets show that our model outperforms state-of-the-art methods in terms of both AUC and ACC metrics. Furthermore, ablation studies conducted on the constituent modules verify the individual contribution of each component within the architecture. The generalization capability of the model is also systematically validated through comparative analysis between its performance on the training and test sets.
Future research should explore mechanisms for providing personalized fashion outfit recommendations to specific users. In the era of big data, it is crucial to gain in-depth insights into users' behavioral preferences for specific fashion products while protecting user privacy, which will facilitate the development of the personalized fashion industry. In addition, another limitation of this study is that it does not take into account the attribute context of fashion items; we plan to further optimize the model by introducing attribute context information in future research. Last but not least, the compatibility assessment workflow can also be viewed as a protocol or service. Attempts can be made to deconstruct the CMFN framework into clearer phased modules, which can enhance the interpretability of the system [
53].