Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling

Fang, Yan; Ge, Jiangnan; Xiao, Ran; Zhang, Yidan

doi:10.3390/electronics15020332

Open AccessArticle

Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling

School of Maritime Economics and Management, Dalian Maritime University, Dalian 116026, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(2), 332; https://doi.org/10.3390/electronics15020332

Submission received: 3 December 2025 / Revised: 9 January 2026 / Accepted: 10 January 2026 / Published: 12 January 2026

(This article belongs to the Special Issue Data Analysis and Data Fusion in System Identification and Measurements)

Download

Browse Figures

Versions Notes

Abstract

The rapid growth of e-commerce and the booming online fashion industry are driving growing user demand for sophisticated, compatible fashion outfits. As an emerging multimodal information retrieval technology, fashion compatibility modeling aims to predict the compatibility degree for any given outfit and provide complementary item recommendations for incomplete outfits. Although existing research has made significant progress in exploring fashion compatibility tasks from a multimodal perspective, it has yet to fully exploit the multimodal information and correlations among fashion items. To effectively tackle these challenges, a correlation-aware multimodal fusion network for fashion compatibility modeling is proposed. Long-distance correlated visual features are investigated during multimodal processing to enhance the quality of visual features. An improved dual-interaction mechanism is used to achieve deep multimodal fusion. Furthermore, we explore both negative and multi-scale correlations to obtain complex correlations among items and thereby enhance the accuracy of fashion compatibility assessment. Extensive experiments on real-world fashion datasets demonstrate that our method outperforms existing advanced benchmark models in AUC and ACC metrics. This indicates the efficiency of our model in enhancing fashion compatibility evaluation performance.

Keywords:

correlation-aware; multimodal fusion; state-space model; fashion compatibility

1. Introduction

The growth of online fashion communities has created a fertile ground for research in outfit compatibility modeling (see Figure 1). Researchers can leverage large-scale datasets from platforms like Pinterest to train models that assess both visual harmony between individual items and the overall stylistic coherence of an outfit. The central challenge of this task lies in developing a comprehensive understanding of what makes a group of items compatible.

Unimodal approaches, which rely exclusively on either visual or textual information, are inherently limited in capturing the diverse and complementary signals required for accurate outfit compatibility assessment. Fashion items are intrinsically multimodal—each possesses a visual appearance, a textual description, and categorical attributes. Therefore, key aspects such as the alignment between a visual pattern (e.g., stripes) and its textual theme (e.g., “striped design”), or the categorical coherence among items (e.g., matching a formal shirt with tailored pants), are crucial for robust compatibility evaluation. Thus, multimodal fusion is not just a performance enhancer but an essential foundation for understanding the complex, interacting factors that determine fashion compatibility. This approach paves the way toward more intelligent, reliable, and context-aware fashion recommendation systems [1,2].

Although prior research has endeavored to obtain complete item representations by learning the multimodal information of individual fashion items, and these endeavors have yielded favorable outcomes in tackling fashion compatibility [3,4], three principal challenges remain.

1. Long-distance correlated features in images. As shown in Figure 2, long-distance correlated features in images, such as color and texture correlations between a red embroidered sweater and a green A-line skirt, play a critical role in fashion compatibility modeling. Effectively mining these long-range dependencies and enhancing the quality of visual representations are therefore essential for improving fashion compatibility evaluation.

2. Interactive relationships between multimodal information. Despite the significant heterogeneity between images and text, their interaction reveals critical intermodal relationships such as complementarity and consistency, as shown in Figure 3. A key challenge, therefore, lies in achieving deep fusion of these modalities to fully leverage multimodal information.

3. Complex correlations among items. As shown in Figure 4, the modeling of correlations among items presents a multifaceted challenge. Beyond negative correlations, such as stylistic conflicts and color disharmony, multi-scale correlations—from fine-grained item-to-item interactions to holistic outfit-level compatibility—must also be considered. Comprehensively integrating these complex relationships poses a significant challenge in fashion compatibility evaluation.

To address the aforementioned challenges, we propose a Correlation-aware Multimodal Fusion Network (CMFN) model to accomplish the task of fashion compatibility evaluation. First, ResNet101 and word embedding are employed to extract visual and textual features, respectively. The visual/textual embedding is utilized to align multi-scale visual features with textual features, while CM-Mamba is leveraged to enhance long-distance correlated visual features. Second, consistent and complementary characteristics within multimodal features are deeply integrated. Finally, correlations across different items are mined based on negative and multi-scale correlations, followed by using a multilayer perceptron (MLP) to achieve fashion compatibility evaluation. The main contributions can be outlined as follows:

(1) We propose a novel framework that integrates a state-space model with a dynamic weighting mechanism. This design effectively captures long-distance correlated features and adaptively enhances the most discriminative ones, thereby significantly improving the quality of visual representations for fashion compatibility modeling.

(2) We introduce an improved dual-interaction mechanism to capture complex intermodal relationships. This can promote deeper multimodal fusion and lead to more effective multimodal learning.

(3) Our method comprehensively measures complex correlations among items by fusing negative and multi-scale correlations into a unified metric. This integrated approach establishes a more holistic basis for compatibility assessment.

(4) Extensive experiments are conducted to evaluate our proposed CMFN on the publicly available datasets. The results demonstrate that our model achieves significant performance compared to state-of-the-art methods. Ablation studies and case analysis verify the individual contribution of each module within the model architecture.

The rest of this article is organized as follows. Section 2 briefly reviews related work. In Section 3, the details of our proposed CMFN scheme are presented. The experimental results and comprehensive analyses are reported in Section 4, and the conclusions are set out in Section 5.

2. Related Work

This section reviews two major areas related to our research: modality processing methods and fashion compatibility evaluation models.

2.1. Modality Processing

The fashion compatibility depends not only on visual information but also on textual descriptions of the items. Thus, the field of fashion compatibility has gradually evolved from exploring single features to utilizing richer multimodal features such as images and text. According to the input information of fashion items, existing modality processing research can be broadly divided into two categories: single-modal methods and multimodal methods.

Single-modal methods only utilize the visual or textual modality of fashion items. In the early stages, Shen, Zhao, and Lin et al. [5,6,7] conducted fashion compatibility evaluations using only textual information, including brands, types, and colors. However, the fashion item is highly visually dependent, and its characteristics cannot be fully described by textual descriptions. This led researchers to focus on how to use shallow layer image features and more expressive deep image features to study fashion compatibility [8,9,10]. However, these studies only utilized single-image features and lacked the fusion of multimodal features, which limited the performance of the models. Consequently, researchers began to incorporate both image and text.

Multimodal methods involve more than one modality of fashion items. For example, Veit et al. [11] added textual features of individual items based on image features and conducted compatibility modeling using the Siamese CNN. Song et al. [12] extracted image features with a CNN and textual features from titles using Text CNN to model for measuring fashion compatibility. However, these methods overlook the long-distance correlations of visual features, which is an innovative aspect of this study in modality extraction. Since image features and textual features are heterogeneous data and have a relationship of consistency and complementarity, several scholars have attempted to use fusion strategies (i.e., early fusion [13], mediate fusion [14] and late fusion [15,16]) to integrate multimodal information.

Early fusion methods: The different modal features are fused into a single representation before compatibility evaluation. Existing early fusion methods mainly adopt fusion strategies such as concatenation, summation, and bilinear pooling. For example, Yang et al. [17] and Zhan et al. [18] employ splicing operations for multimodal features. Dosovitskiy et al. [19] arranged vectors generated by pre-trained models with certain dimensions in a specific order for summation fusion. Devlin et al. [20] and Tsai et al. [21] fused different modalities into a joint representation space by computing the outer product of visual and textual feature vectors. However, these methods cannot account for modal interactions and impose high requirements on the spatial alignment of multimodal data. More approaches have started to be applied to multimodal fusion.

Mediate fusion methods: This method typically performs fusion after independently extracting features from each modality to form a unified representation for compatibility evaluation. Yang et al. [22] and Lee et al. [23] proposed Stacked Attention Networks (SANs) that use multilayer attention models to achieve the fusion of images and texts. Wang et al. [2] and Laenen et al. [24] used multimodal contrastive learning and an attention mechanism, respectively, to fuse visual and textual features. Tang et al. [25] achieve efficient fusion of RGB and thermal infrared modalities by combining a unified encoder and a triple-stream architecture with the MFM, RASPM, and MDAM modules. Zhao et al. [26] adopted a hybrid encoder structure with parallel Transformer and CNN branches, as well as a composite attention fusion strategy consisting of axial attention, channel attention, and auxiliary connection modules, achieving high-precision and universal fusion of multi-modal images.

Late fusion methods: This involves directly modeling the compatibility of features from different modalities and finally linearly combining the clothing compatibility scores from various modalities. Cui et al. [15] proposed a Node Graph Neural Network (NGNN) for fashion compatibility modeling, where the overall compatibility score is derived from the weighted sum of scores from the visual and textual modalities. Li et al. [27] developed a new framework, the Hierarchical Fashion Graph Network (HFGN), to model fashion compatibility, which calculates the compatibility score of items by deriving their final representation values. Although these three methods have achieved significant results, it is worth noting that the heterogeneity of multimodal data and the complexity of their interactions make multimodal fusion a popular challenge in current research.

2.2. Compatibility Evaluation

Fashion compatibility modeling necessitates that the complex correlations among diverse fashion items be explored to enable more precise compatibility evaluation. To effectively capture these intricate correlations, researchers have employed various advanced methodologies.

Some scholars have explored the correlations of fashion items using graph network methods. For instance, Song et al. [12] utilized Graph Convolutional Networks (GCNs) to investigate the intramodal and intermodal correlations among items. Li et al.’s [27] aforementioned Hierarchical Fashion Graph Network (HFGN) models correlations among users, items, and outfits. Subsequently, some researchers have begun to explore item correlations by combining attention mechanisms after constructing graph structures. Cui et al. [28] adopted Graph Attention Networks to aggregate information regarding the correlations among items and then utilized a self-attention compatibility predictor for compatibility evaluation. Zhuo et al. [29] leveraged Hypergraph Neural Networks and Graph Convolutional Networks to capture complex item correlations, incorporating attention mechanisms to achieve compatibility evaluation. Daehee et al. [30] integrated user/item correlations by leveraging subgraph-based graph neural networks and used node-aware attention pooling for compatibility evaluation. However, GNNs only capture certain simple correlations, which does not allow full leveraging of the complex correlations among items.

Other scholars have investigated the correlations using distance metrics. Wang et al. [31] investigated the correlations between items through the distance of features at various levels, combining this with a multilayer perceptron for compatibility evaluation. Vasileva et al. [32] and Li et al. [33] explored the correlations among item categories by learning through latent spaces of category embeddings, subsequently employing MLP for compatibility evaluation. While learning category embeddings aids the model in grasping various similarity concepts, the requirement for explicit labels during testing limits the model’s ability to generalize to unknown categories. In response, Reuben Tan et al. [34] proposed a method for jointly learning conditional similarity representations and their contributions without explicit supervision, utilizing the inherent characteristics of items to learn condition-aware embeddings combined with triplet loss for compatibility evaluation. However, comprehensively considering negative correlations and multi-scale correlations remains a current challenge.

In addition, optimization algorithms play a core role in the training phase of deep learning models. In fashion compatibility modeling, the Adam method is often chosen for hyperparameter optimization, but meta-heuristic optimization algorithms also significantly contribute to hyperparameter tuning. Erden [35] proposed a hyperparameter optimization method based on genetic algorithms to find the best parameter combinations and validated its effectiveness in optimizing hyperparameters for deep neural network models through comparisons. Guo et al. [36] introduced a structural parameter optimization scheme that combines transfer learning optimization networks with the gray wolf optimization algorithm, achieving remarkable results in the field of super-surface technology. Maria et al. [37] used artificial neural networks (ANNs) and tuned parameters through particle swarm optimization (PSO) to address challenges in financial application predictions, demonstrating higher reliability. Assiri et al. [38] adopted the Piranha Fish Optimization Algorithm (PFOA) to optimally tune the hyperparameters of the stacked sparse autoencoder method to improve the fault recognition rate. Finally, the effectiveness of the technique was verified through extensive simulations.

2.3. Summary

Research on modality processing and fashion compatibility evaluation has gained increasing significance, and various approaches have been explored. However, when addressing the specific challenges of complex correlations among multimodal data and items, this research focuses on resolving the following critical research questions:

1. Mining long-distance correlated visual features: How to effectively extract visual features that encompass long-distance correlations is critical for enhancing the quality of visual features to achieve a more comprehensive understanding of the correlations among fashion items.

2. Insufficient multimodal fusion: Given the significant differences between image data and text data, as well as the complementary and consistent interactive relationships between them, how to deeply integrate these modalities to fully leverage information still merits further research.

3. Exploring complex correlations among items: This research aims to conduct an in-depth investigation into the challenges associated with various correlations, including the negative and multi-scale ones, facilitating a thorough evaluation of fashion compatibility.

3. Proposed Methodology

The framework of CMFN, as shown in Figure 5, comprises five parts: (1) multimodal feature extraction, which is used for extracting visual features and textual features; (2) multimodal feature alignment, aligning the visual features and textual features; (3) visual feature enhancement, which can capture long-distance correlated visual features to boost the quality of visual feature representations; (4) multimodal feature fusion, which can capture consistency and complementarity information to complete multimodal fusion by the dual-interaction; and (5) compatibility evaluation, which is used to calculate the correlations of fashion outfits and to perform compatibility scoring. In this section, we first define the problem and notations and then provide details of our proposed CMFN.

3.1. Problem Formulation and Notations

In general, fashion compatibility modeling aims to predict whether a given fashion outfit is compatible or not, and it is a type of binary classification problem. Suppose there is a fashion outfit that contains a set of fashion items, which belongs to N different categories

X

=

\{x_{i} | i = 1, 2, \dots N\}

, where

x_{i}

is the i-th category item in the outfit. There is also a training set of M fashion outfits,

\{p^{m}, y^{m} | m = 1, 2, \dots, M\}

, where

p^{m}

is the m-th fashion outfit, and

y^{m}

is the ground truth compatibility label of the m-th fashion outfit. Specifically,

y^{m}

= 1 means that the m-th set of fashion outfits is compatible, and

y^{m}

= 0 means the opposite. In this research, we use the visual image of each fashion item

v_{i}

, and a textual description

t_{i}

as our multimodal input information. The objective is to evaluate the compatibility score for each outfit using our multimodal fashion compatibility model G. The model is defined as Equation (1):

y^{m^{'}} = G ({\{v_{i}^{m}, t_{i}^{m}\}}_{i = 1}^{n} | ϑ)

(1)

where G denotes the CMFN model,

y^{m^{'}}

denotes the predicted score for a given fashion outfit

p^{m}

,

ϑ

refers to the parameters that can be learned in the model, and

v_{i}^{m}

and

t_{i}^{m}

represent the visual and textual information of the i-th category item in the m-th set of fashion outfits, respectively. Notably, we omit the superscript m of each outfit in the rest of this article for brevity. Table 1 summarizes the main notations.

3.2. Multimodal Feature Extraction

Multimodal feature extraction aims to extract useful information from images and textual descriptions of fashion items. In this part of the experiment, we use the following modules to learn the visual and textual representation of each fashion item, respectively.

Multi-scale Visual Feature Extraction: Clothing images exhibit visual information at multiple scales, including macro-level attributes such as style, meso-level components such as shape and structure, and micro-level details including texture and fasteners. Effectively understanding and integrating these hierarchical features is critical for accurate fashion matching. We employ ResNet101 [39] for multi-scale feature extraction, leveraging its deep architecture and skip connections that naturally preserve both low-level textures and high-level semantic patterns. As shown in Figure 5a, the intermediate layer features from conv2 to conv5 of ResNet101 are processed with global average pooling (GAP) [40] to obtain the multi-scale visual features

v_{i}^{k}

. Then, the final visual features

v_{i}

can be obtained with the fully connected layer. The specific equations are as follows.

v_{i}^{k} = f_{G A P} ({c o n v}^{k} (v_{i})), k \in \{1, 2, \dots, 4\}

(2)

v_{i} = \{v_{i}^{1}, v_{i}^{2}, \dots, v_{i}^{k}\}

(3)

Here, k represents the number of layers in the middle layer,

{c o n v}^{k}

denotes the features of the k-th layer,

f_{G A P}

refers to the global average pooling,

v_{i}^{k}

is the middle-layer features of the k-th layer for the i-th item, and

v_{i}

is the visual features of the i-th item.

Textual Feature Extraction: The textual description of fashion items consists of different words, and word2vec [41] is a word-based text encoding model, which can semantically differentiate during the encoding phase and capture the semantic relationships between texts. We chose Word2Vec as our text encoder because its simplicity and efficiency allow us to clearly isolate and highlight the contribution of our core multimodal fusion framework, rather than the encoder itself. To explore the textual description of each fashion item, word embedding is adopted to extract textual features. Given that each individual item contains a varying number of words, the maximum word sequence length is calculated and brought to the same length with zeros. Then, we transform the text words into a continuous vector space. The textual relationships are captured through weighted averages to obtain the textual features [29].

Formally, the text information for item

x_{i}

is defined as T =

\{h_{1}, h_{2}, \dots, h_{φ} \dots, h_{R}\}

with R independent words. The word vector and word embedding are defined as

e_{φ}

and

h_{φ}

, respectively. Given the textual description T, we can obtain the textual feature

t_{i}

as follows:

e_{φ} = Word 2 Vec (h_{φ})

(4)

t_{i} = \frac{1}{R} \sum_{φ = 1}^{R} {w_{T} e}_{φ}

(5)

Here, Word2Vec aims to convert words into vectors;

w_{T}

is the weight matrix of the word embedding model and is a trainable parameter. Using the weighted average of the textual feature,

t_{i}

is calculated with Equation (5).

3.3. Multimodal Alignment

Multimodal alignment aims to achieve high-quality alignment between textual and visual features to enhance the model’s ability to understand characteristics across different modalities. Visual/textual embeddings map images and text to the same feature space, allowing for feature comparison and alignment between different modalities, which helps the model better understand information from various modalities. Here, visual textual embedding [42] is adopted by providing a unified representation. Similarly to Equation (5), the visual feature

v_{i}^{k}

is projected into the embedding space as

v_{i}^{k} = w_{I} v_{i}^{k}

, where

w_{I}

is the trainable parameter.

As shown in Figure 5b, the visual semantic embedding aims to bring the corresponding

t_{i}

and

v_{i}^{k}

of the same item closer together in the joint space. This objective can be achieved by minimizing the following contrastive loss function and is defined in Equation (6).

L_{v t e} = \sum_{t_{i}} \sum_{\hat{v_{i}^{k}}} m a x (0, u - s_{t_{i}, v_{i}^{k}} + s_{t_{i}, \hat{v_{i}^{k}}}) + \sum_{v_{i}^{k}} \sum_{\hat{t_{i}}} \max (0, u - s_{v_{i}^{k}, t_{i}} + s_{v_{i}^{k}, \hat{t_{i}}})

(6)

where

s (t_{i}, v_{i}^{k})

is the function for computing the distance of the textual feature vector and visual feature vector. For a given item,

t_{i}, v_{i}^{k}

represent the matching textual feature vector and visual feature vector, respectively;

\hat{t_{i}}

denotes the textual feature vector of all possible non-matching items;

\hat{v_{i}^{k}}

denotes the visual feature vector of all possible non-matching items. According to reference [31], u is a margin, which represents the distance metric between the feature vector of all matching items and the feature vector of non-matching items.

3.4. Visual Enhancement

Capturing long-distance correlation relationships between items, such as color coordination between a red dress and black heels or style harmony among a graphic tee, black joggers, and sneakers, is crucial for the fashion compatibility task. Understanding the long-distance correlation of visual information allows systems to move beyond simple co-occurrence, enabling more sophisticated and inspired outfit matching. Additionally, human perception of clothing is ordered, such as from top to bottom and from left to right. CM-Mamba, as a one-dimensional sequence modeling method, can better simulate this cognitive logic. CM-Mamba is also more suitable for fashion modeling tasks. Since its internal selection state space allows for the selective updating of local key features and because it assigns unique position encoding to each block’s subsequence during sequence flattening, it better preserves the spatial features of the image. Therefore, the state-space model (SSM) in CM-Mamba can focus on long-distance correlation features [43], combining with the dynamic weighting mechanism to adjust the importance of long-distance correlation features. Thereby, as shown in Figure 5c, we use CM-Mamba to extract long-distance correlation features for visual feature enhancement. Firstly, the visual features

v_{i}

are normalized using Equation (7).

v_{n o r m} = \frac{v_{i}}{\sqrt{\sum_{i = 1}^{n} v_{i}}}

(7)

Secondly, one-dimensional convolution and state-space processing are employed to model the dynamic temporal relationship between the features. The Rectified Linear Unit (ReLU) function is used to dynamically adjust their importance with the following equations:

v_{n o r m}^{a} = S S M (C o n v 1 d (M L P (v_{n o r m})))

(8)

v_{n o r m}^{b} = R e L U (M L P (v_{n o r m}))

(9)

v_{n o r m}^{c} = S S M (C o n v 1 d (M L P (v_{n o r m})))

(10)

where Conv1d represents the convolution kernel as a 1 × 4 convolution, MLP is the linear transformation, ReLU denotes the activation function, and SSM is the a priori state-space model.

Finally, we obtain the enhanced features

v_{i}^{'}

using Equation (11).

v_{i}^{'} {= v}_{i} {+ (w_{1} ((v}_{a}^{n o r m} \times v_{b}^{n o r m}) + {(v}_{b}^{n o r m} \times v_{c}^{n o r m})) + b_{1})

(11)

where

w_{1}

denotes the weight of the processed feature,

b_{1}

is the learnable parameter of the MLP, and

v_{i}^{'}

is the enhanced visual features.

3.5. Multimodal Fusion

Fashion compatibility modeling faces challenges in multimodal fusion. First, features extracted from diverse modalities exhibit significant heterogeneity in both dimensionality and statistical distribution. Second, intermodal relationships, such as complementarity and consistency, exhibit considerable complexity, which is hard to adequately capture. Consequently, we have designed a multimodal fusion module by referring to the dual-interaction mechanism [44]. The parallel processing mechanism of up-sampling and down-sampling enables the simultaneous capture of global high-level semantic features and local low-level detailed features. Combined with the advantages of residual connections, this mechanism can accurately capture the consistent correlations and complementary differences between image and text modalities, thereby significantly enhancing the representational richness of multimodal features. On this basis, the consistent and complementary information is efficiently integrated through feature concatenation and non-linear transformation operations, ultimately achieving deep fusion of multimodal features. As shown in Figure 6, up- and down-sampling [45] are employed to project the multimodal features into a unified feature space, which can address data heterogeneity. Meanwhile, the consistency information can be captured using down-sampling, which adopts compression and abstraction of features. The complementarity information is obtained through up-sampling to magnify and enrich features. Additionally, the MLP layer effectively integrates the concatenated consistency and complementarity information using multiple non-linear transformations. As shown in Figure 5d, the visual features and text are unified to different feature dimensions through up- and down-sampling. Then, the features

v_{i}^{″}

and

t_{i}^{'}

containing complementarity and consistency information are output. The equations are as follows:

v_{i}^{″} = v_{i}^{'} + G E L U {(w}_{3} (G E L U (w_{2} v_{i}^{'} + b_{2})) + b_{3})

(12)

t_{i}^{'} = t_{i} + G E L U {(w}_{5} (G E L U (w_{4} t_{i} + b_{4})) + b_{5})

(13)

Here,

v_{i}^{'}

is the enhanced visual features as calculated in Equation (11), and

t_{i}

is the textual features; GELU is the activation function;

w_{2}

,

w_{3}

,

w_{4}

, and

w_{5}

denote the weights of the linear layers;

b_{2}

,

b_{3}

,

b_{4}

, and

b_{5}

denote the learnable parameters.

Finally, we employ an MLP to fully integrate consistent and complementarity information to obtain the fused features

f_{i}

. The formula is defined as Equation (14):

f_{i} = s o f t m a x (w_{6} {(v}_{i}^{″} + t_{i}^{'}) + b_{6})

(14)

where

w_{6}

,

b_{6}

are the weight and learnable parameter of the MLP, respectively;

s o f t m a x

denotes the activation function;

f_{i}

represents the fused features.

3.6. Compatibility Evaluation

A typical complete fashion outfit consists of different categories of items such as tops, bottoms, and footwear. The key challenge in constructing a harmonious outfit lies in combining items from different categories that exhibit strong correlations. Currently, the common approach of calculating the correlation between items is to measure the distance between item features in a shared embedding space. However, this method has some limitations.

1. The importance of negative correlations between items should be considered. The negative correlations, such as seasonal dissonance between short sleeves and cotton boots and material opposition between glossy leather trousers and a linen shirt, can effectively identify potential conflicts. Therefore, considering the negative correlations is crucial for optimizing the overall fashion compatibility.

2. The issue of partial compatibility, or even global incompatibility, may arise; for example, in an outfit consisting of a plaid shirt, a red mini skirt, and a pair of green boots, an individual item may exhibit high visual similarity to each of the two other items, but it may result in incompatibility as a complete outfit.

To tackle these issues, we employ projection embedding [31,46] to map item combinations from different categories into distinct subspaces. Within these subspaces, we compute both negative correlations and multi-scale correlations for each item combination. These complementary measures are then integrated via weighted fusion, ultimately enabling a comprehensive compatibility evaluation for an outfit.

(1) Negative Correlation Calculation based on Projection Embedding: To capture the negative correlations between individual items, we first project item combinations from different categories into subspace embeddings. The negative correlations are then computed between the fused feature representations

v_{i}

and

v_{j}

using Pearson’s correlation coefficient. It can quantify the degree of negative correlation between different fashion items to identify conflicting items, thereby improving the performance of fashion modeling. The specific formula is defined in Equation (15).

F_{m} = \frac{\sum_{i, j = 1}^{n} (v_{i} - \bar{v}) (v_{j} - \bar{v})}{\sum_{i = 1}^{n} {(v_{i} - \bar{v})}^{2} \sum_{j = 1}^{n} {(v_{j} - \bar{v}))}^{2}}

(15)

where

F_{m}

represents Pearson’s correlation coefficient, and

v_{i}

and

v_{j}

are the features of the i-th and j-th item after fusion, where

i \neq j

.

(2) Multi-scale Correlation Calculation based on Projection Embedding: Similarly, we project paired item combinations from different categories into distinct subspaces via embedding, and then compute multi-scale correlations between the multi-scale features

v_{i}^{k}

and

v_{j}^{k}

within each subspace using cosine similarity. The specific formula is as follows.

\emptyset_{m} = d (p^{i \to (i, j)} v_{i}^{k}, p^{j \to (i, j)} v_{j}^{k})

(16)

Here,

p^{i \to (i, j)}

is the projection of the i-th commodity conditional on the combination (i, j), d (.,.) refers to the cosine similarity computation function, and k represents the features of the k-th layer.

{v_{i}^{k} p}^{i \to (i, j)} = R e L U (v_{i}^{k} {\otimes r}_{(i, j)})

(17)

where

r_{(i, j)}

is a vector of learnable masks with the same dimensions as the features

v_{i}^{k}

, and

\otimes

represents an element-level product operation. The mask

r_{(i, j)}

acts as an element-level gating function that selects relevant elements in the feature vector under different compatibility conditions [46].

(3) Compatibility Evaluation: We perform the weighted fusion of both negative correlations and multi-scale correlations. This not only allows the model to better understand the relationships between items and improve the rationality of the matching but also helps identify fashion items that are unsuitable for matching. This can help the model better understand the complex correlations between fashion items to improve the performance and accuracy of fashion modeling. Ultimately, for a given fashion outfit, the resulting representation is then passed into an MLP comprising two linear layers for compatibility evaluation. The overall process is formally defined by Equation (18):

y^{m^{'}} = σ (w_{9} (w_{7} F_{m} + w_{8} \emptyset_{m}) + b_{3})

(18)

where

y^{m^{'}}

denotes the compatibility score of the m-th outfit;

w_{7}

,

w_{8}

,

w_{9}

represent the weights;

σ

is the activation function;

b_{3}

is the learnable parameter of the MLP;

F_{m}

refers to the negative correlations;

\emptyset_{m}

refers to the multi-scale correlations.

3.7. Loss Function

To optimize our model, the binary cross-entropy (BCE) [47,48] loss is used to construct our classification loss as Equation (19):

L_{c l f} = y^{m} l o g (y^{m^{'}}) + (1 - y^{m}) l o g (1 - y^{m^{'}})

(19)

where

y^{m},

y^{m^{'}}

represent the true and predicted results for the m-th outfit, respectively.

Additionally, similar to [31], we employ

L_{m a s k}

to encourage sparsity in the masking process and utilize

L_{f e a t u r e}

to promote normalized encoding by the CNN in the latent space.

L_{m a s k} (r) = {‖r‖}_{1}, L_{f e a t u r e} (f_{i}) = {‖f_{i}‖}_{2}

(20)

where

{‖.‖}_{1}

refers to the

L_{1}

-norm, and

{‖.‖}_{2}

represents the

L_{2}

-norm.

The ultimate loss function is defined in Equation (21):

L_{t o t a l} = L_{c l f} + α L_{m a s k} + β L_{f e a t u r e} + γ L_{v t e}

(21)

where

L_{t o t a l}

is the total loss,

α

denotes the weight of the type mask loss,

β

is the weight of the feature vector loss, and

γ

represents the weight of the visual textual embedding loss.

The model training flow of our proposed CMFN scheme is shown in Algorithm 1.

Algorithm 1: The training procedure of CMFN

Input:

p^{m}

: Given fashion outfit;

x_{i}

: multimodal information of each fashion item.

y^{m}

: The ground-truth fashion compatibility label;
Output:

ϑ

: Parameters of proposed CMFN model.
1 Initialization: Randomly initialize all network parameters

ϑ

;
2 Repeat;
3 for epoch = 1, 2,

\dots

, 50 do
4 Extract multimodal features of fashion items:
5 Extract the multi-scale visual feature

v_{i}

with Equations (2)–(3);
6 Extract the textual feature

t_{i}

with Equations (4)–(5);
7 Learn multimodal alignment representation of fashion items:
8 Perform multimodal alignment with Equation (6);
9 Visual feature enhancement, learning high-quality visual features:
10 Enhance visual feature representation

v_{i}^{'}

with Equations (7)–(11);
11 Learn multimodal fusion representation of fashion items:
12 Convert

v_{i}^{'}

and

t_{i}

into a unified expression space

v_{i}^{″}

and

t_{i}^{'}

with Equations (12)–(13);
13 Obtain the fused features representation

f_{i}

with Equation (14);
14 Explore the correlations between items:
15 for i = 1,2

\dots

, n do
16 Update negative correlations

F_{m}

with Equation (15);
17 Update multi-scale correlations

\emptyset_{m}

with Equations (16)–(17);
18 end for
19 Predict fashion compatibility scores:
20 Update the predicted compatibility score

y^{m^{'}}

with Equation (18);
21 Compute the overall loss with Equation (21) and update all network parameters

ϑ

;
22 end for
23 Until Convergence;
24 Return the model parameters

ϑ

.

4. Experiments

To verify the effectiveness of the model improvements, numerical experiments are conducted on typical datasets. We first introduce the datasets, evaluation metrics, and experimental settings, followed by presenting the research results through model comparisons, ablation experiments, and case studies.

4.1. Datasets

To evaluate our proposed method, we adopted the Polyvore Outfits dataset [32], which is widely used by previous fashion compatibility studies. In light of whether fashion items overlap in the training, the Polyvore Outfits dataset was split into two versions based on whether fashion items overlap in the training, validation, and testing sets: the non-disjoint and disjoint versions, termed as Polyvore Outfits-ND and Polyvore Outfits-D, respectively. Statistics of the two datasets are shown in Table 2.

The Polyvore Outfits-ND and -D datasets comprise outfits with 2 to 19 and 2 to 16 items, respectively. Each fashion item across both datasets is associated with multiple modalities, including a visual image, a textual description, a popularity score, and category information. Regarding categories, the dataset provides two levels of annotation: 11 coarse-grained and 154 fine-grained categories. In this work, we utilize the visual images, textual descriptions, and category information of the items.

To construct the negative outfits, we randomly chose a fashion item from the same category in the dataset to replace the corresponding item in the positive samples. Since fashion coordination generally follows esthetic rules, outfits that were randomly swapped are very likely to be incompatible.

4.2. Design of the Experiments

We aim to address the following four questions through numerical experiments:

Q1: Can the CMFN model outperform other methods in fashion compatibility?
Q2: What is the role of the different modules in the CMFN model?
Q3: What is the sensitivity of various hyperparameters of the CMFN model?
Q4: What are the qualitative evaluation results of our CMFN for specific tasks?
Q5: How about the generalization ability of the CMFN?

4.2.1. Evaluation Tasks and Metrics

Our model is evaluated with two tasks, namely, fashion compatibility prediction (FCP) and fill-in-the-blank (FITB). The evaluation metrics are AUC and ACC, respectively.

FCP and AUC: The task of FCP is to predict the fashion compatibility score for a given outfit, as shown in Figure 7a. For FCP, we exploit the AUC (area under the curve) as the corresponding evaluation metric, which is the area under the ROC (receiver operating characteristic) curve that represents the performance of the classifier. The closer the AUC is to 1.0, the better the model performs in the fashion compatibility prediction task.

FITB and ACC: The task of FITB is to select the most compatible fashion item from a candidate item set to fill in the blank for obtaining a compatible and complete outfit, as shown in Figure 7b. In our experiments, each candidate set has four options. The task is completed by scoring four different options—replacing only the blank area with various options—and selecting the highest-scoring option as the answer. Under these circumstances, the metric of ACC (accuracy) is used for evaluation, which can measure the compatibility between the predicted candidate item and the existing items. Obviously, a higher ACC is better.

4.2.2. Experimental Parameters

In the visual feature extraction part of the experiment, ResNet101 was used to output multi-scale features

v_{i}^{k}

and visual features

v_{i}

, where the dimension

d_{v}

= 2048. The spatial size of the input images is 224 × 224 pixels. For textual feature extraction, we employed a pre-trained word2vec to extract textual features

t_{i}

, where the dimension

d_{t}

= 2048. We implemented our proposed CMFN with a PyTorch v2.1.0 framework and conducted experiments on an NVIDIA vGPU with 32 GB of video memory. For the optimization, we employed the adaptive moment (Adam) estimation method as the training optimizer to optimize the model. Specifically, we adopted the grid search strategy to determine the optimal values for the hyperparameters

μ

among the values {1, 2, 3, 4, 5}. In addition, the learning rate and the hidden state dimension for all methods were searched in [1 × 10⁻¹, 1 × 10⁻², 1 × 10⁻³, 1 × 10⁻⁴] and [512, 1024, 2048, 4096]. The model was fine-tuned for 50 epochs based on the training and validation sets, and we reported the performance on the testing set. Polyvore Outfits-ND takes about 9 h to train 50 epochs, while Polyvore Outfits-D takes about 3 h. Numerical experiments indicate that our model demonstrates runtime efficiency virtually identical to that of the baseline methods. The experiments showed that the model achieved optimal performance when the initial learning rate was set to 1 × 10⁻² (decaying by half every 10 epochs), the batch size to 44, and the hidden state dimension to 2048. The hyperparameters α and β in the loss function were set to 5 × 10⁻⁴ and 5 × 10⁻³, respectively. The specific parameter settings are shown in Table 3.

4.3. Model Comparisons (Q1)

To demonstrate the effectiveness of our model, the following state-of-the-art methods are employed as baselines.

The benchmark models we selected are currently mainstream models with significant influence and representativeness. These models not only perform excellently in terms of performance but also have significant advantages in the research on the combination of information from different modalities, such as images, text, and category information, which aligns closely with our considerations of multimodal information. Analyzing and comparing with these benchmarks helps to better validate the effectiveness of our approach. In addition, none of these baseline models directly adopt Transformers or CLIP, possibly because they cannot meet the correlation-oriented objectives of fashion compatibility tasks and may face scalability challenges with high-dimensional structured inputs. As shown in Table 4, these models can be roughly divided into three categories: (1) sequence-based methods by learning the potential relationship of visual features among fashion items; (2) graph network-based methods by constructing a fashion item relationship graph; (3) pair-based methods based on measuring the compatibility between two fashion items by a distance metric. Our model is most similar to the MCN [27], which is a pair-based method model as well. In summary, the above models adopt different types of state-of-the-art methods for fashion compatibility prediction.

As illustrated in Table 5, our CMFN outperforms the state-of-the-art methods across various evaluation metrics. Specifically, the ACC has exhibited a 5–10% improvement. Both

C^{2}

Anet and HAM use cross-modal attention and self-attention for multimodal fusion; however, the performance of HAM is not as good as that of

C^{2}

Anet. This may be because

C^{2}

Anet employs graph self-attention to mine relational information. Moreover, the performance of our model is superior to that of

C^{2}

Anet [28] and HAM [33]. However, the performance improvement of our method can be attributed to its ability to use Mamba to capture key long-distance visual features, which avoids the loss of detailed features caused by the global averaging of self-attention and enhances the modeling of implicit negative correlations between fashion items. We can conclude the following results.

(1) The sequence-based method performs the worst, which may be because it processes the items in the outfit in a fixed order, but the items within the outfit are unordered. This method is unable to effectively capture the complex correlations between the items, leading to poorer performance.

(2) Compared to the graph-based method, our method performs better. It may be because we consider the long-distance correlated visual features in the image. which is able to better understand the global correlations between the items, enhancing the overall semantic quality of the image.

(3) Among the pair-based methods, our model achieves the best results. It indicates that simultaneously considering the long-distance correlated visual features and the negative correlations between items is crucial, which helps to more accurately evaluate fashion combinations.

Furthermore, although the proposed CMFN only shows a limited improvement in the AUC metric, it still outperforms all baseline models, thus verifying the performance gains of the model. Meanwhile, the model achieves favorable accuracy (ACC) performance—as an intuitive performance evaluation criterion, this metric directly demonstrates the model’s strong capability in the task of sample category recognition. This result not only reflects the stability of the model in practical application scenarios but also compensates for the limited improvement of the AUC metric, thereby further verifying the effectiveness of the model design in this study. However, the observed improvement in AUC is relatively modest, which can be attributed to two main factors. First, the baseline evaluation metrics are already high, making further substantial gains inherently challenging. This trend is consistently reflected in recent cutting-edge studies, including

C^{2}

Anet [28], FCM-CMAN [52], and PS-OCM [51], where reported improvements in AUC have consistently remained below 0.1%. Second, fashion compatibility prediction is inherently subjective and complex, often requiring more extensive datasets and domain-specific knowledge. Under these circumstances, incremental model enhancements alone may have a limited effect on overall performance.

4.4. Ablation Experiments (Q2)

The ablation experiments are conducted to test the effect of three improvements by CMFN. First is the visual enhancement model that extracts long-distance visual features. The second one is the multimodal fusion model. Last is an investigation of negative correlation relationships based on Pearson’s correlation coefficient. Specific experimental designs are as follows.

W/O-VE: Only the visual features are used as the image features to eliminate the influence of the visual enhancement module.

W/O-MF: The multimodal fusion module of capturing consistency and complementarity information has been removed, leaving only the multimodal feature alignment to represent the multimodal information.

W/O-CE: The correlation relationships are computed based on multi-scale features, thereby eliminating the impact of the hidden negative correlation associations between different items.

The results shown in Table 6 indicate that our model outperforms the variant models, which verifies the effectiveness of the three modules. In particular, when there is no VE, the metric ACC significantly decreases, indicating that the long-distance correlation features extracted by the visual enhancement module can effectively improve the model’s accuracy in selecting the correct answer in the FITB task. Similarly, if it has no CE module, the metric AUC significantly drops, which confirms that the model can make more accurate fashion compatibility predictions after exploring negative correlations. More importantly, when there is no MF module, the model’s performance is lower than in any other case, which indicates that the dual-interaction design can effectively exploit the consistency and complementarity of multimodal information to achieve deep fusion of multimodal features.

4.5. Parameter Sensitivity Analysis (Q3)

In this section, we investigate the impact of key hyperparameters in our proposed CMFN, including the number of multimodal fusion layers μ, embedding dimension d, and learning rate lr.

To explore the impact of the number of multimodal fusion layers μ, we evaluate our model’s performance on two tasks and two datasets by incrementally changing μ from 1 to 6 with an increment of 1 each time. As shown in Figure 8a, the model achieves optimal performance when μ is 5. This indicates that the model can deeply integrate the consistency and complementarity information of images and text. When μ is less than 5, the performance improves. This could be because in the initial fusion layers, the model can typically capture effective features between different modalities quickly and perform feature integration. However, when μ exceeds 5, the performance declines. One possible reason is that too many layers may introduce redundant information, causing the model to become overly reliant on certain features, thereby reducing overall performance.

At the same time, we also studied the impact of embedding dimension, regularization rate, and learning rate on the model performance. In particular, we conducted experiments with embedding dimensions of [512, 1024, 2048, 4096] and learning rates of [1 × 10⁻¹, 1 × 10⁻², 1 × 10⁻³, 1 × 10⁻⁴]. Figure 8b,c show the effects of these parameters on the performance of the Polyvore Outfits-ND and Polyvore Outfits-D datasets. The model achieves the optimal performance when the value of dimension is set to 2048, and the learning rate is 1 × 10⁻². However, as seen in Figure 8c, the model is more sensitive to the learning rate, which indicates that the learning rate has a significant impact on model performance. Additionally, as the embedding dimension increases, the model exhibits a trend of first rising and then dropping, which may be due to insufficient features when the embedding dimension is low, while an excessively high embedding dimension can lead to feature overload.

4.6. Case Study (Q4)

For better understanding of our CMFN, we conduct case studies on two evaluation tasks as well.

Table 7 presents the results of four test samples for the FPC task. We can see that the first and second groups obtained low scores, which means they are incompatible. It is obvious that the color and style of the two sets do not match. The third and fourth groups have a more consistent style and color, thus receiving higher scores. Through the ablation study on each outfit, it can be seen that for the first and second groups, there are obvious negative correlations, such as the color conflict between the blue short-sleeved shirt and the bright red pants and the style conflict between the dress and the shoes. Therefore, when the CE module is removed in these two groups, the model’s prediction score is actually higher, because it fails to recognize these conflicts. For the third and fourth groups, the overall clothing details are more closely related. When the VE module is absent, the model lacks the ability to capture long-range association features, resulting in a compatibility score significantly lower than the complete model. It reveals that our method can provide a helpful compatibility assessment for fashion suits.

In Table 8, the left column sets the questions for the fill-in-the-blank (FITB) items, while the middle column presents the corresponding four answer options with the ground truth items marked by a black box. For example, in the first group, the first item in the middle column is the most suitable option since the whole group belongs to summer and has a more minimalist style. The second group appears to be more formal in terms of occasion, making the second item more appropriate. In the last group, considering the color matching, the first option is a better fit. Moreover, the case shows that only the CMFN and W/O-CE gave the correct choice in the first group. This indicates that W/O-CE plays a more obvious role when there is a conflict in matching. In the last two groups, only the CMFN gave the correct option, which highlights the importance of each component module. Overall, it can be seen that the CMFN can accurately select the missing item, demonstrating the effectiveness of the model. Furthermore, the FITB problem is randomly selected from the original fashion dataset. Our CMFN can select accurate options and is generally independent of the input order of fashion item sequences.

4.7. Generalization Ability Analysis (Q5)

To verify the generalization ability of the model, we conducted experiments on the training and test sets of the Polyvore Outfits-ND and Polyvore Outfits-D datasets. First, the training and test sets of these two datasets were obtained via random division, which ensures data independence and facilitates the evaluation of the model’s generalization ability. Second, the Polyvore Outfits-ND and Polyvore Outfits-D datasets correspond to non-disjoint and disjoint versions, respectively, depending on whether there is an overlap between the data in the training and test sets. There is no overlap between the training set and the test set of the Polyvore Outfits-D dataset, meaning that the model cannot access any samples from the test set during the training process. Such independence enables the test set to serve as a genuine evaluation benchmark, ensuring that the model performance does not rely on memorizing training data and thus verifying the model’s generalization ability. As shown in Table 9, the performance discrepancies in AUC and ACC between the test set and the training set of the same dataset are relatively small. For example, on the Polyvore Outfits-ND and Polyvore Outfits-D datasets, the differences in AUC and ACC metrics between the training set and the test set are all less than 0.01. This phenomenon indicates that our proposed CMFN model achieves consistent performance across different datasets. Such consistency reflects the model’s stability when confronted with unseen data, proving that the model possesses a certain degree of generalization ability.

5. Conclusions

Fashion compatibility assessment has emerged as a significant research focus in AI and computer vision, largely driven by its substantial commercial value for recommendation systems and styling services. This study addresses the challenge of fashion compatibility modeling using multimodal fusion, particularly concerning the complex correlations and relationships that exist in the multimodal data and items. We proposed a correlation-aware multimodal fusion network (CMFN) framework to achieve the task of fashion compatibility modeling. The proposed framework operates in three cohesive stages:

Feature Extraction and Alignment: Visual features are extracted via ResNet101 and textual features via word embedding. A visual/textual embedding module aligns these multi-scale features, while a CM-Mamba module enhances long-distance correlated feature representations.
Multimodal Fusion: The aligned features are processed to deeply integrate their consistent and complementary interactive relationships between multimodal information into a unified representation.
Compatibility Prediction: Compatibility is assessed by mining correlations across items, focusing on negative and multi-scale interactions, followed by a final prediction from an MLP.

The key contributions of this research are summarized as follows:

The first lies in effectively capturing long-distance correlated features by adopting a state-space model to extract these features and employing a dynamic weighting mechanism for critical correlations. The results reveal that it can ultimately enhance the quality of visual representations.
The second contribution addresses the challenge of deep multimodal fusion between visually and textually heterogeneous data (features). We propose a dual-interaction mechanism that effectively captures critical intermodal relationships, such as complementarity and consistency, thereby facilitating deep integration and enabling more effective multimodal learning.
To address the challenge of modeling multifaceted item correlations, including negative and multi-scale relationships, a unified metric is proposed, which fuses measurements from Pearson’s correlation coefficient and cosine similarity to provide a comprehensive basis for assessment.
Extensive experiments on the Polyvore Outfits-ND and Polyvore Outfits-D datasets show that our model outperforms state-of-the-art methods in terms of both AUC and ACC metrics. Furthermore, ablation studies conducted on the constituent modules verify the individual contribution of each component within the architecture. The generalization capability of the model is also systematically validated through comparative analysis between its performance on the training and test sets.

Future research should explore the mechanisms for providing personalized fashion suit recommendations to specific users. However, in the era of big data, it is crucial to gain in-depth insights into users’ behavioral information regarding their preferences for specific fashion products while protecting user privacy, which will facilitate the development of the personalized fashion industry. In addition, another limitation of this study is that it does not currently take into account the attribute context of fashion items, and we plan to further optimize the performance of the model by introducing the attribute context information in future research. Last but not least, the compatibility assessment workflow can also be viewed as a protocol or service. Attempts can be made to deconstruct the CMFN framework into clearer phased modules, which can enhance the interpretability of the system [53].

Author Contributions

Conceptualization, Y.F.; methodology, Y.F. and J.G.; software, J.G.; validation, J.G. and R.X.; formal analysis, R.X.; investigation, J.G.; resources, Y.Z.; data curation, R.X.; writing—original draft preparation, Y.F. and J.G.; writing—review and editing, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are openly available in reference [32].

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chen, X.; Song, X.; Jing, L.; Li, S.; Hu, L.; Nie, L. Multi-modal dialog systems with dual knowledge-enhanced generative pretrained language model. ACM Trans. Inf. Syst. 2023, 42, 53. [Google Scholar]
Wang, L.; Zhang, C.; Xu, H.; Xu, Y.; Xu, X.; Wang, S. Cross-modal contrastive learning for multi-modal fake news detection. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 5696–5704. [Google Scholar]
Guan, W.; Wen, H.; Song, X.; Yeh, C.H.; Chang, X.; Nie, L. Multi-modal compatibility modeling via exploring the consistent and complementary correlations. In Proceedings of the 29th ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 2299–2307. [Google Scholar]
Song, X.; Fang, S.T.; Chen, X.; Wei, Y.; Zhao, Z.; Nie, L. Modality-oriented graph learning toward outfit compatibility modeling. IEEE Trans. Multimedia 2021, 25, 856–867. [Google Scholar] [CrossRef]
Shen, E.; Lieberman, H.; Lam, F. What am I gonna wear? Scenario-oriented recommendation. In Proceedings of the 12th International Conference on Intelligent User Interfaces, Honolulu, HI, USA, 28–31 January 2007; pp. 365–368. [Google Scholar]
Zhao, Y.; Araki, K. What to Wear in Different Situations? A Content-based Recommendation System for Fashion Coordination. In Proceedings of the Japanese Forum on Information Technology (FIT2011), Hokkaido, Japan, 7–9 September 2011. [Google Scholar]
Lin, Y.; Wang, T. Dress up like a stylist? Learning from a user-generated fashion network. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, CIKM 2017, Singapore, 6–10 November 2017. [Google Scholar]
Liu, S.; Feng, J.; Song, Z.; Zhang, T.; Lu, H.; Xu, C.; Yan, S. Hi, magic closet, tell me what to wear! In Proceedings of the 20th ACM International Conference on Multimedia, Nara, Japan, 29 October–2 November 2012; pp. 619–628. [Google Scholar]
McAuley, J.; Targett, C.; Shi, Q.; Van Den Hengel, A. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, Santiago, Chile, 9–13 August 2015; pp. 43–52. [Google Scholar]
He, R.; Packer, C.; McAuley, J. Learning compatibility across categories for heterogeneous item recommendation. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM), Barcelona, Spain, 12–15 December 2016; IEEE: New York City, NY, USA, 2016; pp. 937–942. [Google Scholar]
Veit, A.; Kovacs, B.; Bell, S.; McAuley, J.; Bala, K.; Belongie, S. Learning visual clothing style with heterogeneous dyadic co-occurrences. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 4642–4650. [Google Scholar]
Song, X.; Feng, F.; Han, X.; Yang, X.; Liu, W.; Nie, L. Neural compatibility modeling with attentive knowledge distillation. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 5–14. [Google Scholar]
Wei, Y.; Wang, X.; Guan, W.; Nie, L.; Lin, Z.; Chen, B. Neural multimodal cooperative learning toward micro-video understanding. IEEE Trans. Image Process. 2019, 29, 1–14. [Google Scholar] [CrossRef] [PubMed]
Ma, X.; Zhang, X.; Pun, M.O.; Liu, M. A multilevel multimodal fusion transformer for remote sensing semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5403215. [Google Scholar] [CrossRef]
Cui, Z.; Li, Z.; Wu, S.; Zhang, X.Y.; Wang, L. Dressing as a whole: Outfit compatibility learning based on node-wise graph neural networks. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 3–17 May 2019; pp. 307–317. [Google Scholar]
Lu, Z.; Hu, Y.; Jiang, Y.; Chen, Y.; Zeng, B. Learning binary code for personalized fashion recommendation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10562–10570. [Google Scholar]
Yang, X.; Ma, Y.; Liao, L.; Wang, M.; Chua, T.-S. Transnfcm: Translation-based neural fashion compatibility modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 403–410. [Google Scholar]
Zhan, H.; Lin, J.; Ak, K.E.; Shi, B.; Duan, L.Y.; Kot, A.C. A³-FKG: Attentive Attribute-Aware Fashion Knowledge Graph for Outfit Preference Prediction. IEEE Trans. Multimed. 2021, 24, 819–831. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (long and short papers), pp. 4171–4186. [Google Scholar]
Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multi-modal transformer for unaligned multi-modal language sequences. In Proceedings of the Conference Association for Computational Linguistics. Meeting 2019, 2019, 6558. [Google Scholar]
Yang, Z.; He, X.; Gao, J.; Deng, L.; Smola. A Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 30 June 2016; pp. 21–29. [Google Scholar]
Lee, K.H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 201–216. [Google Scholar]
Laenen, K.; Moens, M.F. A comparative study of outfit recommendation methods with a focus on attention-based fusion. Inf. Process. Manag. 2020, 57, 102316. [Google Scholar] [CrossRef]
Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-conquer: Confluent triple-flow network for RGB-T salient object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 47, 1958–1974. [Google Scholar] [CrossRef]
Zhao, Y.; Zheng, Q.; Zhu, P.; Zhang, X.; Ma, W. TUFusion: A transformer-based universal fusion algorithm for multimodal images. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 1712–1725. [Google Scholar] [CrossRef]
Li, X.; Wang, X.; He, X.; Chen, L.; Xiao, J.; Chua, T.S. Hierarchical fashion graph network for personalized outfit recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25 July 2020; pp. 159–168. [Google Scholar]
Li, Z.; Li, J.; Wang, T.; Gong, X.; Wei, Y.; Luo, P. Ocphn: Outfit compatibility prediction with hypergraph networks. Mathematics 2022, 10, 3913. [Google Scholar] [CrossRef]
Cui, K.; Liu, S.; Feng, W.; Deng, X.; Gao, L.; Cheng, M.; Lu, H.; Yang, L.T. Correlation-aware cross-modal attention network for fashion compatibility modeling in ugc systems. ACM Trans. Multimed. Comput. Commun. Appl. 2025, 21, 318. [Google Scholar] [CrossRef]
Kim, D.; Han, D.; Roh, D.; Han, K.; Yi, M.Y. Less is More: A Streamlined Graph-Based Fashion Outfit Recommendation without Multimodal Dependency. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Avila, Spain, 8–12 April 2024; pp. 492–495. [Google Scholar]
Wang, X.; Wu, B.; Zhong, Y. Outfit compatibility prediction and diagnosis with multi-layered comparison network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 329–337. [Google Scholar]
Vasileva, M.I.; Plummer, B.A.; Dusad, K.; Rajpal, S.; Kumar, R.; Forsyth, D. Learning type-aware embeddings for fashion compatibility. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 390–405. [Google Scholar]
Li, Y.; Li, G.X.; Zhang, J.; Jing, P.; Lu, X. Research on type-aware fashion compatibility prediction based on a hybrid attention mechanism. Multimed. Tools Appl. 2024, 83, 74003–74020. [Google Scholar] [CrossRef]
Tan, R.; Vasileva, M.I.; Saenko, K.; Plummer, B.A. Learning similarity conditions without explicit supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 10373–10382. [Google Scholar]
Erden, C. Genetic algorithm-based hyperparameter optimization of deep learning models for PM2. 5 time-series prediction. Int. J. Environ. Sci. Technol. 2023, 20, 2959–2982. [Google Scholar] [CrossRef]
Guo, S.; Du, Q. Structure parameters optimization of metasurface for broadband polarization conversion based on deep learning and optimization algorithm. In Proceedings of the 2024 5th International Conference on Computer Engineering and Application (ICCEA), Hangzhou, China, 12–14 April 2024; IEEE: New York City, NY, USA, 2024; pp. 1724–1729. [Google Scholar]
Vianny, D.M.M.; Vaddadi, S.A.; Karthikeyan, C.; Shahid, M.; Dhanapal, R.; Ravichand, M. Drug-based recommendation system based on deep learning approach for data optimization. Soft Comput. 2023, 1–9. [Google Scholar] [CrossRef]
Assiri, H. Piranha foraging optimization algorithm with deep learning enabled fault detection in blockchain-assisted sustainable iot environment. Sustainability 2025, 17, 1362. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Lin, M.; Chen, Q.; Yan, S. Network in network. arXiv 2013, arXiv:1312.4400. [Google Scholar]
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar] [CrossRef]
Kiros, R.; Salakhutdinov, R.; Zemel, R.S. Unifying visual-semantic embeddings with multi-modal neural language models. arXiv 2014, arXiv:1411.2539. [Google Scholar]
He, X.; Cao, K.; Zhang, J.; Yan, K.; Wang, Y.; Li, R.; Xie, C.; Hong, D.; Zhou, M. Pan-mamba: Effective pan-sharpening with state space model. Inf. Fusion 2025, 115, 102779. [Google Scholar] [CrossRef]
Lin, H.; Zhang, P.; Ling, J.; Yang, Z.; Lee, L.K.; Liu, W. PS-mixer: A polar-vector and strength-vector mixer model for multi-modal sentiment analysis. Inf. Process. Manag. 2023, 60, 103229. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Chen, H.; Yin, H.; Wang, W.; Wang, H.; Nguyen, Q.V.H.; Li, X. PME: Projected metric embedding on heterogeneous networks for link prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 1177–1186. [Google Scholar]
Inoue, N.; Simo-Serra, E.; Yamasaki, T.; Ishikawa, H. Multi-label fashion image classification with minimal human supervision. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2261–2267. [Google Scholar]
Guo, S.; Huang, W.; Zhang, X.; Srikhanta, P.; Cui, Y.; Li, Y.; Adam, H.; Scott, M.; Belongie, S. The imaterialist fashion attribute dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
Han, X.; Wu, Z.; Jiang, Y.G.; Davis, L.S. Learning fashion compatibility with bidirectional lstms. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1078–1086. [Google Scholar]
Su, T.; Song, X.; Zheng, N.; Guan, W.; Li, Y.; Nie, L. Complementary factorization towards outfit compatibility modeling. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, China, 20–24 October 2021; pp. 4073–4081. [Google Scholar]
Guan, W.; Wen, H.; Song, X.; Wang, C.; Yeh, C.H.; Chang, X.; Nie, L. Partially supervised compatibility modeling. IEEE Trans. Image Process. 2022, 31, 4733–4745. [Google Scholar] [CrossRef] [PubMed]
Jing, P.; Cui, K.; Guan, W.; Nie, L.; Su, Y. Category-aware multi-modal attention network for fashion compatibility modeling. IEEE Trans. Multimed. 2023, 25, 9120–9131. [Google Scholar] [CrossRef]
Yenugula, M.; Kasula, V.K.; Yadulla, A.R.; Konda, B.; Addula, S.R.; Kotteti, C.M.M. Privacy-Preserving Decision Tree Classification Using Homomorphic Encryption in IoT Big Data Scenarios. In Proceedings of the 2025 IEEE 4th International Conference on Computing and Machine Intelligence (ICMI), Mount Pleasant, MI, USA, 5–6 April 2025; IEEE: New York City, NY, USA, 2025; pp. 1–6. [Google Scholar]

Figure 1. An example of outfit compositions.

Figure 2. Examples of long-distance correlated features in images.

Figure 3. Example of interactive relationships between multimodal information.

Figure 4. Examples of complex correlations among items.

Figure 5. Schematic diagram of the CMFN framework.

Figure 6. Multimodal consistency/complementarity processing.

Figure 7. Two tasks for evaluating the compatibility prediction performance.

Figure 8. (a) Impact of the hyperparameter μ on AUC and ACC; (b) impact of the hyperparameter d on AUC and ACC; (c) impact of the hyperparameter lr on AUC and ACC.

Table 1. Summary of the main notations.

Notations	Explanations
N	Categories of each fashion outfit.
G	CMFN model.
$p^{m}$	The m-th fashion outfit.
$y^{m^{'}}$	The predicted fashion compatibility label of the m-th fashion outfit.
$x_{i}$	The i-th category item in an outfit.
$v_{i}$	The visual feature of fashion $x_{i}$ .
$e_{φ}$	The word vector of the $\emptyset$ -th word in the text of the i-th item.
$h_{φ}$	The word embedding of the $\emptyset$ -th word in the text of the i-th item.
$t_{i}$	The textual feature of fashion $x_{i}$ .
$v_{i}^{k}$	The multi-scale visual feature of $x_{i}$ .
$v_{i}^{'}$	The enhanced visual feature of $x_{i}$ .
$f_{i}$	The multimodal fused feature of $x_{i}$ .
$\emptyset_{m}$	The multi-scale correlation representation of the m-th fashion outfit.
$F_{m}$	The negative correlation representation of the m-th fashion outfit.

Table 2. Statistics of the two datasets.

Version	Name of the Dataset	No. of Outfits	No. of Items	No. of Outfits in the Training Set	No. of Outfits in the Validation Set	No. of Outfits in the Test Set
Non-disjoint	Polyvore Outfit-ND	68,306	365,054	53,306	5000	10,000
Disjoint	Polyvore Outfit-D	32,140	175,485	16,995	3000	15,145

Table 3. Experimental parameter settings.

Parameters	Explanations	Value
$d_{t}$	the dimension of textual features	2048
$d_{v}$	the dimension of visual features	2048
$μ$	the number of layers in the multimodal fusion module	5
batch size	the number of samples used in each training	44
epochs	the training period	50
lr	the learning rate	1 × 10⁻²
step_size	the decay period of the learning rate	10
gamma	the decay rate	0.5
$α$	the weight of the type mask loss	5 × 10⁻⁴
$β$	the weight of the feature vector loss	5 × 10⁻³
$γ$	the weight of the visual embedding loss	1
$u$	the margin of the visual textual loss	0.2

Table 4. Introduction to the baseline models.

Method Types	Models	Reference	Methods
Sequence-based	Bi-LSTM	MM 2017 [49]	Bi-LSTM
Graph-based	NGNN	WWW 2019 [15]	GNN
	OCM-CF	MM 2021 [50]	GCN and Attention
	PS-OCM	TIP 2022 [51]	HGCN
	$C^{2}$ Anet	ACM 2025 [28]	GAT and Attention
Pair-based	Type-aware	ECCV 2018 [32]	Type Embedding
	SCE-Net	ICCV 2019 [34]	Condition-aware Embedding
	MCN	ACM 2019 [31]	Multilayered Comparison Network
	FCM-CMAN	TMM 2023 [52]	Attention Network
	HAM	MTA 2024 [33]	Hybrid Attention
	Our CMFN		Multimodal Fusion Network

Table 5. Results of each model. The best results are highlighted in bold.

Method Types	Methods	Polyvore Outfits-ND		Polyvore Outfits-D
Method Types	Methods	AUC	ACC	AUC	ACC
Sequence-based	Bi-LSTM [49]	0.6624	0.3811	0.6272	0.3743
Graph-based	NGNN [15]	0.8712	0.5179	0.8361	0.4837
	OCM-CF [50]	0.9182	0.6362	0.8600	0.5659
	PS-OCM [51]	0.9196	0.6183	0.8753	0.5641
	$C^{2}$ Anet [28]	0.9271	0.6235	0.8842	0.5678
Pair-based	Type-aware [32]	0.8723	0.5778	0.8449	0.5585
	SCE-Net [34]	0.8709	0.5780	0.8422	0.5544
	MCN [31]	0.9056	0.6483	0.8591	0.5815
	FCM-CMAN [52]	0.9203	0.6186	0.8726	0.5617
	HAM [33]	0.8400	0.5310	0.8200	0.5130
	Our CMFN	0.9286	0.6885	0.8912	0.6358
	Improvement (%)	0.1618	6.201	0.7917	9.338

Table 6. Performance comparisons between various component configurations.

Methods	Polyvore Outfits-ND		Polyvore Outfits-D
Methods	AUC	ACC	AUC	ACC
W/O-VE	0.9153	0.6712	0.8832	0.6338
W/O-MF	0.9089	0.6634	0.8678	0.5997
W/O-CE	0.9186	0.6805	0.8782	0.6239
Ours	0.9286	0.6885	0.8912	0.6358

Table 7. Case study for FCP.

List of Fashion Items for Given Outfits	Compatibility Score
	0.1165	(CMFN)
	0.1198	(W/O-VE)
	0.1212	(W/O-MF)
	0.2105	(W/O-CE)
	0.3618	(CMFN)
	0.3708	(W/O-VE)
	0.3712	(W/O-MF)
	0.4015	(W/O-CE)
	0.9219	(CMFN)
	0.7608	(W/O-VE)
	0.8912	(W/O-MF)
	0.9115	(W/O-CE)
	0.9069	(CMFN)
	0.8208	(W/O-VE)
	0.8812	(W/O-MF)
	0.9001	(W/O-CE)

Table 8. Case study for FITB.

Questions	Answers	Methods
		CMFN	A
		W/O-VE	C
		W/O-MF	B
		W/O-CE	A
		CMFN	B
		W/O-VE	A
		W/O-MF	C
		W/O-CE	A
		CMFN	A
		W/O-VE	B
		W/O-MF	B
		W/O-CE	D

Table 9. Generalization Ability Analysis.

Datasets	AUC	ACC
Polyvore Outfits-ND (training)	0.9320	0.6952
Polyvore Outfits-ND (test)	0.9286	0.6885
Polyvore Outfits-D (training)	0.9078	0.6432
Polyvore Outfits-D (test)	0.8912	0.6358

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Fang, Y.; Ge, J.; Xiao, R.; Zhang, Y. Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling. Electronics 2026, 15, 332. https://doi.org/10.3390/electronics15020332

AMA Style

Fang Y, Ge J, Xiao R, Zhang Y. Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling. Electronics. 2026; 15(2):332. https://doi.org/10.3390/electronics15020332

Chicago/Turabian Style

Fang, Yan, Jiangnan Ge, Ran Xiao, and Yidan Zhang. 2026. "Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling" Electronics 15, no. 2: 332. https://doi.org/10.3390/electronics15020332

APA Style

Fang, Y., Ge, J., Xiao, R., & Zhang, Y. (2026). Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling. Electronics, 15(2), 332. https://doi.org/10.3390/electronics15020332

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Correlation-Aware Multimodal Fusion Network for Fashion Compatibility Modeling

Abstract

1. Introduction

2. Related Work

2.1. Modality Processing

2.2. Compatibility Evaluation

2.3. Summary

3. Proposed Methodology

3.1. Problem Formulation and Notations

3.2. Multimodal Feature Extraction

3.3. Multimodal Alignment

3.4. Visual Enhancement

3.5. Multimodal Fusion

3.6. Compatibility Evaluation

3.7. Loss Function

4. Experiments

4.1. Datasets

4.2. Design of the Experiments

4.2.1. Evaluation Tasks and Metrics

4.2.2. Experimental Parameters

4.3. Model Comparisons (Q1)

4.4. Ablation Experiments (Q2)

4.5. Parameter Sensitivity Analysis (Q3)

4.6. Case Study (Q4)

4.7. Generalization Ability Analysis (Q5)

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI