An Underwater Multi-Label Classification Algorithm Based on a Bilayer Graph Convolution Learning Network with Constrained Codec

Multi-label classification of micro-videos has achieved remarkable results on terrestrial datasets, whereas research on multi-label classification of underwater micro-videos is still at a preliminary stage. Several challenges remain: severe color distortion and visual blurring in underwater imaging caused by scattering and absorption in water, the difficulty of acquiring underwater short-video datasets, the sparsity of underwater short-video modality features, and the resulting difficulty of achieving high-precision underwater multi-label classification. To address these issues, a bilayer graph convolution learning network based on a constrained codec (BGCLN) is established in this paper. Specifically, a constrained codec network is constructed to learn the modality-common representation and the modality-specific representations, which capture the common and the specific information, respectively. Then, an attention-driven double-layer graph convolutional network module is designed to mine the correlation information between labels and enhance the modality representations. Finally, a modality representation fusion and multi-label classification module is used to obtain the category predictions. Extensive experiments on the underwater video multi-label classification dataset (UVMCD) demonstrate the effectiveness and high classification accuracy of the proposed BGCLN.


Introduction
In recent years, micro-video, as a new media form of user-generated content, has rapidly become one of the mainstream trends of social media thanks to its short, authentic, and instantly shareable nature. Micro-videos are concise yet rich in content; they combine several modalities, such as vision, audio, and text, and thus contain a vast amount of information. Therefore, it is of great significance to make full use of their multi-modal information for data mining and intelligent analysis. Doing so means mining the consistency and complementarity among the modalities to enhance the information representation. For example, when a micro-video shows dolphins swimming in the ocean, its audio usually contains the sound of the dolphins, but it is difficult to obtain information about the sea from the sound alone. In this example, "dolphin" is information expressed by both the visual and acoustic modalities, representing consistency, while "ocean" is information unique to the visual modality, representing complementarity. Fully utilizing the consistency and complementarity of multi-modal information is therefore an essential step in the classification task.
The research directions for multi-label classification of micro-videos mainly include the following: (1) Micro-video scene recognition. Nie et al. [1] proposed a deep transfer model to estimate the scene category of micro-videos. This model introduces external acoustic knowledge to compensate for the relatively low-quality audio modality, thus enhancing the semantic representation of micro-videos. (2) Micro-video event detection. Chang et al. [2] defined the concept of semantic saliency to evaluate the correlation of each video segment with the event of interest. They prioritized video segments based on saliency scores and leveraged the resulting semantic ranking information to improve the model's discriminative ability in event analysis tasks. (3) Micro-video popularity prediction. Chen et al. [3] proposed a transductive multi-modal learning model that seeks an optimal latent common subspace among different modalities to alleviate the information insufficiency arising from the short duration of micro-videos, thereby facilitating better representation. (4) Micro-video recommendation. Wei et al. [4] designed a multi-modal graph convolutional network framework. By constructing a bipartite graph of users and micro-videos for each modality, the authors used the interaction behavior between users and micro-videos to guide the representation learning of each modality, further capturing users' fine-grained preferences in different modalities.
Research on the multi-label classification of micro-videos has thus made considerable progress. However, related work on micro-videos is limited to either the consistency or the complementarity between modalities, rather than jointly considering consistency, complementarity, and multi-modal characterization. Furthermore, while these methods demonstrate promising performance on terrestrial video datasets, underwater environments pose significant challenges, such as color distortion, blurriness, and occlusion, which severely degrade video quality and consequently reduce detection and classification accuracy [5]. Moreover, underwater datasets are difficult to obtain, so the available modality information is sparse, and it is difficult to jointly mine multi-modal complementarity and consistency to enhance the information characterization. Consequently, exploring multi-modal learning approaches for underwater micro-videos that comprehensively utilize both the consistency and the complementarity of different modalities, fostering mutual reinforcement among modalities while mitigating redundancy, is of significant importance for scientific research on marine imagery.
For the multi-modal multi-label classification task of underwater micro-videos, this work makes three contributions: (1) An underwater video multi-label classification dataset (UVMCD) is constructed, containing 3841 underwater videos covering 19 underwater categories. Eight video classification methods are used to benchmark the dataset.
(2) In the original modality features, the common information between the modalities and the specific information within each modality are intertwined, and the redundancy between them can contaminate the extracted representation. A multi-modal representation learning method is therefore needed that separates these two parts of the information from the original features and minimizes their redundancy; the constrained codec network in this paper is designed for this purpose.
(3) Multi-label learning must consider the correlation between label categories. This correlation may be local, in that different groups of instances share different label correlations rather than a single globally applicable one. A method is therefore needed that learns label correlations adaptively from both global and local perspectives; the attention-driven double-layer graph convolutional network in this paper is designed for this purpose.

Related Work
Multi-label technology has been developed for many years. In the paradigm of multi-label classification, each object is associated with multiple labels simultaneously; the task of multi-label classification methods is to learn a function that can predict the corresponding label set of an input instance. Multi-label classification techniques have been widely used in various real-life scenarios such as medical diagnosis [6], bio-informatics [7], user analysis [8], and autonomous driving [9,10].
One of the main objectives of traditional multi-label classification methods is to extend mature single-label classification algorithms to the multi-label setting. According to how this extension is done, the methods can be divided into two categories: problem transformation methods and algorithm adaptation methods. The basic idea of problem transformation is to convert the multi-label classification task into one or more single-label classification tasks. For example, the binary relevance (BR) method [11] transforms a multi-label classification task into multiple independent single-label binary classification tasks and then aggregates the output of each binary classifier to obtain the final multi-label prediction. The basic idea of algorithm adaptation is to improve and extend an existing single-label classification algorithm so that it can process multi-label data directly. The models often extended in this way include k-nearest neighbors (KNN), decision trees, support vector machines (SVMs), and neural networks. Traditional methods often overlook label correlations or account for them only in rudimentary ways.
In recent years, deep learning has made great progress and achieved breakthroughs in many application scenarios. Yeh et al. [12] first proposed a model based on a deep neural network (DNN), the Canonical Correlated Autoencoder (C2AE), which integrates canonical correlation analysis (CCA) with an autoencoder to learn a deep latent space that jointly embeds features and labels, better associating the feature and label domains and improving classification performance. Fei et al. [13] proposed a latent sentiment memory network (LSMN) tailored for the multi-label sentiment classification of texts. This network learns the distribution of latent sentiments without relying on external knowledge and effectively integrates it into the classification network.
Multi-modal representation learning aims to represent and extract effective semantic information, and the heterogeneity between data of different modalities is a major challenge in constructing multi-modal representations. The methods are usually divided into two categories: joint representation and coordinated representation. Joint representation projects the multi-modal data into a common representation space. Rajagopalan et al. [14] designed a multi-view LSTM network for multi-modal action recognition and image captioning tasks, which explicitly learns view-specific changes and cross-view interactions over time or structured outputs. In contrast, coordinated representation methods process each modality independently. Fan et al. [15] combined CCA with a generative adversarial network (GAN) to propose a deep adversarial CCA model, which can simultaneously learn representations of multi-view data while possessing the ability to generate authentic multi-view samples.
Multi-modal fusion is one of the most studied directions in the field of multi-modal learning. By the stage at which fusion takes place, it is divided into early fusion (feature-level fusion), late fusion (decision-level fusion), and hybrid fusion between the two. Srivastava et al. [16] proposed a multi-modal data generation model based on the deep Boltzmann machine (DBM) that generates feature representations of missing modalities and combines cross-modal features into fused features for classification and information-retrieval tasks. Conventional multi-modal fusion methods mainly use kernel learning [17], graphical models [18], and CCA [19]. However, because of the flexibility of deep learning models, this traditional categorization no longer fits deep-learning-based multi-modal fusion methods well. The deep autoencoder (DAE) aims to encode the input data into a meaningful compressed form; it consists of two parts, an encoder and a decoder. The encoder maps high-dimensional input data to a low-dimensional space to obtain a latent representation, and the decoder decodes the latent representation to reconstruct the original input data. Shen et al. [20] proposed an attention-based multi-modal DAE model that extracts multi-modal social media content (such as text, images, and micro-videos), learns cross-modal latent representations, and uses an attention mechanism to integrate users' global and contextual music preferences with variable weights, thereby applying social media content to the music recommendation task. Guo et al. [21] proposed two improvements to the attention mechanism, normalization and geometry awareness: the former reparameterizes self-attention, and the latter extends the attention mechanism to explicitly consider the relative geometric relationships of the input objects. Experiments on video description, machine translation, and visual question answering verified the generality of the improvements.
Azad et al. [22] proposed a multi-label video classification model that uses a self-attention mechanism to capture spatiotemporal attention across consecutive video frames, improving on existing methods for underwater hull video inspection that consider the spatial information of only a single frame. Sun et al. [23] introduced a single-channel multi-target underwater acoustic signal recognition method based on deep learning, aiming to address two subproblems: identifying the unique and the repeated categories of multiple targets within a specified class. Algorithms for multi-label video classification based on multi-modal fusion, which leverage the multi-modal information embedded in videos, have also been proposed. Le et al. [24] passed multi-modal information through a unified transformer architecture to learn joint multi-modal representations for multi-label video sentiment recognition. Cai et al. [25] proposed a multi-modal movie-type classification framework, which makes full use of the complementary information in the modalities and improves classification performance. In multi-modal learning, multi-modal data describe the same object from different levels and perspectives, which can often complement each other. However, data from different information sources are heterogeneous, which hinders direct information interaction between modalities and masks the intrinsic strong correlation and semantic consistency between them. In addition, when modalities interact, noise in one modality may contaminate the others and reduce model performance. Therefore, how to cope with the sparsity of underwater information, how to reduce the heterogeneity between modalities, and how to mine the intrinsically consistent and complementary semantic information of different modalities while removing redundant information and filtering noise are the main challenges for researchers.

The Algorithm Model
The overall framework of the bilayer graph convolution learning network based on a constrained codec (BGCLN) is shown in Figure 1. It can be roughly divided into three parts: the modality-specific and modality-common representation learning module, the attention-driven double-layer graph convolutional network module, and the modality representation fusion and multi-label classification module.
Assume the existence of a micro-video collection, $\chi = \{x_1, x_2, \ldots, x_N\}$, as the training data, which comprises a total of $N$ micro-video samples. The $i$-th ($i = 1, 2, \ldots, N$) micro-video sample in this collection can be represented by pre-extracted multi-modal features $x_m^i \in \mathbb{R}^{D_m}$, $m \in \{v, t, a\}$, where $D_m$ denotes the dimensionality of the features of modality $m$, and $m$ serves as a modality indicator, with values $v$, $t$, and $a$ representing the visual, trajectory, and acoustic modalities, respectively. Additionally, the true label of $x_i$ can be described by a binary label vector, $y_i \in \{0, 1\}^C$, where 1 indicates the presence of the label in $x_i$ and 0 indicates its absence. The total number of label categories is $C$. Without loss of generality, $x_m$ is used hereinafter to represent the multi-modal features corresponding to any given micro-video sample in the collection.

Modality-Specific Representation and Modality-Common Representation Learning Modules
Given the consistency and complementarity among the information of different modalities, this paper proposes a codec network structure with orthogonal, adversarial similarity, and reconstruction constraints to learn the modality-common representation and the specific representation of each modality. The main body of this module consists of modality-specific encoders, a modality-common encoder, a modality-common decoder, and a modality discriminator.
First, the visual, trajectory, and acoustic modality features, $x_m$, are input into the private coding network corresponding to each modality (as shown in Figure 2) to obtain the specific representation, $z_m^s$, of each modality:
$$z_m^s = E_m(x_m; \theta_m) \in \mathbb{R}^{d_m}, \quad m \in \{v, t, a\}, \tag{1}$$
where $d_m$ is the number of dimensions of each modality representation after encoding, $E_m(\cdot)$ is the private feature encoder of the corresponding modality, and $\theta_m$ denotes its learnable network parameters. The encoder is a stack of multiple fully connected layers.

At the same time, the visual, trajectory, and acoustic modality features, $x_m$, are input into the modality-common coding network to generate a common representation based on each modality:
$$z_m^c = E_c(x_m; \theta_c) \in \mathbb{R}^{d_c}, \quad m \in \{v, t, a\}, \tag{2}$$
where $d_c$ is the number of dimensions of the common representation after encoding, $E_c(\cdot)$ is the modality-common feature encoder, which has the same network structure as $E_m(\cdot)$, and $\theta_c$ denotes the learnable network parameters of this encoder. For consistency, the modality-common feature encoder is dedicated to extracting the information shared between the modalities, so the common representations learned from different modalities should agree during model training. The average of the common representations generated from the individual modalities is therefore taken as the final modality-common representation:
$$z^c = \frac{1}{M} \sum_{m \in \{v, t, a\}} z_m^c, \tag{3}$$
where $M = 3$ indicates that the features of three modalities are used as input.
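To make the encoding step concrete, the following is a minimal NumPy sketch of one private encoder per modality, one shared common encoder, and the averaging of the per-modality common representations. It is an illustration rather than the authors' implementation: the layer sizes, the ReLU activation, the random stand-in features, and the assumption that all three modalities have the same input dimension (so that a single shared encoder can consume them) are hypothetical choices not stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(dims, rng):
    """Create (weight, bias) pairs for a stack of fully connected layers."""
    return [(rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def mlp_forward(x, layers):
    """Apply the layers in order, with ReLU between them (assumed activation)."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

MODALITIES = ("v", "t", "a")   # visual, trajectory, acoustic
D, d_s, d_c = 32, 8, 8         # hypothetical sizes; all D_m assumed equal here

# One private encoder E_m per modality, and a single shared encoder E_c.
private_encoders = {m: make_mlp([D, 16, d_s], rng) for m in MODALITIES}
common_encoder = make_mlp([D, 16, d_c], rng)

# Random stand-ins for the pre-extracted features x_m of one micro-video.
x = {m: rng.normal(size=D) for m in MODALITIES}

z_s = {m: mlp_forward(x[m], private_encoders[m]) for m in MODALITIES}
z_c_m = {m: mlp_forward(x[m], common_encoder) for m in MODALITIES}
z_c = np.mean([z_c_m[m] for m in MODALITIES], axis=0)  # average over M = 3
```

Sharing one parameter set across modalities for the common encoder is what lets the later constraints push the three common representations toward each other.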

Orthogonal Constraints
Inspired by the work in [26,27], orthogonal constraints are introduced to encourage the common coding network and the private coding networks to explore different aspects of the input modality features. This separates the common and specific information between modalities, so that the modality-specific representation contains as little shared information as possible. The orthogonal loss measures the similarity between the common representation and the specific representation of each modality, and can be defined as follows:
$$L_{Ort} = \sum_{m \in \{v, t, a\}} \left\| (z_m^s)^{\top} z_m^c \right\|_F^2, \tag{4}$$
where $\|\cdot\|_F$ represents the Frobenius norm. When $z_v^s = z_v^c$, $z_t^s = z_t^c$, and $z_a^s = z_a^c$, the orthogonal loss $L_{Ort}$ reaches its maximum. Therefore, by minimizing $L_{Ort}$, the modality-specific representation is made as orthogonal as possible to the modality-common representation, which separates the two and reduces redundant information.
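As an illustration of how such an orthogonality penalty behaves, the sketch below treats each representation as a vector and penalizes the squared inner product between the specific and common representations of the same modality, summed over modalities and samples. This per-sample reading of the Frobenius-norm loss is an assumption made for the example; the function name and the toy inputs are hypothetical.

```python
import numpy as np

def orthogonal_loss(z_s, z_c):
    """Sum over modalities and samples of the squared inner product <z_s, z_c>.

    z_s, z_c: dicts mapping a modality name to an array of shape (batch, d).
    """
    total = 0.0
    for m in z_s:
        inner = np.sum(z_s[m] * z_c[m], axis=1)  # one inner product per sample
        total += float(np.sum(inner ** 2))
    return total

e1 = np.array([[1.0, 0.0]])
e2 = np.array([[0.0, 1.0]])
# Identical unit vectors in all three modalities: the penalty is at its largest.
same = orthogonal_loss({m: e1 for m in "vta"}, {m: e1 for m in "vta"})
# Orthogonal specific and common vectors: the penalty vanishes.
orth = orthogonal_loss({m: e1 for m in "vta"}, {m: e2 for m in "vta"})
```

With unit-norm inputs the penalty is zero exactly when the two representations are orthogonal and largest when they coincide, which matches the behavior the text describes.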

Adversarial Similarity Constraints
Considering the consistency between modalities, the common representations learned from different modality features should be as consistent as possible. Inspired by other work [26,28,29], we design adversarial similarity constraints based on the idea of adversarial training. Specifically, the modality-common coding network is treated as a generator, $G(x_m; \theta_c)$, and an $M$-class classifier, $D(z_m^c; \theta_d)$, with learnable network parameters $\theta_d$, is introduced as the modality discriminator. The modality discriminator $D$ takes the common representations generated from different modalities as input and tries to correctly identify their source modalities, while the generator $G$ aims to generate common representations that confuse $D$. The two learn against each other during training, so that the common representations generated from different modalities tend to become consistent.
Inspired by [30], a gradient reversal layer (GRL) is used to implement this adversarial learning by redefining the backward function of the module. A network containing the GRL behaves exactly like the original network during forward propagation, but during backpropagation the gradient is multiplied by a negative constant to reverse its direction. The GRL thus reverses the gradient direction for the learnable parameters $\theta_c$ of the generator $G$, which creates the confrontation between the discriminator and the generator.
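The behavior of a gradient reversal layer can be shown with a small framework-free sketch: the forward pass is the identity, and the backward pass multiplies the incoming gradient by a negative constant. The class name, the explicit `forward`/`backward` methods, and the scaling constant `lam` are hypothetical; in an automatic differentiation framework the same idea would normally be written as a custom operation with an overridden backward function.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass, sign-flipped gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # positive scale applied to the reversed gradient

    def forward(self, x):
        # Forward propagation leaves the representation unchanged.
        return x

    def backward(self, grad_output):
        # Backpropagation multiplies the gradient by the negative constant -lam.
        return -self.lam * grad_output

grl = GradientReversal(lam=0.5)
z = np.array([1.0, -2.0])
out = grl.forward(z)
grad_in = grl.backward(np.array([2.0, 4.0]))
```

Placed between the common encoder and the discriminator, such a layer lets a single backward pass update the discriminator in one direction and the encoder in the opposite one.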
The adversarial similarity loss, $L_{Adv}$, is defined as follows:
$$L_{Adv} = \sum_{i=1}^{N} \sum_{m \in \{v, t, a\}} l_m^i \log P_m^i(z_m^c), \tag{5}$$
where $l_m^i \in \{0, 1\}$ describes the source modality of the current common representation $z_m^c$, with 1 indicating that the common representation is generated by the currently indicated modality $m$, and $P_m^i(z_m^c)$ is the probability, predicted by the modality discriminator $D$, that the common representation comes from the indicated modality $m$. In essence, the generator $G$ is dedicated to minimizing $L_{Adv}$, while the modality discriminator $D$ is dedicated to maximizing it, forming a confrontation between the two.
The adversarial similarity loss $L_{Adv}$ measures the difference between the common representations generated from different modalities. When $L_{Adv}$ is large, the modality discriminator can better judge the source modality of a representation. Therefore, the similarity of the common representations obtained from different modalities is improved by minimizing $L_{Adv}$ with respect to the generator.
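The following sketch evaluates an adversarial similarity loss of this kind for a small batch. It assumes that the loss is the log-likelihood the discriminator assigns to the true source modality, which matches the statement that the discriminator maximizes it and the generator minimizes it; the softmax discriminator output, the function names, and the toy logits are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def adversarial_loss(logits, source):
    """Sum of log-probabilities assigned to the true source modality.

    logits: (n, M) discriminator scores for n common representations.
    source: (n,) integer index of the modality that produced each one.
    """
    probs = softmax(logits)
    picked = probs[np.arange(len(source)), source]
    return float(np.sum(np.log(picked)))

source = np.array([0, 1, 2])              # one representation per modality
confused = adversarial_loss(np.zeros((3, 3)), source)      # uniform guesses
confident = adversarial_loss(10.0 * np.eye(3), source)     # near-perfect guesses
```

Under this assumed sign convention, a fully confused discriminator yields `3 * log(1/3)`, while an accurate one yields a value close to zero, so lowering the value on the generator side corresponds to common representations that are harder to tell apart.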

Reconstruction Constraints
To ensure the integrity and validity of each modality's information during the encoding process, reconstruction constraints are introduced. Specifically, the specific representation and the common representation generated from the same modality are first input into the modality-common decoder network to obtain the reconstruction vector, $\hat{x}_m$, of the modality features:
$$\hat{x}_m = D_s(z_m^s, z_m^c; \theta_s), \tag{6}$$
where $D_s(\cdot)$ represents the modality-common decoder and $\theta_s$ denotes its learnable network parameters. The reconstructed vector $\hat{x}_m$ should be as similar as possible to the original modality features, $x_m$, so that the encoded representations retain the effective information in the original features to the greatest extent. The scale-invariant mean squared error is used to measure the difference between $x_m$ and $\hat{x}_m$, and the reconstruction loss is calculated as follows:
$$L_{Rec} = \sum_{m \in \{v, t, a\}} \left( \frac{1}{k} \left\| x_m - \hat{x}_m \right\|_2^2 - \frac{1}{k^2} \left( (x_m - \hat{x}_m) \cdot 1_k \right)^2 \right), \tag{7}$$
where $\|\cdot\|_2^2$ is the squared L2 norm, $k$ is the number of elements contained in $x_m$, and $1_k$ is a vector of ones of length $k$. The model minimizes $L_{Rec}$ to ensure the integrity of the modality feature information during the encoding process.
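The reconstruction measure can be illustrated for a single modality with the scale-invariant mean squared error, written here as the mean squared difference minus the squared mean difference. This form is an assumption consistent with the symbols defined in the text (the L2 norm, the element count k, and the ones vector of length k); the function name and the toy vectors are hypothetical.

```python
import numpy as np

def scale_invariant_mse(x, x_hat):
    """(1/k) * ||x - x_hat||^2  -  (1/k^2) * (sum(x - x_hat))^2."""
    diff = x - x_hat
    k = diff.size
    return float(np.sum(diff ** 2) / k - np.sum(diff) ** 2 / k ** 2)

x = np.array([1.0, 2.0, 3.0, 4.0])
exact = scale_invariant_mse(x, x)                         # perfect reconstruction
shifted = scale_invariant_mse(x, x + 0.5)                 # constant offset only
one_off = scale_invariant_mse(x, np.array([1.0, 2.0, 3.0, 5.0]))
```

The second term cancels errors that are the same in every element, so a reconstruction that is off by a constant is not penalized, while an error in a single element still is.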

Attention-Driven Double-Layer Graph Convolutional Network Module
Considering that label category correlations can be both global and local, this paper designs an attention-driven two-layer graph convolutional network that adaptively learns correlation matrices to mine the dependencies between labels from a global perspective and a sample-specific local perspective, respectively, to enhance the modality representation. In addition, an attention mechanism is introduced in the GCN to mine the correlation structure in both the feature dimension and the label category dimension of specific samples, further enhancing the modality representation. Residual connections are also added between the two graph convolution layers to prevent network degradation.
Since both the modality-common representation $z^c$ and the modality-specific representations $z_m^s$ need to be processed by the bilayer GCN, without loss of generality, $z \in \mathbb{R}^d$ is used below to denote either of them, where $d$ is the number of dimensions of the modality representation.

Static Graph Convolutional Network
The first layer of the attention-driven bilayer graph convolutional network is a static graph convolutional network; its structure is presented in Figure 3. The modality representation $z$ is first expanded along the row dimension to obtain the initial category representation matrix $V \in \mathbb{R}^{C \times d}$, where the $j$-th ($j = 1, 2, \ldots, C$) row of the matrix is the class representation of the sample specific to the $j$-th label. Accordingly, a global static graph with $C$ nodes is constructed. The initial category representation matrix $V$ is the node feature matrix of the graph, and its correlation matrix characterizes the label dependence over the whole training dataset. $V$ is fed into the static GCN to obtain the intermediate output $H$, which can be described by the formula:
$$H = \mathrm{LeakyReLU}(A_s V W_s), \tag{8}$$
where $\mathrm{LeakyReLU}(\cdot)$ is the nonlinear activation function, $A_s \in \mathbb{R}^{C \times C}$ is the correlation matrix of the static GCN, and $W_s \in \mathbb{R}^{d \times d_1}$ is the state update matrix of the static GCN, characterizing the linear transformation from dimension $d$ to dimension $d_1$. Both are randomly initialized and updated by gradient descent during training. $A_s$ is shared by all the training samples in the dataset and is therefore able to capture global label correlation information.
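A static graph convolution of this form can be sketched in a few lines. The example assumes that the initial category matrix is built by copying the sample representation into each of the C rows and that the layer computes a LeakyReLU of the product of the correlation matrix, the category matrix, and the state update matrix; the negative slope of 0.2, the dimensions, and the random parameters are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU with an assumed negative slope."""
    return np.where(x > 0, x, slope * x)

def static_gcn(z, A_s, W_s):
    """Expand z into C identical rows, then apply one graph convolution."""
    C = A_s.shape[0]
    V = np.tile(z, (C, 1))            # (C, d): one row per label category
    H = leaky_relu(A_s @ V @ W_s)     # (C, d1): intermediate output
    return V, H

rng = np.random.default_rng(0)
C, d, d1 = 19, 8, 6                   # 19 labels as in UVMCD; d, d1 hypothetical
z = rng.normal(size=d)
A_s = rng.normal(size=(C, C))         # shared by all samples, learned in training
W_s = rng.normal(size=(d, d1))
V, H = static_gcn(z, A_s, W_s)
```

Because `A_s` does not depend on the input sample, every sample is propagated over the same label graph, which is what makes this layer capture global label correlations.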

Dynamic Graph Convolutional Network Based on Attention Perception
The second layer of the attention-driven bilayer graph convolutional network is a dynamic graph convolutional network based on attention perception. It introduces an attention mechanism to dynamically capture, for each specific sample, the correlation structure in the feature dimension and the correlation structure in the label category dimension. The intermediate output $H$ of the static GCN is further enhanced to obtain the enhanced category representation $Z$. The structure of the dynamic GCN is presented in Figure 3.
Inspired by [31], the model introduces the attention mechanism to mine the correlation structure of the sample itself and dynamically generates the category correlation matrix $A_d$ and the feature correlation matrix $F$ for specific samples. First, the attention score based on the scaled dot product is calculated as follows:
$$\mathrm{Attention}(Q, K) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right), \tag{9}$$
where $\mathrm{softmax}(\cdot)$ is the nonlinear activation function, $Q$ and $K$ represent the query matrix and the key matrix, respectively, and $d_k$ is the scale factor, whose value equals the number of dimensions of the key matrix $K$.
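The scaled dot-product attention score can be sketched as follows, assuming the standard form in which the query-key products are divided by the square root of the key dimension and normalized with a row-wise softmax. The function names and the random inputs are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with a max-shift for numerical stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_scores(Q, K):
    """softmax(Q K^T / sqrt(d_k)), where d_k is the key dimension."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k))

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))           # 4 queries of dimension 8
K = rng.normal(size=(5, 8))           # 5 keys of dimension 8
scores = attention_scores(Q, K)
uniform = attention_scores(np.zeros((2, 8)), K)  # zero queries: equal weights
```

Each row of the result is a probability distribution over the keys, which is why the same function can serve as a correlation matrix over feature dimensions or over label categories.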
The feature correlation matrix $F$ is defined by Formula (10), which applies the attention score to query and key matrices obtained by projecting the intermediate output $H$ with learnable mapping matrices.
The attention score matrix calculated by Formula (10) is used as the feature correlation matrix $F$ to characterize the correlation structure of the sample feature dimensions. Then, a sample-specific dynamic graph is constructed. Its node feature matrix is $H'$, the intermediate output $H$ enhanced by the feature correlation matrix $F$, and the attention mechanism is used to dynamically generate its correlation matrix $A_d$, which characterizes the label dependence of the specific sample. The dynamic GCN takes $H'$ as input and outputs the further enhanced category representation $Z$, which can be formally defined as follows:
$$Z = \mathrm{LeakyReLU}(A_d H' W_d), \tag{11}$$
where $A_d \in \mathbb{R}^{C \times C}$ is the correlation matrix of the dynamic GCN and $W_d \in \mathbb{R}^{d_1 \times d_2}$ is its state update matrix.
In short, the proposed attention-driven bilayer graph convolutional network enhances the modality representation by introducing label category correlation information and feature correlation information.
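The following sketch shows one way the dynamic layer could be assembled from attention scores. Several details are assumptions made only for illustration, since the text does not fix them: the feature correlation matrix is computed by attending over the columns of the intermediate output, the enhanced node features are the product of the intermediate output and that matrix, the label correlation matrix is computed by attending over the rows of the enhanced features, and LeakyReLU is reused as the activation. All parameter names and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_scores(Q, K):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]))

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def dynamic_gcn(H, p):
    """Sample-specific feature and label correlations, then a graph convolution."""
    F = attention_scores(H.T @ p["Wq_f"], H.T @ p["Wk_f"])    # (d1, d1)
    H_prime = H @ F                                            # (C, d1)
    A_d = attention_scores(H_prime @ p["Wq_a"], H_prime @ p["Wk_a"])  # (C, C)
    Z = leaky_relu(A_d @ H_prime @ p["W_d"])                   # (C, d2)
    return F, A_d, Z

rng = np.random.default_rng(0)
C, d1, d2, d_k = 19, 6, 4, 5
params = {
    "Wq_f": rng.normal(size=(C, d_k)), "Wk_f": rng.normal(size=(C, d_k)),
    "Wq_a": rng.normal(size=(d1, d_k)), "Wk_a": rng.normal(size=(d1, d_k)),
    "W_d": rng.normal(size=(d1, d2)),
}
H = rng.normal(size=(C, d1))          # stand-in for the static GCN output
F, A_d, Z = dynamic_gcn(H, params)
```

Unlike the static layer, both correlation matrices here are recomputed from each sample's own features, so different samples are propagated over different label graphs.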

Modality Representation Fusion with Multi-Label Classification
As shown in Figure 1, the modality-specific representations $z_m^s$ and the modality-common representation $z^c$ are input into the attention-driven bilayer graph convolutional network to obtain the corresponding enhanced category representation matrices $Z_c$, $Z_v$, $Z_t$, and $Z_a$, which are then weighted to obtain the final category representation:
$$Z' = W_c \odot Z_c + W_v \odot Z_v + W_t \odot Z_t + W_a \odot Z_a, \tag{12}$$
where $\odot$ is the Hadamard product, i.e., the corresponding elements of the matrices are multiplied, and $W_v$, $W_t$, $W_a$, $W_c \in \mathbb{R}^{C \times d_2}$ are adaptively learned fusion weight matrices, representing the contribution of each modality-specific representation and of the modality-common representation to the corresponding label category. Then, each label-specific category vector $z_j$ ($j = 1, 2, \ldots, C$) in the final category representation matrix $Z' = [z_1, z_2, \ldots, z_C]$ is put into the corresponding binary classifier to obtain the prediction score of each category, $s = [s_1, s_2, \ldots, s_C]$. The classification loss is calculated as follows:
$$L_{Cls} = -\sum_{i=1}^{N} \sum_{j=1}^{C} \left[ y_j^i \log \delta(s_j^i) + (1 - y_j^i) \log\left(1 - \delta(s_j^i)\right) \right], \tag{13}$$
where $\delta(\cdot)$ is the sigmoid activation function, $y_j^i$ is the true label of the $i$-th sample for category $j$, taking 1 if the sample has the class label $j$ and 0 otherwise, and $s_j^i$ is the score predicted by the network model for that label. Combining Formulas (4), (5), (7), and (13), the overall loss function of the proposed BGCLN model is obtained as follows:
$$L = L_{Cls} + \alpha L_{Ort} + \beta L_{Adv} + \gamma L_{Rec}, \tag{14}$$
where $\alpha$, $\beta$, and $\gamma$ are the trade-off parameters balancing the different loss contributions. The pseudo-code of the training process of the proposed BGCLN model is shown in Algorithm 1.
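The fusion and loss computation can be sketched as follows. The example assumes an element-wise weighted sum of the four enhanced category matrices, a sigmoid binary cross-entropy summed over labels, and an overall loss in which the three trade-off parameters weight the orthogonal, adversarial, and reconstruction terms in that order; this pairing of parameters with terms, the function names, and the toy values are assumptions made for illustration.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def fuse(Z, W):
    """Element-wise (Hadamard) weighted sum of the category matrices."""
    return sum(W[k] * Z[k] for k in Z)

def classification_loss(scores, labels, eps=1e-12):
    """Binary cross-entropy on sigmoid scores, summed over samples and labels."""
    p = sigmoid(scores)
    return float(-np.sum(labels * np.log(p + eps)
                         + (1 - labels) * np.log(1 - p + eps)))

def total_loss(l_cls, l_ort, l_adv, l_rec, alpha=0.1, beta=0.05, gamma=0.05):
    """Weighted sum of the four loss terms (assumed assignment of weights)."""
    return l_cls + alpha * l_ort + beta * l_adv + gamma * l_rec

C, d2 = 19, 4
Z = {k: np.ones((C, d2)) for k in ("c", "v", "t", "a")}
W = {k: np.full((C, d2), 0.25) for k in Z}   # equal contributions, for the demo
Z_final = fuse(Z, W)

labels = np.array([[1.0, 0.0, 1.0]])
undecided = classification_loss(np.zeros((1, 3)), labels)   # all scores at 0
combined = total_loss(1.0, 2.0, 4.0, 6.0)
```

With all scores at zero the sigmoid outputs 0.5 for every label, so the loss equals three times the natural logarithm of 2, which gives a convenient reference point when checking an implementation.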

Algorithm 1 BGCLN training process
Data input: $\{x_v, x_t, x_a\}$: visual, trajectory, and acoustic features of the micro-video; $y$: the true category label vector of the micro-video; $\alpha = 0.1$, $\beta = 0.05$, $\gamma = 0.05$: the coefficients of the loss terms.
1: Randomly initialize all network parameters;
2: Repeat
3: For i = 1, 2, ..., epoch do
4: Use Formula (1) to calculate the specific representation $z_m^s$ of each modality;
5: Use Formulas (2) and (3) to calculate the common representation $z^c$;
6: Use Formula (6) to calculate the reconstructed vector $\hat{x}_m$;
7: Use Formula (8) to update the category representation $V$ to the intermediate output $H$;
8: Use Formula (10) to calculate the feature correlation matrix $F$ from the intermediate output $H$;
9: Use Formula (11) to calculate the enhanced category representation $Z$;
10: Use Formula (12) to calculate the fused category representation $Z'$;
11: Update all network parameters using stochastic gradient descent under Formula (14);
12: End for
13: Until convergence.
Data output: all trained network parameters $\theta_m$, $\theta_c$, $\theta_d$, $\theta_s$, etc.

Dataset and Experimental Settings
To facilitate the development of tracking algorithms well-suited for underwater environments and address the lack of existing underwater visual datasets, Panetta et al. proposed the first comprehensive underwater object tracking (UOT100) benchmark dataset [32,33]. This dataset consists of 104 underwater video sequences and over 74,000 annotated frames, which are derived from both natural and artificial underwater videos and cover a variety of distortions. The UOT100 dataset is available at https://www.kaggle.com/datasets/landrykezebou/uot100-underwater-object-tracking-dataset (accessed on 23 January 2024).
Given the scarcity of multi-label classification datasets for underwater micro-videos [22], this paper combines the UOT100 dataset with relevant underwater content from the MLSV2018 dataset. By processing and relabeling these micro-videos, an underwater micro-video multi-label classification dataset (UVMCD) was established, as shown in Figure 4. The MLSV2018 dataset is available at https://github.com/tjufan/challengerai-mlsv2018 (accessed on 20 January 2024). All the experiments described in this paper were conducted on the UVMCD, a large-scale multi-label classification dataset for micro-videos, in order to verify BGCLN.
The UVMCD is composed of 3841 underwater micro-videos and their corresponding audio tracks, and each micro-video has corresponding category labels. The dataset has 19 label categories, with 1-5 labels per micro-video. The number and proportion of labels in the dataset are shown in Table 1, and the label distribution is shown in Figure 5. In our experiments, 80% of the data is used for training, and the remaining 20% is used for evaluation.
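As a small illustration of the 80%/20% partition, the sketch below splits sample indices at random. The text does not state how the split was drawn, so the random permutation, the seed, and the rounding rule are assumptions, and `split_indices` is a hypothetical helper name.

```python
import numpy as np

def split_indices(n_samples, train_fraction=0.8, seed=0):
    """Randomly partition sample indices into train and test subsets.
    The shuffling and the fixed seed are assumptions; the paper only gives the ratio."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_train = int(train_fraction * n_samples)  # rounds down
    return perm[:n_train], perm[n_train:]

# 3841 is the number of micro-videos reported for the UVMCD.
train_idx, test_idx = split_indices(3841)
```

For a multi-label dataset with an uneven label distribution, a stratified split may be preferable, but that goes beyond what the text describes.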

Performance Evaluation
The performance of the algorithm is evaluated using five metrics [34]: mean average precision (mAP), Hamming loss, ranking loss, coverage, and one-error.
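For readers unfamiliar with these metrics, the following NumPy sketch shows one common way to compute them from a binary label matrix `Y` (samples × labels) and a score matrix `S` of the same shape. The exact definitions used in [34] may differ in detail. In particular, computing mAP as the mean of per-label average precision, thresholding scores at 0.5 for the Hamming loss, and counting ties as ranking errors are assumptions of this sketch.

```python
import numpy as np

def hamming_loss(Y, S, thr=0.5):
    """Fraction of label slots where the thresholded prediction is wrong."""
    return float(np.mean((S >= thr).astype(int) != Y))

def one_error(Y, S):
    """Fraction of samples whose top-scored label is not a true label."""
    top = np.argmax(S, axis=1)
    return float(np.mean(Y[np.arange(len(Y)), top] == 0))

def coverage(Y, S):
    """Average depth (0-based) in the ranking needed to cover all true labels.
    Assumes every sample has at least one true label."""
    order = np.argsort(-S, axis=1)
    ranks = np.argsort(order, axis=1)  # ranks[i, j] = position of label j for sample i
    return float(np.mean(np.where(Y == 1, ranks, -1).max(axis=1)))

def ranking_loss(Y, S):
    """Average fraction of (true, false) label pairs that are ordered wrongly."""
    losses = []
    for y, s in zip(Y, S):
        pos, neg = s[y == 1], s[y == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue  # no pairs to compare for this sample
        losses.append(np.mean(pos[:, None] <= neg[None, :]))
    return float(np.mean(losses))

def mean_average_precision(Y, S):
    """Mean over labels of average precision (labels without positives are skipped)."""
    aps = []
    for j in range(Y.shape[1]):
        y, s = Y[:, j], S[:, j]
        if y.sum() == 0:
            continue
        hits = y[np.argsort(-s)]
        prec_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
        aps.append(np.sum(prec_at_k * hits) / hits.sum())
    return float(np.mean(aps))

# Tiny illustrative example: 2 samples, 3 labels.
Y = np.array([[1, 0, 0], [0, 1, 1]])
S = np.array([[0.9, 0.2, 0.1], [0.8, 0.7, 0.3]])
```

Higher is better for mAP, while lower is better for the other four metrics.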

Convergence Analysis
To analyze the convergence of the proposed model, the mean average precision and the classification loss were recorded against the number of training iterations, as shown in Figure 6a and Figure 6b, respectively. The mean average precision increases with the number of iterations and stabilizes at its best value at around 40 iterations. Meanwhile, the classification loss decreases as the number of iterations increases and eventually stabilizes.

Ablation Experiments and Analysis
Ablation experiments were conducted to evaluate model performance under different modality combinations and different module configurations. Table 2 compares the classification results of the different schemes. In Table 2, ↑ indicates that a higher value of the metric corresponds to better model performance, and ↓ indicates that a lower value corresponds to better model performance. The multi-modal fusion method outperforms the unimodal methods, which indicates that integrating multiple modalities effectively exploits the complementarity between them and promotes compatibility modeling. Meanwhile, each individual modality has a positive impact on model performance, to differing degrees. The audio modality is superior to the visual modality, which suggests that the audio modality contains more valuable information about underwater categories than the visual modality. In both the unimodal and multi-modal cases, the graph associative learning module plays an important role in the proposed framework: by learning the semantic association representation between labels, it helps to better complete the multi-label classification task for underwater micro-videos, which supports the necessity of this module.

Model Validity Analysis
To illustrate the validity of the method presented in this paper, it is compared with the following different types of methods, with all experiments using the same training and test sets. Table 3 compares the performance of the different methods on the UVMCD dataset. In Table 3, it can be observed that the deep representation-based methods, namely GoogleNet and C3D, typically model only a single modality and lack semantic correlation modules. They are also sensitive to lighting, color cast, and water flow disturbances in underwater micro-videos, which results in unsatisfactory performance. In addition, the results of the four multi-label learning methods, MLKNN, GLOCAL, SIMM, and TM3L, are not satisfactory, but they perform better than the baseline methods on the coverage metric. This indicates that multi-label learning methods can better handle the correlation between labels, and they have lower complexity than multi-modal semantic enhancement methods. Finally, the multi-label classification method based on multi-modal semantic enhancement achieved relatively promising results, reflecting the positive role of multi-modal fusion semantic enhancement in multi-label classification tasks.

Summary
In this paper, a bilayer graph convolution learning network based on a constrained codec (BGCLN) is proposed. First, considering the consistency and complementarity of multi-modal information, modality-specific and modality-common representations are learned through codec networks with constraints. Second, the correlation information between labels is mined from both global and local perspectives through the attention-driven bilayer graph convolutional network, in which an attention mechanism is introduced into the graph convolution to explore the correlation structure of the samples in the label and feature dimensions and thereby enhance the modality representation. Finally, the enhanced public representation and each modality-specific representation are weighted, fused, and input into the classifier to complete the multi-label classification task. A series of experiments on the large-scale multi-label micro-video dataset UVMCD shows that the proposed BGCLN model performs well in parameter sensitivity analysis, module ablation analysis, modality combination analysis, and other aspects. Compared with other models, BGCLN also achieves better classification performance, which verifies its effectiveness.

(1) Modality-specific representation and modality-common representation learning modules: The modality-common representation is learned through adversarial training, and the orthogonal constraint separates the common information and the specific information of the modality features to reduce the redundancy between the learned representations. The reconstruction constraint preserves the effective information in the original modality features as much as possible.
(2) Attention-driven double-layer graph convolution network module: A two-layer graph convolutional network (GCN) mines label correlation information from the global and local perspectives, and an attention mechanism is introduced in the second GCN layer to mine the correlations of the samples in the feature and label category dimensions, which enhances the modality representation.
(3) Modality representation fusion and multi-label classification module: The weighted fusion of the enhanced modality-common representation and each modality-specific representation is taken as the final micro-video representation, and the fusion weights are adaptively learned by the model. The resulting representation is then input into the classifier to obtain the category prediction scores.

Electronics 2024, 13
Figure 1. Framework structure diagram of the bilayer graph convolution learning network based on a constrained codec.

so the public representations learned from different modalities should be consistent during model training, and the average of the public representations over the modalities can be taken as the final modality-common representation; the features of the three modalities are used as input.

Figure 2. Framework structure diagram of the modality-specific representation and modality-common representation learning modules.

update matrix of the static GCN, characterizing the linear transformation from dimension $d$ to dimension $d_1$. Both are randomly initialized and updated by gradient descent during training. $A_s$ is shared by all the training samples in the dataset and is therefore able to capture global label correlation information.
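A minimal sketch of such a static GCN layer is given below, assuming the standard propagation rule $H = \sigma(A_s V W_s)$. The formula itself is not part of this passage, so the symbol `W_s` for the update matrix, the LeakyReLU activation with slope 0.2, and the sizes `d` and `d1` are illustrative assumptions. Only the shared correlation matrix $A_s$, the mapping from dimension $d$ to $d_1$, and the 19 label categories come from the text.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    """LeakyReLU activation (the choice of activation is an assumption)."""
    return np.where(x > 0, x, slope * x)

def static_gcn(V, A_s, W_s):
    """One static GCN layer.
    V:   (C, d)  category representations, one row per label
    A_s: (C, C)  label correlation matrix shared by all samples
    W_s: (d, d1) update matrix mapping dimension d to d1 (hypothetical name)
    """
    return leaky_relu(A_s @ V @ W_s)

rng = np.random.default_rng(0)
C, d, d1 = 19, 64, 32  # 19 label categories as in UVMCD; d and d1 are illustrative
V = rng.standard_normal((C, d))
A_s = 0.1 * rng.standard_normal((C, C))  # randomly initialized, then learned
W_s = 0.1 * rng.standard_normal((d, d1))
H = static_gcn(V, A_s, W_s)              # intermediate output, shape (C, d1)
```

Because `A_s` does not depend on the input sample, the same label graph is applied to every micro-video, which is what allows this layer to capture global rather than sample-specific label correlations.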

Figure 3. Framework structure diagram of the attention-driven double-layer graph convolutional network module.


Figure 6. (a) Mean average precision versus epoch. (b) Classification loss versus epoch.

and the scaling factor $d_{k2} = C$. $W_1 \in \mathbb{R}^{C \times C}$ and $W_2 \in \mathbb{R}^{C \times C}$ are the mapping matrices that transform the intermediate output $H'$ into a query matrix and a key matrix. $A_d$ is constructed dynamically from the current input sample, so it can better capture the label dependency relationships specific to that sample. $H'$ is the intermediate output $H$ after dynamic adjustment by the feature correlation matrix $F$, which enhances $H$ by introducing the correlation information of the samples in the feature dimension. $U_3 \in \mathbb{R}^{C \times C}$ is a linear transformation matrix. Finally, residual connections are added between the static GCN and the dynamic GCN to prevent network degradation.

Table 2. Performance comparison of different schemes on classification results.
