Dense Multi-Scale Graph Convolutional Network for Knee Joint Cartilage Segmentation

In this paper, we propose a dense multi-scale adaptive graph convolutional network (DMA-GCN) method for automatic segmentation of the knee joint cartilage from MR images. Under the multi-atlas setting, the suggested approach exhibits several novelties, as described in the following. First, our models integrate both local-level and global-level learning simultaneously. The local learning task aggregates spatial contextual information from aligned spatial neighborhoods of nodes, at multiple scales, while global learning explores pairwise affinities between nodes, located globally at different positions in the image. We propose two different structures of building models, whereby the local and global convolutional units are combined by following an alternating or a sequential manner. Secondly, based on the previous models, we develop the DMA-GCN network, by utilizing a densely connected architecture with residual skip connections. This is a deeper GCN structure, expanded over different block layers, thus being capable of providing more expressive node feature representations. Third, all units pertaining to the overall network are equipped with their individual adaptive graph learning mechanism, which allows the graph structures to be automatically learned during training. The proposed cartilage segmentation method is evaluated on the entire publicly available Osteoarthritis Initiative (OAI) cohort. To this end, we have devised a thorough experimental setup, with the goal of investigating the effect of several factors of our approach on the classification rates. Furthermore, we present exhaustive comparative results, considering traditional existing methods, six deep learning segmentation methods, and seven graph-based convolution methods, including the currently most representative models from this field. The obtained results demonstrate that the DMA-GCN outperforms all competing methods across all evaluation measures, providing DSC=95.71% and DSC=94.02% for the segmentation of femoral and tibial cartilage, respectively.


Introduction
Osteoarthritis (OA) is one of the most prevalent joint diseases worldwide, causing pain and mobility issues, reducing the ability to lead an independent lifestyle, and ultimately decreasing the quality of life in patients.It primarily manifests among populations of advanced age, with an estimated 10% of people over the age of 55 dealing with this condition.That percentage is likely to noticeably increase in the coming years, especially in the developing parts of the world where life expectancy is steadily on the rise [1].
Among the available imaging modalities, magnetic resonance imaging (MRI) constitutes a valuable tool in the characterization of the knee joint, providing a robust quantitative and qualitative analysis for detecting anatomical changes and defects in the cartilage tissue.Unfortunately, manual delineations performed by human experts are resource-and time-consuming, while also suffering from unacceptable levels of inter-and intrarater variability.Thus, there is an increasing demand for accurate and time-efficient fully automated methods for achieving reliable segmentation results.
During the past decades, a considerable amount of research has been conducted to achieve the above goals.However, the thin cartilage structure, as well as the great variability in and intensity inhomogeneity of MR images have posed significant challenges.Several methods are proposed to address those issues, ranging from more traditional image processing ones such as statistical shape models and active appearance models, to more automated ones employing classical machine learning and deep learning techniques.A comprehensive review of such automated methods for knee articular segmentation can be found in [2].

Statistical Shape Methods
A wide variety of methods that fall under the statistical shape model (SSM) and active appearance model (AAM) family have been extensively employed in knee joint segmentation applications in the past.Since the specific shape of the cartilage structure is quite distinct and characteristic, these methods employ this feature as a stepping stone towards complete delineation for the whole knee joint [3].Additionally, SSMs have been successfully utilized as shape regularizers within more complex segmentation pipelines, mainly as a final postprocessing step [4].While conceptually simple, these methods are highly sensitive to the initial landmark selection process.

Machine Learning Methods
Under the classical machine learning setting, knee cartilage segmentation is cast as a supervised classification task, estimating the label of each voxel from a set of handcrafted or automatically extracted features from the available set of images.Typical examples of such approaches can be found in [5,6].These methods are conceptually simple but usually offer mediocre results, due to their poor generalization capabilities and the utilization of fixed feature descriptors that may not be well suited to efficiently capture the data variability.

Multi-Atlas Patch-Based Segmentation Methods
Multi-atlas patch-based methods have long been a staple in medical imaging applications [7,8].Utilizing an atlas library A = {A i , L i } n A i=1 comprising n A magnetic resonance images (A i ) and their corresponding label maps L i , these methods operate on a single target image T at a time, annotating it by propagating voxel labels from the atlas library A. The implicit assumption in this framework is that the target image T and the corresponding images comprising the atlas library A reside in a common coordinate space.This assumption is enforced by registering all atlases A i ∈ A along with their corresponding labels maps L i to the target image space, via an affine or deformable transformation [9].
Multi-atlas patch-based methods usually consist of the following steps: For all voxels x ∈ T , a search volume N(x) of size N s = (n × n × n) centered around x is formed, and every corresponding voxel y ∈ N(x) in the spatially adjacent locations in the registered atlases yields a patch library P L = p A i (y), ∀y ∈ N(x) for all atlases i = 1, • • • , n A .An optimization problem such as sparse coding (SC) is then used to reconstruct the target patch as a linear combination of its corresponding atlas library.Established methods of this category of segmentation algorithms are presented in [7,8].Despite being capable of achieving appreciable results, these methods do not scale well to large datasets due to the intense computational demands of constructing a patch library and solving an optimization problem for each voxel of every target image.

Deep Learning Methods
The recent resurgence of deep learning has had a great impact on medical imaging applications, with an increasing number of works reporting the use of deep architectures in various applications pertaining to that field.Initially restricted to 2D models due to the large computational load imposed by the 3D structure of magnetic resonance images, the recent advancements in processing power have allowed for a wide variety of fully 3D models to be proposed, offering markedly better performance with respect to the more traditional methods [10,11].In our study, we compare our proposed method against a series of representative deep architectures, with applications ranging from semantic segmentation to point-cloud classification and medical image segmentation.In particular, we consider the SegNet [12], DenseVoxNet [13], VoxResNet [14], PointNet [15], CAN3D [13], and KCB-Net [16] architectures (Section 9.4.2).

Graph Convolutional Neural Networks
Recently, intensive research has been conducted in the field of graph convolutional networks, owing to their efficiency in handling non-Euclidean data [17].These models can be distinguished into two general categories, namely, the spectral-based [18,19] and the spatial-based methods [20][21][22][23][24].The spectral-based networks rely on the graph signal processing principles, utilizing filters to define the node convolutions.The ChebNet in [18] approximates the convolutional filters by Chebyshev polynomials of the diagonal matrix of eigenvalues while the GCN model in [19] performs a first-order approximation of ChebNet.
Spatial-based graph convolutions, on the other hand, update a central node's representation by aggregating the representations of its neighboring nodes.The message-passing neural network (MPNN) [20] considers graph convolutions as a message-passing process, whereby information is traversed between nodes via the graph edges.GraphSAGE [25] applies node sampling to obtain a fixed number of neighbors for each node's aggregation.A graph attention network (GAT) [24] assumes that the contribution of the neighboring nodes to the central one is determined according to a relative weight of importance, a task achieved via a shared attention mechanism across nodes with learnable parameters.
During the last years, GCNs have found extensive use in a diverse range of applications, including citation and social networks [19,26], graph-to-sequence learning tasks in natural language processing [27], molecular/compound graphs [28], and action recognition [29].Considerable research has been conducted on the classification of remotely sensed hyperspectral images [30][31][32], mainly due to the capabilities of GCNs to capture both the spatial contextual information of pixels, as well as the long-range relationships of distant pixels in the image.Another domain of application is the forecasting of traffic features in smart transportation networks [33,34].To capture the varying spatio-temporal relationships between nodes, integrated models are developed in these works, which combine graph-based spatial convolutions with temporal convolutions.
Finally, to confront the gradient vanishing effect faced by traditional graph-based models, deep GCN networks have recently been suggested [35,36].Particularly, in [35], a densely connected graph convolutional network (DCGCN) is proposed for graph-tosequence learning, which can capture both local and nonlocal features.In addition, Ref. [36] presents a densely connected block of GCN layers, which is used to generate effective shape descriptors from 3D meshes of images.

Outline of Proposed Method
The existing patch-based methods exhibit several drawbacks which can potentially degrade their segmentation performance.First, for each target voxel, these methods construct a local patch library at a specific spatial scale, comprising neighboring voxels from atlas images.Then, classifiers are developed by considering pairwise similarities between voxels in that local region.This suggests that target labeling is accomplished by relying solely on local learning while disregarding the global contextual information among pixels.Hence, long-range relationships among distant voxels in the region of interest are ignored, although these voxels may belong to the same class but with a different textural appearance.Secondly, previous methods in the field resort to inductive learning to produce voxel segmentation, which implies that the features of the unlabeled target voxels are not leveraged during the labeling process.Finally, some recent segmentation methods employ graph-based approaches allowing a more effective description of voxel pairwise affinities via sparse code reconstructions [8,37].Despite the better data representation, the target voxel labeling is achieved using linear aggregation rules for transferring the atlas voxel's labels, such as the traditional label propagation (LP) mechanism via the graph edges.Such first order methods may fail to adequately capture the full scope of dependencies among the voxel representations.The labeling of each voxel proceeds by aggregating spectral information strictly from its immediate neighborhood, failing to exploit long-term dependencies with potentially more similar patches in distant regions of the image, thus ultimately yielding suboptimal segmentation results.
To properly address the above shortcomings, in this paper we present a novel method for the automatic segmentation of knee articular cartilage, based on recent advances in the field of graph-based neural networks.More concretely, we propose the dense multiscale adaptive graph convolutional network (DMA-GCN) method, which constructively integrates local spatial-level learning and global-level contextual learning concurrently.Our goal is to generate, via automatic convolutional learning, expressive node representations by merging pairwise importance at multiple spatial scales with long-range dependencies among nodes for enhanced volume segmentation.We approach the segmentation task as a multi-class classification problem with the five classes: background: 0; femoral bone: 1; femoral cartilage: 2; tibial bone: 3; tibial cartilage: 4. Recognizing the more crucial role of the cartilage structure in the assessment of the knee joint and considering the increased difficulty for its automatic segmentation as contrasted with that of bones, our efforts are primarily devoted to that issue.Figure 1 depicts a schematic framework of the proposed approach.The main properties and innovations of the DMA-GCN model are described as follows: Multi-atlas setting: Our scheme is tailored to the multi-atlas approach, whereby label information from atlas images (labeled) is transferred to segment the target image (unlabeled).To this end, at the preliminary stage, images are aligned using a cost-effective affine registration.Subsequently, for each target image T , we generate its corresponding atlas library {A, L} according to a similarity criterion (Figure 1a).
Graph construction: This part refers to the way in which images are represented in terms of nodes and the organization of node data to construct the overall graph (Figure 1b).Here, the graph node corresponds to a generic patch of size 5 × 5 × 5 around a central voxel, while the node feature vector is provided by a 3D-HOG feature descriptor.Accordingly, the image is represented as a collection of spatially stratified nodes, covering adequately all classes across the region of interest.Following the multi-atlas setting, we construct sequences of aligned data, comprising target nodes and those for the atlases at spatially correspondent locations.Given the node sequences, we further generate the sequence libraries which are composed of neighboring nodes at various spatial scales.The collection of all node libraries forms the overall graph structure, whereby both local (spatially neighboring) and global (spatially distant) node relationships are incorporated.
Semi-supervised learning (SSL): Following the SSL scenario, the input graph data comprise both labeled nodes from the atlas library and unlabeled ones from the target image to be segmented.In that respect, contrary to some existing methods, the features of unlabeled data are leveraged via learning to compute the node embbeddings and label the target nodes.
Local-global learning: As can be seen from Figure 1c, graph convolutions over the layers proceed along two directions, namely, the local spatial level and the global level, respectively.The local spatial branch includes the so-called local convolutional (Lconv) units which operate on the subgraph of aligned neighborhoods of nodes (sequence libraries).The node embeddings generated by these units incorporate the contextual information between nodes at a local spatial level.To further improve local search, we integrate local convolutions at multiple scales, so that the local context around nodes is captured more efficiently.On the other hand, the global branch includes global convolution (Gconv) units.These units provide the global node embeddings by taking into consideration the pairwise affinities of distant nodes distributed over the entire region of interest of the cartilage volume.The final node representations are then obtained by aggregating the embeddings computed at the local spatial and the global levels, respectively.
Convolutional building models: An important issue is how the local and global hidden representations of nodes are combined across the convolution layers.In this context, we propose two different structures: the cross-talk building model (CT-BM) and the sequential building model (SEQ-BM).Both models comprise four convolutional units overall, specifically, two Lconv and two Gconv units, undertaking local and global convolutions, respectively.The CT-BM (Figure 1c) performs intertwined local-global learning, with skip connections and aggregators.The links indicate the cross-talks between the two paths.The SEQ-BM, on the other hand, adopts a sequential learning scheme.In particular, local spatial learning is completed first, followed by the respective convolutions at the global level.
Adaptive graph learning (AGL): Considering fixed graphs with predetermined adjacency weights among nodes can degrade the segmentation results.To confront this drawback, every Lconv and Gconv unit is equipped here with an AGL mechanism, which allows us to automatically learn the proper graph structure at each layer.At the local spatial level, AGL adaptively designates the connectivity relationships between nodes via learnable attention coefficients.Hence, Lconv can concentrate and aggregate features from relevant nodes in the local search region.Further, at the global level we propose a different AGL scheme for Gconv units, whereby graph edges are learned from the input features of each layer.
Densely connected GCN: The proposed CT-BM and SEQ-BM can be utilized as standalone models to undertake the graph convolution task.Nevertheless, their depth is confined to two local-global layers, since an attempt to deepen the networks is hindered by the gradient-vanishing effect.To circumvent this deficiency, we finally propose a densely connected convolutional network, the DMA-GCN model.The DMA-GCN considers CT-BM or SEQ-BM as the building block of the deep structure.It exhibits a deep architecture with skip connections whereby each layer in the block receives feature maps from all previous layers and transmits its outputs to all subsequent layers.Overall, the DMA-GCN shares some salient qualities, such as a deep structure with an enhanced performance rate and better information flow, local-global level convolutions, and adaptive graph learning.
In summary, the main contributions of this paper are described as follows.The remainder of this paper is organized as follows.Section 2 reviews some representative forms of graph convolutional networks related to our work and involved in the experimental analysis.Section 3 presents the image preprocessing steps and the atlas selection process.In Section 4, we discuss the node feature descriptor, as well as the graph construction of the images.Section 5 elaborates on the proposed local and global convolutional units, along with their attention mechanisms.Section 6 describes the suggested convolutional building blocks, while Section 7 presents our densely connected network.Section 8 discusses the transductive vs. inductive learning and the full-batch vs. mini-batch learning in our approach.In Sections 9 and 10, we provide the experimental setup and respective comparative results of the proposed methodology, while Section 11 concludes this study.

Related Work
In this section, we review some representative models in the field of graph convolutional networks that are related to our work and are also included in the experiments.

Definition 1.
A graph G is defined as G = (V, E , X) where V denotes the set of N nodes v i ∈ V, and E is the set of edges connecting the nodes (v i , v j ) ∈ E .X ∈ R N×F is a matrix subsuming the node feature descriptors x i ∈ R D , i = 1, . . .N with F denoting the feature vector dimensionality.The graph is associated with an adjacency matrix A ∈ R N×N (binary or weighted), which includes the connection links between nodes.A larger entry A ij > 0 suggests the existence of a strong relationship between nodes (v i , v j ), while A ij = 0 signifies the lack of connectivity.The graph Laplacian matrix L ∈ R (N×N) is defined as L = D − A, where D is the diagonal degree matrix with D ii = ∑ N j=1 A ij , i = 1, . . ., N. Finally, the normalized graph adjacency matrix with the added self-connections is denoted by A = A + I N , with the corresponding degree matrix given by The GCN proposed in [19] is a spectral convolutional model.It tackles the node classification task under the semi-supervised framework, i.e., where labels are available only for a portion of the nodes in the graph.Under this setting, learning is achieved by enforcing a graph Laplacian regularization term with the aim of smoothing the node labels: where L 0 represents the supervised loss measured on the labeled nodes of the graph, f (X, A) is a differentiable function implemented by a graph neural network, λ is a regularization term balancing the supervised loss in regard to the overall smoothness of the graph, X is the node feature matrix, and L is the graph Laplacian.
In a standard multilayer graph-based neural network framework, information flows across the nodes by applying the following layerwise propagation rule: where σ(•) denotes the LeakyReLU(•) activation function, W (l) is a layer-specific trainable weight matrix, and H (l) is the matrix of activation functions in the lth layer, with H (0) = X.The authors in [19] show that the propagation rule in Equation ( 3) provides a first-order approximation of localized spectral filters on graphs.Most importantly, we can construct multilayered graph convolution networks by stacking several convolutional layers of the form in Equation (3).For instance, a two-layered GCN can be represented by where Z is the network's output, so f tmax(•) is the output layer activation function for multi-class problems, and is the normalized adjacency matrix.The weight matrices W (0) and W (1) are trained using some variant of gradient descent with the aid of a loss function.

Graph Attention Network (GAT)
A salient component in GATs [24] is an attention mechanism incorporated in the aggregation of the graph attention layers (GALs), with the aim to automatically capture valuable relationships between neighboring nodes.Let the inputs and outputs of a GAL, where N is the number of nodes, while D and D are the corresponding dimensionalities of the node feature vectors.The convolution process entails three distinct issues: the shared node embeddings, the attention mechanism, and the update of node representations.As an initial step, a learnable transformation parameterized by the weight matrix W ∈ R D×D is applied on nodes, with the goal of producing expressive feature representations.Next, for every node pair, a shared attention mechanism is performed on the transformed features, where g ij signifies the importance between nodes h i and h j , α is a learnable weight vector, and ∥ denotes the concatenation operator.To make the above mechanism effective, the computation of the attention coefficients is confined between each node −i and its neighboring nodes −j, j ∈ N i .In the GAT framework, the attention mechanism is implemented by a single-layer feed-forward neural network, parameterized by LeakyReLU nonlinearities, which provide the normalized attention coefficients: Given the attention coefficients, the node feature representations at the output of the GAT are updated via a linear aggregation of neighboring nodes' features To stabilize the learning procedure, the previous approach is extended in a GAT by considering multiple attention heads.In that case, the node features are computed by where K is the number of independent attention heads applied, while α ij and W (k) denote the normalized attention coefficients associated with the kth attention head and its corresponding embedding matrix, respectively.In this work, we exploit the principles of the GAT-based attention mechanism in the proposed local convolutional units, with the goal to aggregate valuable contextual information from local neighborhoods, at multiple search scales (Section 5.1).

GraphSAGE
The GraphSAGE network in [25] tackles the inductive learning problem, where labels must be generated for previously unseen nodes, or even entirely new subgraphs.Graph-SAGE aims to learn a set of aggregator functions AGGR k , k = 1, . . ., K, which are used to aggregate information from each node's local neighborhood.Node aggregation is carried out at multiple spatial scales (hops).Among the different schemes proposed in [25], in our experiments, we consider the max-pooling aggregator, where each neighbor's vector is independently supplied to a fully connected neural network: where max(•) denotes the element-wise max operator, W pool is the weight matrix of learnable parameters, b is the bias vector, and σ(•) is a nonlinear activation function.Further, h k N (v) denotes the result obtained after a max-pooling aggregation on the neighboring nodes of node v. GraphSAGE then concatenates the current node's representation h k−1 v with the aggregated neighborhood feature vector h k−1 N (v) to compute, via a fully connected layer, the updated node feature representations: where W k is a weight matrix associated with aggregator AGGR pool k .

GraphSAINT
GraphSAINT [21] differs from the previously examined architectures in that instead of building a full GCN on all the available training data, it samples the training graph itself, creating subsets of the original graph, building and training the associated GCNs on those subgraphs.For each mini-batch sampled in this iterative process, a subgraph G s = (V s , E s ) (where |V s | ≪ |V|) is used to construct a GCN.Forward and backward propagation is performed, updating the node representations and the participating edge weights.An initial preprocessing step is required for the smooth operation of the process, whereby an appropriate probability of sampling must be assigned to each node and edge of the initial graph.

Materials
In this section, we present the dataset used in this study, the image preprocessing steps, and finally, the construction of the atlas library.

Image Dataset
The MR images used in this study comprise the entirety of the publicly available, baseline Osteoarthritis Initiative (OAI) repository, for which segmentation masks are available, consisting of a total of 507 subjects.The specific MRI modality utilized across all the experiments corresponds to the sagittal 3D dual-echo steady-state (3D-DESS) sequence with water excitation, with an image size of 384 × 384 × 384 voxels and a voxel size of 0.36 × 0.36 × 0.70 mm.The respective segmentation masks serving as the ground truth are provided by the publicly available repository assembled by [38], including labels for the following knee joint structures (classes): background tissue, femoral bone (FB), femoral cartilage (FC), tibial bone (TB), and tibial cartilage (TC).Figure 2 showcases a typical knee MRI, in the three standard orthogonal planes (sagittal, coronal, axial).

Image Preprocessing
The primary source of difficulties in automated cartilage segmentation stems from the similar texture and intensity profile of articular cartilage and background tissues, as they are depicted in most MRI modalities, a problem further accentuated by the usually high intersubject variability present in the imaging data.To this end, the images were preprocessed by applying the following steps: 1.
Curvature flow filtering: A denoising curvature flow filter [39] was applied, with the aim of smoothing the homogeneous image regions, while simultaneously leaving the surface boundaries intact.

2.
Inhomogeneity correction: N3 intensity nonuniform bias field correction [40] was performed on all images, dealing with the issue of intrasubject variability within similar classes among subjects.

3.
Intensity standardization: MRI histograms were mapped to a common template, as described in [41], ensuring that all associated structures across the subjects shared a similar intensity profile.

4.
Nonlocal-means filtering: A final filtering process smoothed out any leftover artefacts and further reduced noise.The method presented in [42] offers a robust performance and is widely employed in similar medical imaging applications.Finally, the intensity range of all images was rescaled to [0, 100].

Atlas Selection and ROI Extraction
The construction of an atlas library for each target image T to be segmented necessitates the registration of all atlases {A i } n A i=1 to the particular target image.An affine transformation was employed, registering all atlas images in the target image domain space, accounting for deformations of linear nature, such as rotations, translations, shear-ing, scaling, etc.The same transformation was also applied to the corresponding label map L i of each atlas, resulting in the atlas library {A T i , L T i } N A i=1 registered to T .Considering the fact that the cartilage volume accounts for a very small percentage of the overall image volume, a region of interest (ROI) was defined for every target image, covering the entire cartilage structure and its surrounding area.A presegmentation mask was constructed by passing the registered atlas cartilage mask through a majority voting (MV) filter, and then expanded by a binary morphological dilation filter, yielding the ROI for the target image.This region corresponded to the sampling volume for the target image T and its corresponding atlas library {A i , L i } N A i=1 .This process guaranteed that the selected ROI enclosed the totality of cartilage tissue both in the target T , as well as in the corresponding atlas library.
Finally, to simultaneously reduce the computational load and increase the spatial correspondence between target and atlas images, we included a final atlas selection step.Measuring the spatial misalignment in the ROI of every pair {T , A i } T i=1 using the mean squared difference (MSD ROI i ), we only kept the first N A atlases exhibiting the least disagreement in the metric [37].

Graph Constructions
In this section, we describe the node representation, the construction of aligned sequences of nodes, and the sequence libraries, which lead to the formation of the aligned image graphs used in the convolutions.

Node Representation
An important issue to properly address is how an image is transformed to a graph structure of nodes.In our setting, a node was described by a generic 5 × 5 × 5 patch p i = p(x i ) surrounding a central voxel x i .The image was then represented by a collection of nodes which were spatially distributed across the ROI volume.
Each node was described by a feature vector x i = f enc (p i ) ∈ R 20 implemented via HOG descriptors [43], which aggregated the local information on the node patch.HOG descriptors constitute a staple feature descriptor in image processing and recognition.Here, we applied a modification suitable for operating on 3D data [44].For each voxel x i , we extracted an HOG feature description by computing the gradient magnitude and direction along the x − y − z axes for each constituent voxel in the node patch.The resulting values were binned to a (q = 20)-dimensional feature vector where each entry corresponded to the vertex of a regular icosahedron, with each bin representing the strength of the gradient along that particular direction.
Finally, each node was associated with a class indicator vector y i = [y i 1 , . . ., y i,c ] ∈ R c , where y i,c = 1 if voxel x i belongs in class i and 0 otherwise.

Aligned Image Graphs
The underlying principle of the multi-atlas approach is that the target image T and the atlas library A T i are aligned via affine registration, thus sharing a common coordinate space.This allows the transfer of label information from atlas images towards the target one by operating upon sequences of spatially correspondent voxels.Complying with the multiatlas setting, we applied a two-stage sampling process with the aim to construct a sequence of aligned graphs, involving the target image and its respective atlases.This sequence contained the so-called root nodes which were distinguished from the neighboring nodes introduced in the sequel.

1.
Target graph construction: This step used a spatially stratified sampling method to generate an initial set of target voxels X T r ∈ R n T ×D , where D denotes the feature dimensionality.To ensure a uniform spatial covering of all classes in the target ROI, we performed a spatial clustering step partitioning all contained voxels into n T r clusters.After interpolating the cluster centers to the nearest grid point, we obtained the global dataset X T r = {x T r (i), i = 1, . . ., n T r }, which defined a corresponding target graph of root nodes G T r .These target nodes served as reference points from which the aligned sequences were subsequently generated.

2.
Sequences of aligned data: For each x T r (i) ∈ X T r , we defined a sequence of aligned nodes S(i), containing the target node x T r (i) and its respective nodes from the atlas library, located at spatially correspondent positions: The entire global dataset of root nodes, containing all sampled target nodes and their associated atlas ones, was defined as the union of all those sequences where X r ∈ R(N r × D) contains a total number of N r = n T • (n A+1 ) root nodes, while X T r and X A r denote the datasets of root nodes sampled from the target and atlases, respectively.Accordingly, this led to a sequence of aligned graphs }, which is schematically shown in Figure 3.In this figure, we can distinguish two modes of pairwise relationships among root nodes that should be explored.Concretely, there are local spatial affinities across the horizontal axis between nodes belonging to a specific node sequence.On the other hand, there also exist global pairwise affinities between nodes of each image individually, as well as between nodes belonging to different images in the sequence.The latter type of search ensures that nodes of the same class located at different positions in the ROI volume and with different textural appearance are taken into consideration, thus leading to the extraction of more expressive node representations of the classes via learning.

Target image
Atlas images

Sequence Libraries
It should be stressed that the cost-effective affine registration used in our method is not capable of coping with severe image deformations.Hence, it cannot provide sufficiently accurate alignment between the target and the atlases.To account for this deficiency, we expanded the domain of local search by considering neighborhoods around nodes.Specifically, for each node x i , we defined multihop neighborhoods at multiple scales: for s = 1, . . ., S, where R s (x i ) denotes the neighborhood at scale s, and S is the number of scales used.R 0 (x i ) = x i corresponds to the basic patch 5 × 5 × 5 of the node itself.R 1 (x i ) and R 2 (x i ) are the 1-hop and 2-hop neighborhoods delineated as 9 × 9 × 9 and 13 × 13 × 13 volumes around x i , respectively.In our experiments, we considered two different spatial scales (S = 2).Figure 4 illustrates the different node neighborhoods.Next, for each sequence S(i we created the corresponding sequence libraries by incorporating the local neighborhoods of all root nodes belonging to that sequence.The sequence library SL s (i) at scales s = 0, 1, • • • , S was defined by SL s (i) contains (n A + 1) • |R s | nodes, where |R s | denotes the size of the spatial neighborhood at scale s. Figure 5 provides a schematic illustration of a sequence library.The collection of all SL s forms a global dataset X s of aligned neighborhoods described as follows:

Target image
for s = 1, . . ., S. X s is formed as union of the dataset X r of root nodes and the dataset N s comprising their neighboring nodes at scale s.
The corresponding graph G comprises an overall total number of N = |N r | + |N S | nodes, including the root and neighboring nodes.

Convolutional Units
This section elaborates on the basic convolutional units, namely, the local convolutional unit Lconv and the global convolutional unit Gconv which serve as structural elements to devise our proposed models (Figure 6).

Local Convolutional Unit
The local convolutional unit undertakes the local spatial learning task, operating horizontally along the sequence libraries (SL s ) of nodes.Instead of confining ourselves to predefined and fixed weights in the graphs, we opted to apply a local attention mechanism to adaptively learn the graph structure information at each layer.Specifically, we used the attention approach suggested in the GAT as a means to capture the local contextual relationships among nodes in the search area.
A functional outline of Lconv at layer l is shown in Figure 7.It receives an input in from the previous layer and provides its output o denote the dimensionalities of the input and output node features, respectively.
Figure 7 provides a detailed architecture of Lconv.The model involves S sub-modules, each one associated with a specific spatial scale of aggregation.The sub-module s acts upon the subgraph G s (V s , E s ), s = 0, . . ., S, which subsumes the sequence libraries SL s of root nodes.Its input is in and after a local convolution at scale s, it provides its own output o .In this context, the structure of G s is adapted to the local attention mechanism.Let us assume that a root node x i belongs to the qth sequence library: x i ∈ SL s (q), q = 1, . . ., n A .Then, node x i pays attention to two pools of neighboring nodes (Figure 3): (a) it aggregates relevant feature information from nodes x j of its own neighborhood, x j ∈ R s (x i ) (self-neighborhood attention); (b) it aggregates features of nodes belonging to the other aligned neighborhoods in SL s (q) pertaining to the atlas images: x j ∈ SL s (q) \ R s (x i ).For these pairs of nodes, we compute normalized attention coefficients using Equation (6).Further, pairwise affinities between nodes belonging to different sequence libraries are disregarded, i.e., α ij = 0 when x i ∈ SL s (q) and x j ∈ SL s (p), p ̸ = q.It should be noticed that we are primarily focused on computing comprehensive feature representations of the root nodes.Nevertheless, neighboring nodes are also updated; however, in this case, the attention is confined to the neighborhood of the root node it belongs to.For convenience, let us consider the input node features of the form . The local-level convolution at scale s of a root node x i ∈ SL s (q), i = 1, . . ., N r is obtained by: q (l) The first term in the above equation refers to the self-neighborhood attention, which aggregates node features from R s (x i ) within the same image.Moreover, the second term aggregates node information from the other aligned neighborhoods in the sequence.In an attempt to stabilize the learning process and further enhance the local feature representations, we followed a multi-head approach, whereby K independent attention mechanisms are applied.Accordingly, the node convolutions proceed as follows: where α (k) ij denote the attention coefficients between nodes x i and x j according to the kth attention head, while are the corresponding parameter weights used for node embeddings.The attention parameters are shared across all nodes in G s (V s , E s ) and are simultaneously learned at each layer l and for each spatial scale, individually.The outputs of the different sub-modules are finally aggregated to yield the overall output of the Lconv unit: where ⊕ denotes the concatenation operator.
The multilevel attention-based aggregation of valuable contextual information from sequence libraries offers some noticeable assets to our approach: (a) it acquires comprehensive node representations which assist in producing better segmentation results, (b) the graph learning circumvents the inaccuracies caused by affine registration in severe image deformations which may lead to node misclassification.

Global Convolutional Unit
The global convolutional unit conducts the global convolution task; it acts upon the subgraph G r (V r , E r ) which includes the sequence of root nodes S(i).Gconv aims at exploring the global contextual relationships among nodes located at different positions in the target image and the atlases (Figure 3).Accordingly, we established in E r suitable pairwise connective weights according to spectral similarity A ij ̸ = 0 for nodes x i ∈ S(p), x j ∈ S(q), p ̸ = q.Node pairs belonging to the same sequence are processed by Lconv units; hence, they are disregarded in this case.
The global convolution at layer l is acquired using the spectral convolutional principles in GCN: where S (l−1) ∈ R (N×F (l) in ) and H (l−1) ∈ R (N×F (l) o ) denote the input and output of Gconv, respectively, whereas F g is the learnable embedding matrix and A (l) is the adjacency matrix, as defined in Section 2.1.
Similar to Lconv, we also incorporated the AGL mechanism to Gconv, so that global affinities could be automatically captured at each layer via learning.More concretely, we applied an adaptive scheme whereby the connective weights between nodes are determined from the module's input signals [45].The adjacency matrix elements were computed by: where S (l−1) = BN (S (l−1) ) is obtained after applying batch-normalization to the inputs, σ(•) is the sigmoidal activation function applied on an element-wise operation, and W ϕ is the embedding matrix to be learned, shared across all nodes of G r (V r , E r ).The adaptation scheme in Equation ( 21) assigns greater edge values between nodes with high spectral similarity and vice-versa.In the descriptions above, we considered the GCN model equipped with AGL as a baseline scheme.Nevertheless, in our experimental investigation, we examined several scenarios whereby the global convolution task was tackled using alternative convolutional models, including GraphSage, GAT, GraphSAINT, etc.

Proposed Convolutional Building Blocks
In this section, we present two alternative building models, namely, the cross-talk building model (CT-BM) and the sequential building model (SEQ-BM).They are distinguished according to the way the local and global convolutional units are blended across the layers.Every constituent local and global unit within the structures has its individual embedding matrix of learnable parameters.Further, it is also equipped with its own AGL mechanism for adaptive learning of the graphs, as described in the previous section.

Cross-Talk Building Model (CT-BM)
The CT-BM is shown in the outline of our approach in Figure 1.Nevertheless, a more compact form is depicted in Figure 8.The model comprises two composite layers (l = 1, 2), each one containing one local and one global unit.As can be seen, convolutions proceed in an alternating manner across the layers, whereby the local unit transmits its output to the next global unit, and vice versa.A distinguishing feature of this structure is that there are also skip connections and aggregators which implement cross-talk links between the local and global components.Particularly, in addition to the standard flow from one unit to the next, each unit's output is aggregated with the output of the subsequent unit.The overall workflow of the CT-BM is outlined below: 1.

Sequential Building Model (SEQ-BM)
The architecture of the SEQ-BM is illustrated in Figure 9.This model also contains two local and two global units.Contrary to the CT-BM, convolutions are conducted in the SEQ-BM sequentially.Concretely, the local learning task is first completed using the first two local convolutional units.The outputs of this stage are then transmitted to the subsequent stage which accomplishes the global learning task, using the two global convolutional units.The overall output of the SEQ-BM is formed by aggregating the resulting local and global features of the two stages.
The workflow of the SEQ-BM is outlined as follows: 1.

3.
The final output of the SEQ-BM is obtained by The alternating blending of the CT-BM provides a more effective integration between local and global features at each layer, individually, as compared to the sequential combination in SEQ-BM.This observation is attested experimentally as shown in Section 10.

Proposed Dense Convolutional Networks
In this section, we present two variants of our main model, the DMA-GCN network.The motivation behind this design is based on an attempt to further expand the structures CT-BM and SEQ-BM by including multiple layers of convolutions, to face the gradient vanishing effect, where the gradients diminish, thus hindering effective learning or even worsening the results.This is the reason why in the above building blocks, we are restricted to two-layered local-global convolutions.
To tackle this problem, we resorted to the recent advancements in deep GCNs [36] and developed the DMA-GCN model with a densely connected convolutional architecture utilizing residual skip connections, as shown in Figure 10.As can be seen, the model consists of several blocks arranged across M layers of block convolutions.These blocks are implemented by either CT-BM or SEQ-BM described in Section 5, which leads to two different alternative configurations, the DMA-GCN(CT-BM) and DMA-GCN(SEQ-BM), denote the input and output of the ith block, i = 1, . . ., M, with out being the feature dimensionalities, respectively.The properties of the suggested DMA-GCN are discussed in the following: 1.
The skip connections interconnect the blocks across the layers.Concretely, each block receives as input the outputs of blocks from all preceding layers: for i = 1, . . ., M, where ⊕ denotes the concatenation operator.This allows the generation of deeper GCN structures which can acquire more expressive node features.Overall, the DMA-GCN involves 4M convolutional units.Within each block, two layers of local-global convolutions are internally performed; the resulting outputs are then integrated along the block layers to provide the final output: The other beneficial effect of skip connections is that they allow the final output to have direct access to the outputs of all blocks in the dense network.This assures a better reverse flow of information and facilitates the effective learning of parameters pertaining to the blocks.Since block operations are confined to two-layered localglobal convolutions, overall, we can circumvent the gradient vanishing problem.

3.
In order to preserve the parametric complexity at a reasonable level, similar to [36], we define the feature dimensions of each block in DMA-GCN to be the same: The node feature growth rate caused by the aggregators can be defined as The input dimensions grow linearly as we proceed to deeper block layers, with the last block showing the largest increase To prevent feature dimensionalities from receiving too large values, we considered initially a DMA-GCN model with M = 4 blocks.The particular number of blocks in the above range was then decided after experimental validation (Section 10).

4.
Every block in DMA-GCN is supported with its corresponding AGL process to automatically learn the graph connective affinities at each layer.This is accomplished by applying an attention-based mechanism for local convolutional units (Section 5.1) and an adaptive construction of adjacency matrices from inputs node features (Section 5.2).The output of the DMA-GCN is fed to a two-Layer MLP unit to obtain the label estimates We adopted the cross-entropy error to penalize the differences between the model's output Z o and the corresponding node labels where Y L denotes the subset of labeled nodes.The DMA-GCN network was trained under either the transductive or the inductive learning methods, as discussed in the following section.

Network Learning
In our setting, we considered transductive learning (SSL) as the basic learning scheme for training the DMA-GCN models.In this case, both unlabeled data from the target image to be segmented T and the labeled data from the atlases LB(T ) were used for the construction of the model.Nevertheless, in the experiments, we also investigated the inductive (supervised) learning scenario, whereby training was conducted by solely using labeled data from the atlases.

Transductive Learning (SSL)
The SSL scheme was adapted to the context of image segmentation task elaborated here.In regard to the data used in the learning, SSL can be carried out along two different modes of operation, namely, mini-batch learning and full-batch learning.Next, we detail mini-batch learning and then conclude with full-batch learning, which is a special case of the former one.
Mini-batch learning was implemented by following a three-stage procedure.In stage 1, an initial model was learned and used to label an initial batch of data from T .Stage 2 was an iterative process, where out-of-sample batches were sequentially sampled from T and labeled via refreshing learning.Finally, stage 3 labeled the remaining voxels of the target image using a majority voting scheme.
Stage 1: Learning.In this stage, (t = 0), we started by sampling an initial unlabeled batch X T r (0) from the target image.Then, we used the different steps detailed in Section 4 to construct the corresponding graph of nodes.(a) Given X T r (0), we cre- ated the corresponding aligned sequences (Section 4.3), giving rise to the dataset of root nodes X r (0) = X T r (0), X A r (0) .(b) Next, we incorporated neighborhood information by generating sequence libraries at multiple scales (Section 4.3), leading to the datasets X s (0) = X r (0) ∪ N s (0), s = 1, . . .S, where N s (0) denotes the neighboring nodes.(c) Finally, we considered the overall dataset X(0) = X S (0) that corresponded to the graph G 0 (V 0 , E 0 ) containing the root and neighboring nodes at that stage.
The next step was to perform convolutional learning on graph G 0 (V 0 , E 0 ) using DMA-GCN models.Upon completion of the training process, we accomplished the labeling of F (0) denotes the model's functional mapping, l(•) is the labeling function of the target nodes, and W (0) stands for the network's weights, including the learnable parameters of embedding matrices and attention coefficients, across all layers of the DMA-GCN.Stage 2: iterative learning.This stage followed an iterative procedure, t = 1, • • • , T, whereby at each iteration, out-of-sample batches of yet unlabeled nodes were sampled from T , X T o (t) of size n o .Considering the nodes in X T o as root nodes, we then applied steps (a)-(c) of the previous stage, to obtain the datasets X o,s = X o (0) ∪ N o,s (0), s = 0, . . ., S, and the overall set X o (t) = X o,S (t), which corresponded to a graph of out-ofsample nodes.In the following, data X o (t) were fed to the pretrained model from stage 1, l(X T o ) = F (t) ( X o (t), W (t)).The model was initialized as W (t) = W 0 to preserve previously acquired knowledge.Further, it was subject to several epochs of refreshing convolutional learning, with the aim of adapting to the newly presented data.The above sequential process terminated at t = T when all target nodes were labeled.
Stage 3: labeling of remaining voxels.This was the final stage of target image segmentation, entailing the labeling of target voxels not considered during the previous learning stages.Given that nodes were the central voxels of a generic 5 × 5 × 5 patches, there were multiple remaining voxels scattered within 3 × 3 × 3 volumes.Labeling of these voxels was accomplished by a voting scheme.Specifically, for each x r , the voting function accounted for both the spectral and the spatial distance from its surrounding labeled vertices: where w spec r,i and w spat r,i are normalized weighting coefficients denoting the spectral and spatial proximity, measured by the l 2 norm (Euclidean distance) and l 1 norm (Manhattan distance), respectively.
Full-batch learning is a special case of the above mini-batch learning.In that case, the dataset X T r (0) is a large body of data, comprising all possible target nodes contained in the target ROI.Under this circumstance, the iterative stage 2 is disregarded.Full-batch learning is completed after convolutional learning (stage 1), followed by the labeling of the rest of target voxels (stage 3).

Inductive Learning
Under this setting, the target data remain unseen during the entire training phase.Adapting to the multi-atlas scenario, we devised a supervised learning scheme according to the following steps.(a) For each target image T , we selected the most similar labeled image T = N N (T ) from atlases, where N N (•) is a spectral similarity function used to identify the nearest neighbors of T .(b) The image T along with its corresponding atlas library LB( T ) were used to learn a supervised model F ind ( T ) by applying exactly stage 1 of the previous subsection.(c) Finally, the target image T was labeled by means of F ind ( T ).As opposed to the SSL scenario, the critical difference was that the developed model relied solely on labeled data, disregarding target image information.

Evaluation Metrics
The overall segmentation accuracy achieved by the proposed methods was evaluated using the following three, standard volumetric measures: the Dice similarity coefficient (DSC), the volumetric difference (V D), and the volume overlap error (V OE ).Denoting Y the ground truth labels and Y the estimated ones, the above measures are defined as: Taking into account that the large majority of voxels correspond to either the background class or the two bone classes, we opted to also include the precision and recall classification measures, to better evaluate the segmentation performance on each individual structure.All measures correspond exclusively to the image content delineated by the respective ROI of each evaluated MRI.

Hyperparameter Setting and Validation
The overall performance of the proposed method depends on a multitude of preset parameters, the most prominent of which are the number of atlases N A comprising the atlas library {A, L} i , i = 1, . . ., N A , the number of heads K utilized in the multi-head attention mechanism, and the number of scales S corresponding to the different neighborhood scales.The optimal values of the above hyperparameters, as well as the performance of the DMA-GCN(SEQ) and DMA-GCN(CT) segmentation methods, were evaluated through a 5-fold cross-validation.

Experimental Test Cases
The proposed methodology comprises several components affecting its overall capacity and performance.We hereby present a series of experimental test cases, aiming to shed light on those effects.

1.
Local vs. global learning: In this scenario, we aimed to observe the effect of performing local-level learning in addition to global learning.The goal here was to determine the potential boost in performance facilitated by the inclusion of the attention mechanism in our models.2.
Transductive vs. inductive learning: The goal here was to ascertain whether the increased cost accompanying the transductive learning scheme could be justified in terms of performance, as compared to the less computationally demanding inductive learning.

3.
Sparse dense adjacency matrix: Here, we examined the effect of progressively sparsifying the adjacency matrix A (l) at each layer on the overall performance.We examined the following cases: (1) the default case with a dense A (l) and (2) thresholding A (l) so that each node was allowed connections to 5, 10, or 20 spectrally adjacent ones.

4.
Global convolution models: Finally, we tested the effect of varying the design of the global components by examining some prominent architectures, namely, GCN, ClusterGCN, GraphSAINT, and GraphSAGE

Competing Cartilage Segmentation Methods
The efficacy of our proposed method was evaluated against several published works dealing with the problem of automatic knee cartilage segmentation.

Patch-Based Methods
The patch-based sparse coding (PB SC ) [8] and patch-based nonlocal-means (PB NLM ) [7] methods are two state-of-the art approaches in medical image segmentation.For consistency reasons, similar to the DMA-GCN, we set the patch size for both these methods to (5 × 5 × 5) and the corresponding search volume size to (13 × 13 × 13).The remaining parameters were taken as described in their respective works.

Deep Learning Methods
Here, we opted to evaluate the DMA-GCN against some state-of-the-art deep learning architectures that were successfully applied in the field of medical image segmentation.
SegNet [12]: A convolutional encoder-decoder architecture, utilizing the state-of-theart VGG16 [46] network that is suitable for pixelwise classification.Since we are dealing with 3D data, we split each input image patch into its constituent planes (−xy, −xz, −yz) and fed those to the network.
DenseVoxNet [13]: A convolutional network proposed for cardiovascular MRI segmentation.It comprises a downsampling and upsampling sub-component, utilizing skip connections from each layer to its subsequent ones, enforcing a richer information flow across the layers.In our experiments, the model was trained using the same initialization scheme for the parameters' values as described in the original paper.
VoxResNet [14]: A deep residual network comprising a series of stacked residual modules, each one performing batch-normalization and convolution, also containing skip connections from each module's input to its respective output.
KCB-Net [16]: A recently proposed network that performs cartilage and bone segmentation from volumetric images, by utilizing a modular architecture, where initially, each one of the three sub-components is trained to process a separate plane (sagittal, coronal, axial), followed by a 3D component with the task of aggregating the respective outputs into a single overall segmentation map.
CAN3D [47]: This network utilizes a successively dilated convolution kernel aiming to aggregate multi-scale information by performing feature extraction within an increasingly dilating receptive field, facilitating the final voxelwise classification in full resolution.Additionally, the loss function employed at the final layer consists of a combination of Dice similarity coefficient DSC and DSF , a variant of the standard DSC used in evaluating segmentation results.
Point-Net [15]: Point-Net is a recently proposed architecture specifically geared towards point-cloud classification and segmentation.As an initial preprocessing step, it incorporates a spatial transformer network (STN) [48] that renders the input invariant to permutations and is used to produce a global feature for the whole point cloud.That global feature is appended on the output of a standard multilayer perceptron (MLP) that operates on the initial point-cloud features, and the resulting aggregated features are passed through another MLP in order to provide the final segmentation map.

Graph-Based Deep Learning Methods
In regard to graph-based convolution models, we compared our DMA-GCN approach to a series of baseline GCN architectures to carry out the global learning task.In addition, we considered in the comparisons the multilevel GCN with automatic graph learning (MGCN-AGL) method [30], used for the classification of hyperspectral remote sensing images.The MGCN-AGL approach takes a form similar to the one of the SEQ-BM in Figure 9 to combine the local learning via a GAT-based attention mechanism and global learning implemented by a GCN.A salient feature of this method is that the global contextual affinities are reconstructed based on the node representations obtained after completion of the local learning stage.
Regarding the network and optimization parameters used in our study, we opted for the following choices: after some initial experimentation, the parameter d controlling the node feature growth rate (Section 7) was set to d = 128, resulting in the input feature dimensionality progression 128 → 256 → 384 → 512 for M = 4 dense layers.All models were trained for 500 epochs using the Adam optimizer, with an early stopping criterion halting the training process either when no further improvement was detected on the validation error in the span of 50 epochs, or when the validation error steadily increased for more than 10 consecutive epochs.

Parameter Sensitivity Analysis
In this section, we examine the effect of critical hyperparameters in the performance of the models under examination.The numbers and figures presented for each hyperparameter correspond to results obtained while the remaining ones assumed their optimal determined value.

Number of Selected Atlases N A
The number of selected atlases is a crucial parameter for all methods adopting the multi-atlas framework.Figure 11 shows the effect on the performance of DMA-GCN(CT) and DMA-GCN(SEQ) by sampling the following values N A = 5, ..., 20.For both methods, the number of atlases has a similar effect on the overall performance.The highest score in each case is achieved for N A = 10 atlases and slowly diminishes as that number grows.Constructing the graph by sampling voxels from a small pool of atlases increases the bias of the model, thus failing to capture the underlying structure of the data.Increasing that number allows the image graphs to include a greater percentage of nodes with dissimilar feature descriptions, which enhances the expressive power of features.Accordingly, a moderate number of aligned atlases seems to provide the best overall rates, as it achieves a reasonable balance between bias and variance.

Number of Attention Heads
The number of attention heads is arguably one of the most influential parameters for the local units of our models.Figure 12 demonstrates the effect on performance for K = 0, 4, 8, 12.The trivial case of K = 0 corresponds to the case where the local convolutional units are disregarded, i.e., the convolution task is undertaken solely by the global units.The best performance is achieved for K = 8 for either DMA-GCN(CT) or DMA-GCN(SEQ).Most importantly, in both charts, we can notice a sharp drop in performance when K = 0.This indicates that discarding the local convolutional units significantly aggravates the overall efficiency of the DMA-GCN.Particularly, in that case, the model disregards the local contextual information contained in node libraries, while the node features are formed by applying graph convolutions at a global level exclusively.
It can also be noted that independent embeddings of the attention mechanism provide different representations of the local pairwise affinities between nodes, which facilitates a better aggregation of the local information.Nevertheless, beyond a threshold value, the results deteriorate, most likely due to overfitting.

Sparsity of Adjacency Matrix A
The adjacency matrix A is the backbone of graph neural networks in general, encoding the graph structure and node connectivity.As mentioned in [30], a densely connected A may have a negative impact on the overall segmentation performance.To this end, we evaluated a number of thresholds that served as cut-off points, discarding edges that were not sufficiently strong.A small threshold leaves most edges in the graph intact, while a larger one creates a sparser A by preserving only the most significant edges.Figure 13 summarizes the effects of thresholding A at different sparsity levels.As can be seen, large sparsity values considerably reduce the segmentation accuracy, which suggests that preserving only the strongest graph edges decreases the aggregation range from neighboring nodes, and hence the expressive power of the resulting models.On the other hand, similar deficiencies are incurred for low sparsity values with dense matrices A. In that case, the neighborhood's range is unduly expanded, thus allowing aggregations between nodes with weak spectral similarity.A moderate sparsity of the adjacency matrices corresponding to a threshold value of 0.50 attains the best results for both DMA-GCN(CT) and DMA-GCN(SEQ).

Number of Scales
The number of scales considered in conjunction with the local attention mechanism plays an important role in the overall performance of the DMA-GCN.It defines the size of spatial neighborhoods considered in local convolutional units, which greatly affects the resulting node feature representations.Figure 14 shows the segmentation rates for different values of scales s = 0, 1, 2. As can be seen, for both DMA-GCN(CT) and the DMA-GCN(SEQ) models, the incorporation of additional neighborhoods of progressively larger scales improves the accuracy results, consistently.In the trivial case of s = 0, each node aggregates local information by paying attention solely to its aligned root nodes.Due to the restricted attention, we were led to weak local representations of nodes, and hence degraded overall performance for the models (left columns).Incorporating the one-hop neighborhoods (s = 1), we expanded the range of the attention mechanism, which provided more enriched node features.This resulted in significantly better results compared to the previous case (middle columns).The above trend was further retained by including the two-hop neighborhoods of nodes (s = 2), where we could notice an even greater improvement of results (right columns).

Number of Dense Layers
In this section, we examine the effect of the number M of dense layers used in the DMA-GCN models (Section 7).It defines the depth of the networks and thus directly impacts the size, as well as the representational capabilities of the respective models.Figure 15 shows the results obtained by progressively using up to four dense layers.It should be noticed that the single layer results refer to the case where DMA-GCN(CT) and DMA-GCN(SEQ) coincide with their constituent block models, i.e., the CT-BM and SEQ-BM, respectively.Figure 15a clearly demonstrates an upwards trend in the obtained rates as the number dense layers increases.The lowest performance is unsurprisingly achieved for the shallow network of a single layer (CT-BM).The best results are achieved for M = 3, while the inclusion of an additional layer diminishes slightly the accuracy, an effect most likely attributed to overfitting.A similar pattern of improvements can also be observed for the case of DMA-GCN(SEQ) model.Concluding, the proposed densely connected block architectures lead to deeper GCN networks with multiple local-global convolutional layers, which can acquire more comprehensive node features, and thus offer better results.

Mini-Batch vs. Full-Batch Training
The goal of this section is twofold.First, we aim to compare the mini-batch against full-batch learning schemes (Section 8).Secondly, we examine the effect of the batch size on the performance of the mini-batch learning.Figure 16a,b presents the DSC metrics of the models DMA-GCN(CT) and DMA-GCN(SEQ), respectively, for varying batch sizes.The first four columns refer to the mini-batch learning, while the rightmost column represents the full-batch scenario.
Both figures share a similar pattern of results.Noticeably, in the mini-batch learning, the best results (DSC = 95%) were achieved for a batch size of 128.This value referred to the size of the initial batch (X T 0 ) as well as the out-of-sample batches extracted from the target T .Given these batches, we then proceeded to the formation of the sequences of aligned nodes and the sequence libraries at multiple scales, to generate the corresponding graphs of nodes used for convolutional learning.Finally, the full-batch learning provided a significantly inferior performance (DSC = 81%) compared to the mini-batch scenario.

Global Module Architecture
In this section, we examine the effectiveness of some of the popular graph-based networks in our approach.Concretely, in the context of DMA-GCN structures, we considered several model combinations, whereby the GAT-based attention and local information aggregation was used to carry out the local learning task, while the global task was undertaken by the GCN, SAGE, SAINT and ClustGCN, respectively.
Figure 17 shows the DSC measures for both DMA-GCN(CT) and DMA-GCN(SEQ) models.Table 1 also provides more detailed results on this issue.As can be seen, for both models, the utilization of GraphSAGE clearly provided the best performance, possibly due to its more sophisticated node sampling method.Nevertheless, all the alternatives offered consistently good results.According to the results, transductive learning significantly outperforms the inductive learning scenario, for both cartilage classes of interest and across all evaluations metrics.This can be attributed to the following reasons.First, corroborating the well-established finding of the literature, the superior rates underscore the importance of utilizing the SSL features of the unlabeled nodes in the training process, combined with those of labeled ones.Secondly, the refreshing learning stage applied in mini-batch learning (Section 10.3) allows the network to appropriately adjust to the newly observed out-of-sampling batches.On the other hand, the inductive model is trained once using the nearest neighbor image.This network is then used to segment the target image, by classifying the entire set of unlabeled batches from T .

Comparative Results
Table 3 presents extensive comparative results, contrasting our DMA-GCN models with traditional patch-based approaches, and state-of-the-art deep learning architectures established in the field of medical image segmentation.We also applied six graph-based convolution networks in the comparisons.These networks were used as standalone models solely to conduct global learning of the nodes.Finally, we applied the more integrated MGCCN model [30].For the DMA-GCN models, we applied the following parameter setting: N A = 10 atlases, K = 8 attention heads, S = 2 spatial scales, M = 3 dense layers, and transductive learning with a mini-batch size of 128.
Based on the results of Table 3, we should notice the more enhanced rates of GAT and MGCN compared with those of the other graph convolution networks, suggesting that the attention mechanism combined with a multi-scale consideration of the data can improve the model's performance.Furthermore, the proposed deep DMA-GCN(CT) and DMA-GCN(SEQ) models are both shown to outperform all competing methods in the experimental setup, achieving DSC f mrl = (95.71%,95.44%) and DSC tbl = (94.02%,93.87%), respectively, across all evaluation metrics and in both femoral and tibial segmentation.DMA-GCN(CT) provides a slightly better performance compared to the DMA-GCN(SEQ) indicating that the alternating combination of local-global convolutional units is more effective.However, both methods may fail to deliver satisfactory results in certain cases where the cartilage tissue is severely damaged or otherwise deformed.Figure 18 showcases an example of a successful application of both DMA-GCN(SEQ) and DMA-GCN(CT) models, along with a marginal case exhibiting suboptimal results, due to extreme cartilage thinning.

Conclusions and Future Work
In this paper, we presented the DMA-GCN for knee joint cartilage segmentation.Our models shared a number of attractive properties, such as the constructive integration of local-level and global-level learning, a densely connected structure, and adaptive graph learning.These features rendered the DMA-GCN capable of acquiring expensive node representations.A comparative analysis with various state-of-the-art deep learning and graph-based convolution networks validated the efficacy of the proposed approach.As future research, we intend to extent the current framework by focusing on hypergraph networks, which allow the incorporation of multiple views of graph data.

Figure 1 .
Figure 1. Outline of the proposed knee cartilage segmentation approach.It comprises the atlas subset selection (a), the graph construction part (b), a specific form of graph-based convolutional model (c), the adaptive graph learning (d), and the MLP network providing the class estimates for the segmentation of the target image.Black dots correspond to central nodes and colored nodes to neighboring ones, respectively.The encircled C symbol represents an aggregation function.

Figure 2 .
Figure 2. A typical knee MRI viewed in three orthogonal planes (left to right: sagittal, coronal, axial).

Figure 3 .
Figure 3. Schematic illustration of a sequence of aligned image graphs of root nodes, including the target graph (left) and the graphs of its corresponding atlases (right).There are local spatial affinities at aligned positions (horizontal axis), as well as global pairwise similarities between nodes located at different positions in the ROIs.

Figure 5 .
Figure 5. Schematic illustration of a sequence library for a specific scale s = 1, comprising the aligned neighborhoods from the target and the atlas images.Green arrows indicate the different scopes of the attention mechanism.For a particular root node, attention is paid to its own neighborhood, as well as the other neighborhoods in the sequence.

Figure 7 .
Figure 7. Detailed description of the local convolutional unit, which aggregates local contextual information from node neighborhoods at different spatial scales.

Figure 8 .
Figure 8. Illustration of the proposed cross-talk building model (CT-BM), where local and global convolutional units are combined following an alternating scheme.

Figure 9 .
Figure 9. Illustration of the proposed sequential building model (SEQ-BM), where the local and global learning tasks are carried out sequentially.

Figure 10 .
Figure 10.Description of the proposed DMA-GCN model, with densely connected block convolutional structure and residual skip connections.

Figure 18 .
Figure 18.Segmentation results for femoral (FC) and tibial (TC) cartilage for the two main proposed models (DMA-GCN(SEQ) and DMA-GCN(CT)).The first part of the figure illustrates a case of successful application of DMA-GCN on a healthy knee (KL grade 0), while the second and third parts correspond to more challenging subjects with moderate (KL grade 2) and severe (KL grade 4) osteoarthritis.(Left to right: ground truth, DMA-GCN(SEQ), DMA-CGN(CT)-color coding: pink → FC, white → TC).(a) Segmentation showcase-KL grade 0. (b) Segmentation showcase-KL grade 2. (c) Segmentation showcase-KL grade 4.
Its cardinality is |X s | = N r + |N s |, where N r is the number of root nodes, and |N s | the cardinality of N s .Further, X s corresponds to a subgraph G s (V s , E s ), with |V s | = |X s |.In this subgraph, connective edges are established in E s along the horizontal axis, namely, between root nodes and the neighboring nodes across the sequence libraries.Concluding, since the neighborhoods are by definition inclusive as the scale increases, the dataset X S contains the maximum number of nodes, forming the overall dataset X: X

Table 1 .
Summary of segmentation performance measures (means ± stds) of the two cartilage classes of our proposed method DMA-GCN (CT/SEQ) by varying the global component of the sub-modules (GCN, SAINT, SAGE, ClustGCN).Best results for each category (CT vs SEQ) with respect to DSC index are highlighted.

Table 1 .
Cont.In this final test case, we investigated the efficacy of the transductive against the inductive learning schemes (Section 8.1).Table2presents detailed results pertaining to both DMA-GCN(CT/SEQ) structures.

Table 2 .
Summary of segmentation performance measures (means ± stds) of the two cartilage classes of our proposed methods DMA-GCN (CT/SEQ) by varying the overall learning paradigm (transductive vs. inductive).Best results for each category (CT vs SEQ) with respect to DSC index are highlighted.

Table 3 .
Summary of segmentation performance measures (means ± stds) of the two cartilage classes of our proposed methods DMA-GCN (CT) and DMA-GCN (SEQ) compared to state of the art: 1. patch-based methods, 2. deep learning methods, 3. graph deep learning methods.Best results for all three categories (CT vs SEQ) with respect to DSC index are highlighted.