COCM: Co-Occurrence-Based Consistency Matching in Domain-Adaptive Segmentation

Abstract: This paper focuses on domain adaptation for semantic segmentation. Traditional methods treat the source domain and the target domain each as a whole, and the pairing of images is determined by random seeds. This yields a low degree of consistency between the paired images and hinders the reduction of the domain gap. We therefore design a two-step, three-level cascaded domain consistency matching strategy, co-occurrence-based consistency matching (COCM). In Step 1, we match from the perspective of category existence and filter, from the whole source domain, the image subset with the highest degree of matching. In Step 2, from the perspective of spatial existence, we propose the PIOU score to quantitatively evaluate the spatial matching of co-occurring categories in that subset and select the best-matching source image. The three levels address the importance of low-frequency categories in the matching process: we divide the categories into head, middle, and tail levels according to the frequency of their co-occurrence between domains, and give priority to matching tail categories. The proposed COCM maximizes the category-level consistency between the domains and is shown experimentally to reduce the domain gap while remaining lightweight. The experimental results on general datasets are comparable with those of state-of-the-art (SOTA) methods.


Introduction
Domain-adaptive semantic segmentation (DASS) has received extensive attention. It can greatly alleviate the high cost of manual annotation in dense prediction. Researchers have made significant progress in exploring methods for adaptation from a labeled source domain to an unlabeled target domain.
Our work addresses DASS methods based on adversarial training. Previous methods used vanilla generative adversarial networks (GANs) [1], patch GANs [2], and pixel-level GANs [3]. Here, we focus on the pixel-level GAN method. As shown in Figure 1a, the guidance for each pixel comes from two parts: the adversarial loss and the task loss. The former pursues a domain-invariant space, while the latter maintains the segmentation performance. The intuitive idea is that if the three pixels at the same position (in the target image, the source image, and the ground truth) belong to the same category, they produce more sufficient guidance. This kind of matching is called semantic-level consistency matching. The traditional DASS method regards the source domain and the target domain as a whole set, and the image matching is determined by randomly specified seeds. This matching method does not consider the semantic-level consistency between domains, which leads to negative transfer in domain adaptation. Therefore, we propose a co-occurrence-based consistency matching (COCM) method to maximize the inter-domain semantic consistency. COCM is a two-step, coarse-to-fine matching strategy: it selects the optimal source-domain image for each target-domain image by assessing which images share "more common categories" (existence) and whether those categories are "in the same position" (space). At the same time, we fully consider the imbalanced distribution of the co-occurrence frequencies of different categories, as shown in Figure 1b. The categories are divided into three levels according to the frequency of their inter-domain co-occurrence (head, middle, and tail), corresponding to the red, green, and blue areas in the figure. The matching priority is the reverse of the frequency order, which ensures that low-frequency categories contribute to the consistency matching.
Previous work on content-consistent matching (CCM) [4] matched source-domain images by clustering target images. However, that work focused more on global matching and did not consider inter-domain co-occurrence. Our method treats co-occurring categories differently in the consistency matching and adjusts for imbalanced distributions. As shown in Figure 2, the three columns are the target-domain image, the image matched by COCM, and the image matched with the vanilla method, respectively. The three rows show the matching at the different levels. The images in the second column largely meet the matching goal that the same location corresponds to the same category, while the semantic consistency of the matches in the third column is not satisfactory.
Our contributions are:
• We propose a new co-occurrence-based consistency matching (COCM) method. To the best of our knowledge, this is the first effort to explore image matching from the perspective of inter-domain category co-occurrence frequency.
• COCM is composed of a two-step cascade matching and a three-level priority strategy. Two-step refers to matching the optimal source image for the target-domain image through existence matching and spatial matching. Three-level refers to the priority adjustment for the imbalance of category co-occurrence.
• We design a new measurement, patch intersection over union (PIOU), to measure the spatial similarity between domains.
• Our method is lightweight and proved to be effective. The results on general datasets are comparable with SOTA methods.

Domain Adaptive Semantic Segmentation
Domain adaptive semantic segmentation (DASS) is one of the important applications of domain adaptation. The main purpose of this task is to obtain the optimal segmentation performance on the unlabeled target domain. Some previous methods used adversarial training to maximize domain invariance in the feature space [1,3] or the label space [5], some introduced style transfer to explore the adaptation of segmentation on the basis of style consistency, and some used self-supervised learning to achieve domain adaptation by exploring more accurate pseudo labels. Ref. [6] proposes contextual-relation consistent domain adaptation (CrCDA), which explicitly learns and enforces prototypical local contextual relations in the feature space of the labeled source domain and transfers them to the unlabeled target domain by adversarial learning. CrCDA applies co-occurrence frequency from the perspective of local contextual relations, whereas our method applies co-occurrence frequency from the perspective of image matching. Ref. [7] proposes pixel-level cycle association (PLCA), which establishes pixel-level cycle associations between source and target pixel pairs and contrastively strengthens the connection between them to reduce the domain gap. We are inspired by cycle association and apply the idea to class-level consistency matching between domains.

Image Matching Cross Domain
Cross-domain image matching refers to selecting, from the source domain, the image closest to the target domain according to some criterion. Feature distribution matching (FDM) [8] matches source-domain images from the perspective of color features. The work of [9] performs cross-domain image matching from the perspective of outlier detection. CCM [4] selects the positive images among the source-domain images. Its selection strategy is to cluster the target domain and randomly select 20% of the source images to score against the cluster centers of the target domain. Inspired by these image matching methods, our method matches over the entire source-domain dataset in two steps and proposes a new spatial similarity measure, PIOU.

Class-Imbalance Learning
Category imbalance refers to the situation in which the number of training samples varies greatly across categories [10]. Current research has proposed several solutions, such as re-sampling, cost-sensitive learning, and transfer learning. In the DASS task, some studies also pay attention to the imbalanced distribution. Class-balanced self-training (CBST) [11] adjusts for category imbalance in the process of generating pseudo labels. Our method comprehensively considers the imbalance of inter-domain co-occurring categories and rebalances the categories through three levels of priority.

The vanilla DASS algorithm based on adversarial training, shown in Figure 3, can be regarded as three steps in training:
1. The target-domain image I_t is input into F to obtain the segmentation header F(I_t), and F(I_t) is input into ASPP to output the prediction P(I_t).
2. The source-domain image I_s is input into the segmentation network, and the segmentation header F(I_s) and the result P(I_s) are output. For P(I_s), the cross-entropy loss of Formula (1) is used to maintain the performance of the segmentation network.
3. The segmentation headers F(I_t) and F(I_s) of the target domain and the source domain are input to the domain discriminator D. The function of the discriminator is to narrow the gap between the distributions of the source domain and the target domain and to maximize the shared information between domains. The discriminator is trained with the adversarial loss of Formula (2).

Vanilla methods regard the source domain and the target domain as a whole set, and the image matching is determined by a specified random seed in the dataloader module. Here, "regard the source domain and the target domain as a whole" means that when dealing with the domain gap, the entire source-domain image set is assumed to have a uniform distribution of categories, so that each source-domain image plays an equally important role in adaptation. In model training, source-domain and target-domain images are randomly selected as the input of the model. In COCM, the target domain still uses the dataloader to set the order, while the source-domain image is matched according to the prediction for the target image.
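For reference, the pixel-wise cross-entropy loss used in step 2 is conventionally written as below. This is the standard form, given here as an assumption rather than a quotation of Formula (1): Y_s is taken to be one-hot encoded, P(I_s) to be the softmax output, and the superscript (h, w, c) to index spatial position and category.

```latex
\mathcal{L}_{seg}(I_s, Y_s) = -\sum_{h=1}^{H} \sum_{w=1}^{W} \sum_{c=1}^{C} Y_s^{(h,w,c)} \, \log P(I_s)^{(h,w,c)}
```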


Co-Occurrence Based Consistent Matching
We show the overall framework of COCM in Figure 4. We can see from (a) that, compared with the vanilla method in Figure 3, we add the COCM module to match the source-domain image. The process of domain adaptation training after matching is the same as that of the vanilla method. In (b), we show the existence matching and the spatial matching, which we explain in detail next.

Existence Matching
Due to the domain gap, it cannot be guaranteed that each pair of matched images contains the same categories. Therefore, we first perform existence matching to screen out the subset with the highest degree of co-occurrence.
For the target-domain image, we obtain a binary vector E_tgt indicating whether each category exists according to the current prediction result P(I_t), as shown in Figure 4b. The vector has dimension 1 × C. If a bit is 1, the corresponding category exists in the image. If the bit is 0, the category does not exist. For the source domain, we obtain the category existence information of all images from the ground truth, so that each image corresponds to one such binary vector. The vectors of the entire dataset are pre-generated in the form of a matrix M_src.
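The construction of E_tgt and M_src can be sketched as follows. This is a minimal illustration under our own assumptions: label maps are integer arrays of class indices, an ignore label of 255 marks unlabeled pixels, and the function names are hypothetical.

```python
import numpy as np

def existence_vector(label_map, num_classes, ignore_index=255):
    """Return a length-C binary vector; entry c is 1 if class c occurs in label_map."""
    labels = np.unique(label_map)
    # Drop the (assumed) ignore label and anything outside the class range.
    labels = labels[(labels != ignore_index) & (labels < num_classes)]
    vec = np.zeros(num_classes, dtype=np.uint8)
    vec[labels] = 1
    return vec

def source_existence_matrix(gt_maps, num_classes):
    """Stack the existence vectors of all source ground-truth maps into M_src."""
    return np.stack([existence_vector(gt, num_classes) for gt in gt_maps])

# Toy example with C = 4 categories.
pred_t = np.array([[0, 0, 2], [2, 255, 2]])        # argmax of P(I_t)
e_tgt = existence_vector(pred_t, num_classes=4)    # categories 0 and 2 exist
gts = [np.array([[1, 1], [3, 3]]), np.array([[0, 2], [2, 2]])]
m_src = source_existence_matrix(gts, num_classes=4)
```

Because M_src depends only on the ground truth, it can be computed once before training, whereas E_tgt has to be recomputed from the current prediction at every iteration.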
For category-level existence matching, we compare the target-domain existence vector E_tgt against every row of the source-domain matrix M_src to find the candidate subset with the highest matching degree. We design the matching strategy with a three-level priority. According to the inter-domain co-occurrence frequency distribution shown in Figure 1, the categories are divided by thresholds into three levels: head, middle, and tail. In matching, if there is at least one common category in the tail level, the match is marked as a tail-level match and the number of common categories is counted. If the tail level does not match at all, we check whether the middle level has at least one common category and count the common categories there. If the middle level does not match either, we check whether the head level has at least one common category and count them.
Existence matching outputs the candidate image set with the largest number of common categories at the marked matching level. Existence matching thus performs a coarse preliminary screening over the entire source-domain dataset, and the subsequent spatial matching performs a fine selection from the perspective of location.
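The three-level priority can be sketched as below. The sketch makes several assumptions that the text does not fix: co-occurrence frequencies are normalized to [0, 1], categories at or above the upper threshold are head and those below the lower threshold are tail, and all source images are returned as candidates when no level has a common category. The function names are hypothetical.

```python
import numpy as np

def split_levels(freq, head_thr=0.9, tail_thr=0.3):
    """Partition category indices into head / middle / tail by co-occurrence frequency."""
    freq = np.asarray(freq, dtype=float)
    return {
        "head": np.flatnonzero(freq >= head_thr),
        "middle": np.flatnonzero((freq >= tail_thr) & (freq < head_thr)),
        "tail": np.flatnonzero(freq < tail_thr),
    }

def existence_match(e_tgt, m_src, levels):
    """Return (matched level, indices of source images with the most common categories)."""
    e = np.asarray(e_tgt).astype(bool)
    m = np.asarray(m_src).astype(bool)
    for name in ("tail", "middle", "head"):        # priority: tail > middle > head
        idx = levels[name]
        common = (m[:, idx] & e[idx]).sum(axis=1)  # common categories per source image
        best = common.max()
        if best > 0:
            return name, np.flatnonzero(common == best)
    # Assumed fallback: nothing in common at any level, keep every source image.
    return None, np.arange(m.shape[0])

levels = split_levels([0.95, 0.5, 0.1, 0.05])
m_src = np.array([[1, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 0, 1, 1],
                  [1, 1, 1, 1]])
level_a, cand_a = existence_match([1, 1, 1, 0], m_src, levels)
level_b, cand_b = existence_match([1, 1, 0, 0], m_src, levels)
```

In the first call the target contains tail category 2, so the match is decided at the tail level even though image 0 shares more head and middle categories. In the second call no tail category is present in the target, and the search falls through to the middle level.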

Spatial Matching
Spatial matching aims to achieve, to the maximum extent, the goal that the same location holds the same category. Unlike traditional pixel-level image matching, cross-domain image matching looks for the image with the same overall layout and the same position of the low-frequency categories. Therefore, "the same position" in COCM refers to one patch. Inspired by the classical measure MIOU, we propose PIOU to quantitatively measure the spatial similarity of common categories. The score is given by Formula (3), where i refers to a co-occurring category, PIOU_i is the patch intersection-over-union of that category, and w_type is a hyperparameter that adjusts the importance of low-frequency categories in matching.
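Written out under these definitions, a form of the score consistent with the description is given below. It is an assumed reconstruction rather than a quotation of Formula (3): the symbol for the set of co-occurring categories and the notation type(i) for the level (head, middle, or tail) of category i are introduced here for illustration.

```latex
\mathrm{Score}(I_t, I_s) = \sum_{i \in \mathcal{C}_{co}} w_{type(i)} \cdot \mathrm{PIOU}_i
```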
We partition each image into patches. For the target image, we divide it into H/N × W/N patches according to the current prediction result P(I_t), number the positions from 0 to H/N × W/N − 1, and record the indices of the patches covered by each category, where N is a hyperparameter that adjusts the size of the patch. For the source images, we divide the patches in advance according to the ground truth and record the category space information of all images.
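The patch bookkeeping can be illustrated with the sketch below. It assumes a regular grid given as a hypothetical (rows, cols) pair, row-major patch numbering, and that a category covers a patch as soon as one of its pixels falls inside it. None of these details is specified above.

```python
import numpy as np

def category_patches(label_map, num_classes, grid, ignore_index=255):
    """Map each category present in label_map to the set of patch indices it covers."""
    h, w = label_map.shape
    rows, cols = grid
    # Patch row / column of every pixel, then a row-major patch index per pixel.
    ys = np.arange(h) * rows // h
    xs = np.arange(w) * cols // w
    patch_id = ys[:, None] * cols + xs[None, :]
    covered = {}
    for c in np.unique(label_map):
        if c == ignore_index or c >= num_classes:
            continue
        covered[int(c)] = set(np.unique(patch_id[label_map == c]).tolist())
    return covered

# Toy 4 x 4 label map split into a 2 x 2 grid of patches.
label_map = np.array([[0, 0, 1, 1],
                      [0, 0, 1, 1],
                      [0, 0, 0, 0],
                      [2, 0, 0, 0]])
patches = category_patches(label_map, num_classes=3, grid=(2, 2))
```

The same function applies to a target prediction and to a source ground-truth map, so the source side can be precomputed once for the whole dataset.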
For each co-occurring category, we calculate the ratio of the intersection to the union of the patches it covers in the target-domain image and in the source-domain image, and we sum these ratios over all co-occurring categories. To highlight the contribution of low-frequency categories to the total score, we weight the categories differently. Finally, we output the image with the highest score as the source-domain image corresponding to the current target-domain image. To avoid repeatedly selecting the same source-domain image, we record the images already selected for each target-domain image and exclude them from the candidate list before each match.
To better present the calculation of PIOU, we chose a set of representative images in Figure 5. From existence matching, we know that the co-occurring categories are rider and motorcycle. Here we calculate the score of rider. The upper left is the target-domain image, the lower left is the source-domain image to be scored, the upper middle is the prediction result of the target domain after patch division with N = 6, and the lower middle is the ground truth of the source-domain image after patch division. We use black lines to mark the patch division in the middle images. The area marked by white lines is the area covered by the rider in the prediction result and in the ground truth. The upper right blue line marks the intersection area, and the lower right yellow line marks the union area. We use the number of patches to measure the spatial matching of rider, so PIOU_rider = 6/8 = 0.75.
We choose patch-level rather than pixel-level spatial matching for the following reasons. First, the computational cost of pixel-level position matching is too high, which seriously affects the selection efficiency. Second, the prediction result of the target domain is not completely accurate, and patch-level matching provides a certain tolerance for wrongly predicted pixels. Third, patch matching is a relaxation strategy. As long as the prediction results indicate the approximate positions of the co-occurring categories, COCM can match their corresponding images in the source domain.
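The PIOU calculation and the weighted score can be sketched as follows. This is an illustration under assumptions: each category is represented by the set of patch indices it covers, the weights are passed as a hypothetical per-category dictionary, and the patch indices in the example are invented so that the intersection has 6 patches and the union has 8, as in the rider example of Figure 5.

```python
def piou(tgt_patches, src_patches):
    """Patch intersection-over-union of one category between two images."""
    union = tgt_patches | src_patches
    if not union:
        return 0.0
    return len(tgt_patches & src_patches) / len(union)

def spatial_score(tgt, src, weights):
    """Weighted sum of PIOU over the categories present in both images.

    tgt, src: dicts mapping category -> set of covered patch indices.
    weights:  dict mapping category -> weight (1.0 if absent).
    """
    common = tgt.keys() & src.keys()
    return sum(weights.get(c, 1.0) * piou(tgt[c], src[c]) for c in common)

# Hypothetical patch indices: 6 shared patches, 8 patches in the union.
tgt = {"rider": {0, 1, 2, 3, 4, 5, 6}}
src = {"rider": {1, 2, 3, 4, 5, 6, 7}}
piou_rider = piou(tgt["rider"], src["rider"])    # 6 / 8 = 0.75
score = spatial_score(tgt, src, {"rider": 3.0})  # 3.0 * 0.75 = 2.25
```

Working on sets of patch indices keeps the comparison cheap: its cost depends on the number of patches rather than on the number of pixels, which is the efficiency argument made above.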

Training Procedure of COCM
The training procedure of our proposed method is summarized in Algorithm 1.
6: T_Iter obtains the current batch of I_t, which is input into the segmentation network F + ASPP to obtain P(I_t).
7: Existence matching: calculate the category existence vector V(I_t) of P(I_t), and traverse M_exist to find the subset I_S^SUB with the most matching digits according to the priority tail > middle > head.
8: Spatial matching: calculate the category location tuple T(I_t) and traverse I_S^SUB. Find the image set I_S^MAX with the highest score according to Formula (3) and M_space, and randomly select one of its images as the corresponding image I_s.
9: Input I_s into the segmentation network F + ASPP to obtain P(I_s), and calculate the segmentation loss in Formula (1).
10: Input I_t and I_s into the discriminator D, and calculate the adversarial loss in Formula (2).
11: By alternately training F + ASPP and D, F + ASPP is encouraged to generate domain-invariant features.
12: Update the frequency of inter-domain category co-occurrence after a fixed number of iterations.
13: end for
14: return θ_T

Experiments

Datasets
Following common practice in DASS, we choose Cityscapes [12] as the target domain. The dataset contains 2975 training images and 500 validation images. Images were collected from 50 cities, including Aachen, Bochum, and Bremen, at a resolution of 1024 × 2048. The image set has 30 predefined categories, 19 of which are used in the semantic segmentation task.
For the source domain, we use the GTA5 [13] and SYNTHIA [14] datasets. The GTA5 dataset obtains street-view images from the commercial game Grand Theft Auto V and generates a large number of high-resolution annotated images by computer graphics technology. GTA5 contains 24,966 images with a resolution of 1914 × 1052. The image set shares 19 categories with Cityscapes. SYNTHIA is an urban street-view dataset rendered with the Unity development platform, with a resolution of 1280 × 960. Here we use the subset SYNTHIA-RAND-CITYSCAPES because its annotation space corresponds to that of Cityscapes. This subset contains 9400 images.

Implementation Details
The backbone of the segmentation network adopts the ResNet-101 [15] model based on the DeepLab-V2 [16] structure. We use the segmentation model pre-trained on ImageNet [17] as the initial state. For the segmentation network, our structure includes five convolution layers with a kernel size of 4, channel numbers of {64, 128, 256, 512, 1}, and a stride of 2, which is similar to the structure of AdaptSegNet [5]. Following the training setting of AdaptSegNet [5], the optimizer of the feature extractor is SGD [18] with a momentum of 0.9, a weight decay of 10 × 10^−4, and an initial learning rate of 2.5 × 10^−4, decayed with the poly learning rate policy. For the discriminator, our structure consists of three convolution layers with a kernel size of 3, channel numbers of {256, 128, 2C} (C refers to the number of categories), and a stride of 1, which is similar to the structure of [3]. The optimizer of the discriminator is Adam [19] with β_1 = 0.9, β_2 = 0.99, a weight decay of 10 × 10^−4, and an initial learning rate of 2.5 × 10^−4, also decayed with the poly learning rate policy.
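For readers unfamiliar with the poly learning rate policy, a minimal sketch follows. The exponent power = 0.9 is a common choice in DeepLab-style training and is an assumption here, as is the hypothetical function name. The text above specifies only the initial learning rate.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial decay: base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# Learning rate at the start, middle, and end of a hypothetical 100k-iteration run.
schedule = [poly_lr(2.5e-4, it, 100_000) for it in (0, 50_000, 100_000)]
```

The rate starts at the initial value and decays smoothly to zero at the final iteration.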
We set the batch size to 6 in GTA5 → Cityscapes and 4 in SYNTHIA → Cityscapes. The crop size is set to 1024 × 512 in the target domain and 1280 × 760 in the source domain. The hyperparameter λ_adv is set to 0.01. The thresholds of {head, middle, tail} are set to {0.9, 0.3} in GTA5 → Cityscapes and {0.9, 0.5} in SYNTHIA → Cityscapes. We first train the segmentation network on the source domain with supervision, use it as the initialization of domain adaptation, and update the frequency of inter-domain category co-occurrence every 2000 iterations. To further improve the performance, we use self-distillation [20] with multi-scale inputs in the testing stage. Our experiments are implemented with the PyTorch library on an RTX 3090 with 24 GB of memory.
For evaluation, we use the metric commonly used in DASS, mean intersection-over-union (MIOU) [21]. Intersection-over-union (IOU) evaluates the accuracy of each class, and MIOU is the average of the IOU values over all classes.
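As a concrete reference, IOU and MIOU can be computed from a confusion matrix as in the sketch below. The ignore label of 255 and the exclusion of classes that appear in neither prediction nor ground truth from the mean are conventions assumed here, not statements about the evaluation code used in this work.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """cm[i, j] counts pixels whose ground truth is class i and prediction is class j."""
    mask = gt != ignore_index
    idx = gt[mask].astype(np.int64) * num_classes + pred[mask].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_and_miou(cm):
    """Per-class IOU = TP / (TP + FP + FN); MIOU averages over classes that occur."""
    tp = np.diag(cm).astype(float)
    denom = cm.sum(axis=0) + cm.sum(axis=1) - np.diag(cm)
    iou = np.full(cm.shape[0], np.nan)
    valid = denom > 0
    iou[valid] = tp[valid] / denom[valid]
    return iou, float(np.nanmean(iou))

gt = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
iou, miou = iou_and_miou(confusion_matrix(pred, gt, num_classes=2))
```

In this toy case class 0 has one true positive and one false negative (IOU 1/2), and class 1 has two true positives and one false positive (IOU 2/3).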

From GTA5 Adapt to Cityscapes
In Table 1, we can see that our method reaches 51.1% MIoU on the validation set and 52.6% MIoU on the test set, and that it is the best on eight categories. Categories with high-frequency co-occurrence, such as road, building, vegetation, terrain, sky, person, and car, achieve results close to the oracle, which demonstrates the effectiveness of our method. Meanwhile, low-frequency co-occurrence categories, such as rider and motor, also improve significantly. We also observed that adaptation in some categories with low-frequency co-occurrence, such as traffic sign or train, was not satisfactory. We found that the feature similarity of these categories between the domains is very low, to the point that they are difficult to associate even by eye. The bus in the GTA5 dataset even looks more similar to the train in the Cityscapes dataset. We consider that image semantic matching cannot improve the cross-domain performance of feature-dissimilar categories and unseen categories, but can significantly improve the performance of categories with similar features. This is also the disadvantage of consistent image matching between domains.
Table 2 shows the quantitative comparison on the SYNTHIA → Cityscapes task. Our proposed method achieves 52.7% MIoU on the validation set, where the road category is the best and the performance on high-frequency categories such as sidewalk, traffic sign, sky, car, and motor ranks second. This is basically in line with our expectations. The overall adaptation performance on the SYNTHIA dataset is not as good as that on GTA5. We believe that this is due to the relatively low degree of realism of the image set.
Figure 6 shows qualitative adaptation results on the GTA5 → Cityscapes task. Here we selected three representative images to cover different categories. From left to right, the columns show the target image, the source-only result, the adapted result with COCM, and the ground-truth image. The rows are chosen to visualize the domain adaptation of the categories bus, bicycle, sidewalk, and traffic light. The first row shows the improvement on the bus category. Our method covers a large area of the bus, although there is still some confusion with the train category. It can be seen from the second row that the bicycle is separated after adaptation, although it still overlaps the car to a certain extent. The performance on traffic lights in the third row is significantly improved, and even the outline is clear. In addition, we can see from the three rows that the segmentation of road and sidewalk is smoother and more coherent. These observations are consistent with our previous analysis.

Image Matching Visualization
In Figure 7, we show images of the target domain and, for each, the corresponding source-domain images selected over five consecutive matches. In the first row, the target-domain image contains the tail-level category rider, which appears in the corresponding positions of the following five images. The second row shows the image matching of the car category, which basically realizes the goal of the same category in the same location, although the third matched image is a confused match that contains a truck at the corresponding location. The third row shows the matching of the bus category. We can see that the bus appears in the corresponding position of the first, second, fourth, and fifth source images, while in the third it appears on the left of the image. The overall layout of these images, including road, building, and sky, is roughly similar. The tail category basically appears in the corresponding position, which is in line with our expectation.
Figure 8b shows negative examples. From left to right, they are the unseen category, the feature-dissimilar category, and the false prediction. The coach in the left part is an unseen category for the source domain, so the images in the source domain are not able to provide guidance. In the middle part, train represents the categories whose features are dissimilar between the domains. The train in the GTA5 domain is dissimilar to the train in Cityscapes, but is similar to the bus in Cityscapes. Therefore, the prediction result is insufficient to guide COCM to match the appropriate source-domain image. The right part shows a prediction error on car. Since large areas of road are incorrectly predicted to be car, COCM is guided to match a source-domain image containing a large area of car. Among these three negative examples, only the false prediction can be corrected as training proceeds. This is consistent with the performance in the quantitative experiments. For unseen categories and feature-dissimilar categories, further adaptation by means of multi-source domains and few-shot learning is needed. This is the weakness of COCM and our future research direction.

Ablation and Parameter Studies
We performed ablation experiments on the GTA5 → Cityscapes adaptive semantic segmentation task to verify each component, and the ablation results are shown in Table 3. Here, COCM is divided into EM and SM: SP denotes source-domain pre-training, EM denotes existence matching, SM denotes spatial matching, and SD denotes the self-distillation strategy. We can see that EM alone achieves an improvement of 10.1% MIOU and SM alone achieves an improvement of 8.3% MIOU. EM + SM achieves an improvement of 11.8% MIOU, and further combining them with SD achieves 14.9% MIOU, which verifies the effectiveness of our method.
To study the impact of the patch partition size on COCM performance, we tested different patch sizes. As can be seen from Table 4, the smaller N is, the larger the corresponding patch size and the smaller the number of patches in a single image. We observe that the larger N is, the smaller the proportion of selected images in the source domain. The performance is optimal when N = 6.
Further, we discuss the weight parameter w_type, which represents the contribution of non-head categories to the PIOU score. As shown in Figure 9, the adaptation performance improves as the contribution of non-head categories increases and is optimal when w_type is 3. Following the work of FADA, we also studied the effect of the parameter T, which represents the degree of smoothness of the distribution of the prediction results over the categories. As can be seen on the right of Figure 9, the performance is optimal when T is 1.6.
Moreover, we counted the changes in the co-occurrence frequency of category existence before and after the use of COCM on the GTA5 → Cityscapes task. In Figure 10, we can see that our method significantly improves the co-occurrence frequency of almost all tail categories. This is consistent with our previous analysis and demonstrates the effectiveness of our method.
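One common way to realize such a smoothness parameter is to divide the logits by T before the softmax. The sketch below assumes this is how T enters the prediction, and the function name is hypothetical. It illustrates the smoothing effect and does not reproduce the training code.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over the last axis after dividing the logits by temperature T."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([4.0, 1.0, 0.5])
p_sharp = softmax_with_temperature(logits, T=1.0)
p_smooth = softmax_with_temperature(logits, T=1.6)   # T > 1 flattens the distribution
```

With T greater than 1 the probability of the top category decreases and the remaining categories receive more mass, which is what "smoother" means in this context.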

Conclusions
Our method focuses on the category-level consistent matching of inter-domain images and designs a three-level, two-step cascade matching strategy, COCM, to select images that satisfy the same location and category to the maximum extent. In this process, we treat co-occurring categories in an imbalance-aware way and propose the PIOU measure for spatial matching. Our method effectively improves the class-level co-occurrence between domains. Experiments show that we reduce the domain gap on most semantic categories. At the same time, we also analyze the disadvantage of our method: the effect on unseen categories and feature-dissimilar categories is not satisfactory. Therefore, in future work, the method can be further improved by means of multi-source domains and few-shot learning. The multi-source domain method can compensate for categories unseen in a single source domain and improve the guidance from the source. Few-shot learning can correct the feature-dissimilar categories in the source domain and adjust the deviation generated by the model.

Figure 1. (a) A group of images during training. From left to right, they are the target image, source image, and pixel-level ground truth. (b) The imbalanced distribution of the co-occurrence frequency in different categories.

Figure 2. Examples of co-occurrence-based consistency matching on three levels.

Figure 3. Vanilla DASS training flow based on adversarial training. Steps 1, 2, and 3 in the figure are explained in detail in Section 3.1.

Figure 4. Overall framework of co-occurrence-based consistency matching (COCM). The left side is the proposed training flow. Our proposed COCM is inserted after step 1 of the vanilla DASS algorithm as step 2. Therefore, steps 2 and 3 of the vanilla DASS algorithm become steps 3 and 4. The right side shows the details of COCM; from top to bottom are the two steps of existence matching and spatial matching.

Algorithm 1: Training procedure of COCM.
Input: The source image set I_s and ground truth Y_s; the target image set I_t; the source-domain parameter θ_S; the iteration number T; the thresholds of head/middle/tail.
Output: Adapted target-domain segmentation network parameter θ_T
1: Train on the source domain with supervision and share θ_S with θ_T.
2: Use Y_s to calculate the existence information M_exist and the spatial information M_space of the source image set.
3: Initialize the category co-occurrence frequency with the source-domain category frequency.
4: Generate the target-domain image iterator T_Iter with a random seed.
5: for iteration 1 to T do

Figure 7. Target-domain image and the five corresponding images matched by COCM.

Matching Examples
To further illustrate the advantages and disadvantages of the COCM method, we selected a group of positive matching examples and negative matching examples for visualization. Positive examples can be seen in Figure 8a: the prediction results (upper right of each group of figures) indicate the approximate positions of the categories, and COCM can match the images of their source-domain counterparts. This is in line with our expectations.

Figure 8. Positive matching examples and negative matching examples. (a) Positive examples. From left to right are matching examples of rider and bicycle, car, bus, and traffic sign. (b) Negative examples. From left to right are matching examples of an unseen category, a feature-dissimilar category, and false predictions.

Figure 9. Parameter study on w_type and T. Here w_type = 1 means equally important, and 2, 3, and 4 indicate different degrees of importance.

Figure 10. Category co-occurrence comparison. The blue line represents the vanilla method and the yellow line represents the COCM method.
First, we define the symbols involved in COCM: the source-domain image I_s and ground truth Y_s, the target-domain image I_t, the feature extractor F, the segmentation module atrous spatial pyramid pooling (ASPP), and the domain discriminator D, where the segmentation network is composed of F + ASPP. H, W, and C denote the height, width, and number of categories, respectively. The vanilla DASS algorithm based on adversarial training is shown in Figure 3.

Table 4. Parameter study of patch partition on the GTA5 → Cityscapes adaptive semantic segmentation task.