Dense Out-of-Distribution Detection by Robust Learning on Synthetic Negative Data

Standard machine learning is unable to accommodate inputs which do not belong to the training distribution. The resulting models often give rise to confident incorrect predictions which may lead to devastating consequences. This problem is especially demanding in the context of dense prediction since input images may be only partially anomalous. Previous work has addressed dense out-of-distribution detection by discriminative training with respect to off-the-shelf negative datasets. However, real negative data may lead to over-optimistic evaluation due to possible overlap with test anomalies. We therefore extend this approach by generating synthetic negative patches along the border of the inlier manifold. We leverage a jointly trained normalizing flow due to its coverage-oriented learning objective and its capability to generate samples at different resolutions. We detect anomalies according to a principled information-theoretic criterion which can be consistently applied through training and inference. The resulting models set the new state of the art on benchmarks for out-of-distribution detection in road-driving scenes and remote sensing imagery despite minimal computational overhead.


I. INTRODUCTION
Image understanding involves recognizing objects and localizing them down to the pixel level [1]. In its basic form, the task is to classify each pixel into one of K predefined classes, which is also known as semantic segmentation [2]. Recent work improves perception quality through instance recognition [3], depth reconstruction [4], semantic forecasting [5], and competence in the open world [6].
Modern semantic segmentation approaches [2] are based on deep learning. A deep model for semantic segmentation maps the input RGB image x ∈ R^(3×H×W) into the corresponding prediction y ∈ [0, 1]^(K×H×W). Typically, the model parameters θ are obtained by gradient optimization of a supervised discriminative objective based on maximum likelihood. Recent approaches produce high-fidelity segmentations of large images in real time even when inferring on a modest GPU [7]. However, standard learning is susceptible to overconfidence in incorrect predictions [8], which may make the model unusable in the presence of semantic outliers [9] and domain shift [10]. This poses a threat to models deployed in the real world [11], [12].
We study the ability of deep models for natural image understanding to deal with out-of-distribution (OOD) input. We aim to correctly segment the scene while simultaneously detecting anomalous objects which are unlike any scenery from the training dataset [13]. Such capability is important in real-world applications like road driving [14], [15] and remote sensing [16], [17].
Previous approaches to dense OOD detection rely on Bayesian modeling [18], image resynthesis [14], [19], [20], recognition in the latent space [12], or auxiliary negative training data [21]. However, all these approaches have significant shortcomings. Bayesian approaches and image resynthesis require extraordinary computational resources that hamper development and make them unsuitable for real-time applications. Recognition in the latent space [12] may be sensitive to feature collapse [22], [23] due to relying on pre-trained features. Training on auxiliary negative data may give rise to undesirable bias and over-optimistic evaluation. Moreover, appropriate negative data may be unavailable in some application areas such as medical diagnostics [24] or remote sensing [16], [25]. Our experiments suggest that synthetic negatives can help in such cases.
This work addresses dense out-of-distribution detection by encouraging the chosen standard dense prediction model to emit uniform predictions in outliers [26]. We propose to perform the training on mixed-content images [21] which we craft by pasting synthetic negatives into inlier training images. We learn to generate synthetic negatives by jointly optimizing for high inlier likelihood and uniform discriminative prediction [26]. We argue that normalizing flows are better suited than GANs for the task at hand due to much better distribution coverage and more stable training. Additionally, normalizing flows can generate samples of variable spatial dimensions [27], which makes them suitable for mimicking anomalies of varying size. This paper proposes five major improvements over our preliminary report [28]. First, we show that the Jensen-Shannon divergence is the criterion of choice for robust joint learning in the presence of noisy synthetic negatives. We use the same criterion during inference, as a score for OOD detection. Second, we propose to discourage overfitting of the discriminative model to synthetic outliers through separate pre-training of the discriminative model and the generative flow. Third, we offer theoretical evidence for the advantage of our coverage-oriented synthetic negatives with respect to their adversarial counterparts. Fourth, we demonstrate the utility of synthetic outliers by performing experiments within the domain of remote sensing. These experiments show that off-the-shelf negative datasets such as ImageNet, COCO or ADE20k do not represent a suitable source of negative content for all possible domains. Fifth, we show that training with synthetic negatives increases the separation between knowns and unknowns in the logit space, which makes our method a prominent component of future dense open-set recognition systems. We refer to the consolidated method as NFlowJS. NFlowJS achieves state-of-the-art performance on benchmarks for dense OOD detection in road-driving scenes [11], [12] and remote sensing images [16], despite abstaining from auxiliary negative data [21], image resynthesis [14], [19] and Bayesian modeling [18]. Our method has a very low overhead over the standard discriminative model, making it suitable for real-time applications.

II. RELATED WORK
Several computer vision tasks require detection of unknown visual concepts (Sec. II-A). In practice, this often has to be integrated with some primary classification task (Sec. II-B and II-C). Our method generates synthetic negatives with a normalizing flow due to its outstanding distribution coverage and capability to work at arbitrary resolutions (Sec. II-D).

A. Anomaly detection
Anomaly detection, also known as novelty or out-of-distribution (OOD) detection, is a binary classification task which discriminates inliers from outliers [29], [30]. In-distribution samples, also known as inliers, are generated by the same generative process as the training data. In contrast, anomalies are generated by a process which is disjoint from the training distribution [31]. Samples of anomalous data may or may not be present during the training [32], [33]. The detection is typically carried out by thresholding some OOD score s_δ : [0, 1]^(3×H×W) → R which assigns a scalar score to each test sample. Some works address OOD detection in isolation, as a distinct computer vision task [31], [34]-[39]. Our work considers a different context where OOD detection is jointly solved with some discriminative task.

B. Classification in presence of outliers
OOD detection [40] can be implemented by extending standard classifiers. The resulting models can differentiate inliers while also detecting anomalous content. A widely used baseline expresses the OOD score directly from discriminative predictions as s(x) = max softmax(f_θ(x)) [40]. Entropy-based detectors can deliver similar performance [41], [42]. Another line of work improves upon these baselines by pre-processing the input with anti-adversarial perturbations [32]. Such perturbations cause significant computational overhead.
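The max-softmax baseline can be sketched in a few lines of numpy; here the score is negated so that higher values indicate anomalies, which is an orientation choice of this sketch rather than of the cited work:

```python
import numpy as np

def softmax(logits, axis=-1):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def max_softmax_score(logits):
    # OOD score: 1 - max softmax(f_theta(x)); higher means more anomalous
    return 1.0 - softmax(logits).max(axis=-1)

# a confident inlier prediction vs. a near-uniform (anomalous) one
inlier = np.array([8.0, 0.5, 0.2])
outlier = np.array([1.1, 1.0, 0.9])
assert max_softmax_score(inlier) < max_softmax_score(outlier)
```

Detection then amounts to thresholding this score on a held-out validation set.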
OOD detection has to deal with the fact that outliers and inliers may be indistinguishable in the feature space [23]. Feature collapse [22], [43] can be alleviated by training on negative data which can be sourced from real datasets [33], [42] or generative models [26], [28], [44].
There are two prior approaches for replacing real negatives with synthetic ones [26], [45]. A seminal approach [26] proposes cooperative training of a generative adversarial network and a standard classifier. The classifier loss requires uniform predictions in generated samples and thus encourages the generator to yield samples at the distribution border. This idea can be carried out without a separate generative model, by leveraging Langevin sampling [45]. However, adapting these approaches for dense prediction is not straightforward.
Similarly, synthetic outliers can be generated in the feature space by fitting a GMM to known features [44]. However, our experiments indicate that this approach underperforms with respect to synthetic negative samples in the input space.
Out-of-distribution detection gets even more complicated in the case of object detection and dense prediction where we have to deal with outlier objects in inlier scenes. These models strive to detect unknown hazards while correctly recognizing the rest of the scene [46]-[48]. A principled Bayesian approach to dense OOD detection attempts to estimate epistemic uncertainty [18]. However, the assumption that MC dropout corresponds to Bayesian model sampling may not be satisfied in practice. Another principled approach builds on likelihood estimation in the feature space [12]. However, this may be vulnerable to feature collapse [22].
Another line of work resynthesizes the input scene by processing dense predictions with a conditional generative model [14], [19], [49]. Subsequently, anomalous pixels are detected in a reconstructive fashion [30] by measuring dissimilarity between the input and the resynthesized image. Still, these approaches can detect anomalies only in front of simple backgrounds such as roads. Also, resynthesis requires a significant computational budget which limits real-world applications. A related approach utilizes a parallel upsampling path for input reconstruction [15]. This improves inference speed with respect to resynthesis approaches but still infers slower than our approach while underperforming in cluttered scenes.
Several approaches train on mixed-content images obtained by pasting negative patches into positive training examples [19], [21], [50]. The negative dataset should be as broad as possible (e.g. ImageNet or ADE20k) in order to cover a large portion of the background distribution. The training can be implemented through a separate OOD head [21] or by requiring uniform prediction in negative pixels [50]. However, this kind of training results in biased models: test anomalies that are related to negative training data are going to give rise to above-average outlier detection performance. Furthermore, competition on popular benchmarks may gradually adapt negative training data to test anomalies, and thus lead to over-optimistic performance estimates. Our method avoids the bias of particular negative data by crafting problem-specific negative samples at the border of the inlier distribution.

C. Open-set recognition
Open-set recognition [51] discourages excessive generalization for known classes and attempts to distinguish them from the remaining visual content of the open world. This goal can be achieved by rejecting classification in input samples which do not belong to the known taxonomy [51]-[54]. The rejection mechanism is usually implemented by restricting the shape of the decision boundary [55]. This can be carried out by thresholding the distance from learned class prototypes in the embedding space [56], [57]. The decision boundary can also be restricted by requiring a sufficiently large projection of the feature vector onto the closest class prototype [58]. This is also known as the max-logit detector which can be equally used for OOD detection and open-set recognition [58], [59].
Open-set recognition performance can be further improved by employing a stronger classifier [59] or training on negative data [60], [61]. Unlike OOD detection approaches based on softmax, open-set recognition methods provably bound open-space risk [51], [62]. However, these approaches are still vulnerable to feature collapse [22]. We direct the reader to [63], [64] for a broader overview of open-set recognition. Open-world approaches attempt to disentangle the detected unknown concepts towards new semantic classes. This can be done in incremental [6], [65] or low-shot [66]-[68] settings.
Although we mainly focus on OOD detection, our synthetic negatives could be considered as synthetic known unknowns [60], [61]. Our experimental evaluation suggests that our synthetic negatives increase the separation between known and unknown data in the feature space. This suggests that they may be helpful for open-set recognition [58], [59].

D. Generative models for synthetic negative data
We briefly review generative approaches and discuss their suitability for generating synthetic negative training samples. Energy-based [69] and auto-regressive [70] approaches are unsuitable for this task due to slow sampling. Gaussian mixtures are capable of generating synthetic samples in the feature space [44]. VAEs [71] struggle with unstable training [72] and have to store both the encoder and the decoder in GPU memory. GANs [73] also require substantial GPU memory, and the produced samples do not span the entire support of the training distribution [43]. On the contrary, normalizing flows [27] offer efficient sampling and outstanding distribution coverage [74].
Normalizing flows [27], [75] model the likelihood as a bijective mapping towards a predefined latent distribution p(z), typically a fully factorized Gaussian. Given a diffeomorphism f_γ, the likelihood is defined according to the change of variables formula:

p_γ(x) = p(z) · |det(∂z/∂x)|, where z = f_γ(x). (1)

This setup can be further improved by introducing stochastic skip connections which increase the efficiency of training and improve convergence speed [74].
A normalizing flow f_γ can be sampled in two steps. First, we sample the latent distribution to obtain the factorized latent tensor z. Second, we recover the corresponding image through the inverse transformation x = f_γ^(-1)(z). Both the latent representation and the generated image have the same dimensionality (f_γ^(-1) : R^(3×H×W) → [0, 1]^(3×H×W)). This property is useful for generating synthetic negatives since it allows sampling the same model at different spatial resolutions [27].
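The two-step sampling procedure can be sketched as follows. A trivial elementwise sigmoid stands in for the learned inverse flow; this stand-in is purely illustrative and is not the DenseFlow model used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def inverse_flow(z):
    # stand-in for f_gamma^{-1}: an elementwise sigmoid, a bijection
    # R -> (0, 1) that preserves tensor shape like a real flow would
    return 1.0 / (1.0 + np.exp(-z))

def sample_negative_patch(h, w):
    # Step 1: sample the factorized Gaussian latent at the desired resolution.
    z = rng.standard_normal((3, h, w))
    # Step 2: decode through the inverse transformation; shape is preserved.
    return inverse_flow(z)

small = sample_negative_patch(16, 16)
large = sample_negative_patch(64, 128)
assert small.shape == (3, 16, 16) and large.shape == (3, 64, 128)
```

Because the latent and the image share dimensionality, the same model yields patches of any requested spatial size.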

III. DENSE OOD DETECTION WITH NFLOWJS
We train dense OOD detection on mixed-content images obtained by pasting synthetic negatives into regular training images. We generate such negatives with a jointly trained normalizing flow (Sec. III-A). We train our models to recognize outliers according to a robust information-theoretic criterion (Sec. III-B), and use the same criterion as our OOD score during inference (Sec. III-C). Finally, we present a theoretical analysis which advocates for training with synthetic negatives generated through likelihood maximization (Sec. III-D).

A. Training with synthetic negative data
We assemble a mixed-content image x′ by sampling a randomly sized negative patch x− from a jointly trained normalizing flow f_γ, and pasting it atop the inlier image x+:

x′ = (1 − s) ⊙ x+ + s ⊙ x−. (2)

The binary mask s identifies pixels of the pasted synthetic negative patch within the mixed-content image. As usual in normalizing flows, z is sampled from a factorized Gaussian and reshaped according to the desired spatial resolution. The negative patch x− is zero-padded in order to allow pasting by addition. The pasting location is selected randomly. We train our discriminative model by minimizing cross-entropy over inliers (s_ij = 0) and maximizing prediction entropy in pasted negatives (s_ij = 1) [26], [33], [42]:

L_disc = − Σ_{ij : s_ij = 0} ln P_θ(y_ij | x′) + λ Σ_{ij : s_ij = 1} L_neg(P_θ(y | x′)_ij). (3)

We jointly train the normalizing flow alongside the primary discriminative model (cf. Figure 1) in order to satisfy two opposing criteria. First, the normalizing flow should maximize the likelihood of inlier patches. Second, the discriminative model should yield a uniform distribution in generated pixels. The former criterion aligns the generative distribution with the inliers, while the latter pulls them apart. Such training encourages generation of synthetic samples at the boundary of the training distribution and incorporates outlier awareness within the primary discriminative model [26]. The total loss applied to the generative model equals:

L_flow = L_nll + λ Σ_{ij : s_ij = 1} L_neg(P_θ(y | x′)_ij). (4)

L_nll is the negative log-likelihood of the inlier patch which gets replaced with the synthetic sample. Formally, we have L_nll = − ln p_γ(x_p), where x_p denotes the replaced inlier patch and p_γ is defined in (1). We scrutinize L_neg in the following section. The end-to-end training procedure minimizes the following loss:

L = L_disc + L_flow. (5)

Given enough training data and appropriate capacity, our synthetic negatives are going to encompass the inlier manifold. Consequently, our method stands a fair chance to detect visual anomalies that had not been seen during training due to their being closer to synthetic negatives than to the inliers. Figure 2 shows this on a 2D toy example. The red color corresponds to higher values of the OOD score. The left plot presents the max-softmax baseline [40] which assigns a high OOD score only at the border between the inlier classes. The right plot corresponds to our setup which discourages low OOD scores outside the inlier manifold. Synthetic negatives are denoted with red stars, while inlier classes are colored in blue.
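The mixed-content assembly described above (mask-and-add pasting of a zero-padded patch at a random location) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def paste_negative(x_pos, x_neg):
    """Build a mixed-content image x' = (1 - s) * x+ + s * x- with a
    binary mask s marking the randomly placed negative patch."""
    _, H, W = x_pos.shape
    _, h, w = x_neg.shape
    top = rng.integers(0, H - h + 1)
    left = rng.integers(0, W - w + 1)
    s = np.zeros((1, H, W))
    s[:, top:top + h, left:left + w] = 1.0
    # zero-pad the patch so that pasting reduces to addition
    padded = np.zeros_like(x_pos)
    padded[:, top:top + h, left:left + w] = x_neg
    return (1.0 - s) * x_pos + padded, s

x_pos = rng.random((3, 128, 128))   # inlier crop
x_neg = rng.random((3, 32, 48))     # synthetic negative patch
x_mix, s = paste_negative(x_pos, x_neg)
assert x_mix.shape == x_pos.shape and s.sum() == 32 * 48
```

The returned mask s is exactly what selects between the cross-entropy and the negative loss during training.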

Fig. 2. Softmax-activated discriminative models do not bound the input-space volume with confident predictions (blue region, left). We address this issue by learning a generative normalizing flow for a "negative" distribution that encompasses the training manifold (red stars, right). Training the discriminative model to predict high entropy in the generated synthetic negative samples decreases the confidence outside the inlier manifold (red region, right).

B. Loss in synthetic negative pixels
The loss L_neg has often been designed as the KL divergence between the uniform distribution and the model's predictive distribution [26], [40], [42]. However, our generative model is also subjected to the L_nll loss. Hence, the generated samples occasionally contain parts very similar to chunks of inlier scenes, which leads to confident predictions into a known class. Unfortunately, such predictions incur unbounded penalization by the KL divergence and can disturb the classifier, which is also affected by L_neg. If L_neg overrides L_disc in such pixels, then the classifier may assign high uncertainty to inliers. In that case, the incidence of false positive anomalies would severely increase. We address this problem by searching for a more robust formulation of L_neg.
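The contrast in penalization can be verified numerically. The sketch below uses a hypothetical 10-class example with a near one-hot prediction, the problematic case of a synthetic negative that resembles an inlier:

```python
import numpy as np

def kl(p, q):
    # forward KL divergence (natural log)
    return float(np.sum(p * np.log(p / q)))

def js(p, q):
    # Jensen-Shannon divergence, bounded above by ln 2
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

K = 10
uniform = np.full(K, 1.0 / K)
# a synthetic negative pixel that resembles an inlier: near one-hot prediction
confident = np.full(K, 1e-6)
confident[0] = 1.0 - (K - 1) * 1e-6

# forward KL explodes on confident predictions, JS stays bounded
assert kl(uniform, confident) > 1.0
assert js(uniform, confident) <= np.log(2) + 1e-9
```

The bounded JS penalty is what keeps occasional inlier-like samples from overwhelming the discriminative loss.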
The left part of Figure 3 plots several f-divergences in the two-class setup. We observe that the Jensen-Shannon (JS) divergence mildly penalizes high-confidence predictions, which makes it a suitable candidate for a robust loss. Such behaviour promotes graceful performance degradation in the case of errors of the generative model. The right part of Figure 3 visualizes a histogram of the per-pixel loss while fine-tuning our model on road-driving images. The figure shows that the histogram of the JS divergence has fewer high-loss pixels than the other f-divergence candidates. Long tails of the KL divergences (forward and reverse) indicate a very high loss in pixels that resemble inliers. As hinted before, these pixels give rise to very high gradients with respect to the parameters of the discriminative model. These gradients may override the impact of the standard discriminative loss L_disc, and lead to high-entropy discriminative predictions that disrupt our anomaly score and lead to false positive predictions. Consequently, we formulate L_neg as the JS divergence between the uniform distribution over classes and the softmax output:

L_neg(P_θ(y | x′)_ij) = JS(U ∥ P_θ(y | x′)_ij). (6)

C. Inference

Our inference pipeline consists of two branches. The top branch recovers closed-set predictions through arg-max. The bottom branch recovers the dense OOD map through temperature scaling, softmax and JS divergence with respect to the uniform distribution. Our dense OOD score at every pixel i, j reflects the L_neg loss (6):

s(x)_ij = JS(U ∥ softmax(l_ij / T)). (7)

U stands for the uniform distribution over inlier classes, l represents logits, while T is a temperature hyperparameter. The two branches are fused into the final outlier-aware segmentation map. The OOD map overrides the closed-set prediction wherever the OOD score exceeds a dataset-wide threshold.

Temperature scaling [8] reduces the relative OOD score of distributions with two dominant logits as opposed to distributions with homogeneous non-maximum logits. This discourages false positive OOD responses at semantic borders. We use the same temperature T = 2 in all experimental comparisons with respect to previous methods. Note that our inference is very fast since we use our generative model only to simulate anomalies during training. This is different from image resynthesis [14] and embedding density [12] where the generative model has to be used during inference. Next, we compare the distributional coverage of synthetic negatives generated by a normalizing flow with respect to their GAN-generated counterparts.
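A minimal numpy sketch of this two-branch inference, assuming K = 19 inlier classes and an illustrative threshold value (both hypothetical placeholders):

```python
import numpy as np

T = 2.0   # temperature, as in the paper
K = 19    # number of inlier classes; an assumption of this sketch

def softmax(l, axis=0):
    e = np.exp(l - l.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def js_to_uniform(p, axis=0):
    """JS divergence between the uniform distribution and p along `axis`."""
    k = p.shape[axis]
    u = 1.0 / k
    m = 0.5 * (p + u)
    kl_pm = np.sum(p * np.log(np.maximum(p, 1e-12) / m), axis=axis)
    kl_um = np.sum(u * np.log(u / m), axis=axis)
    return 0.5 * kl_pm + 0.5 * kl_um

def outlier_aware_segmentation(logits, threshold, ood_label=K):
    """Fuse the two branches: the OOD label overrides the closed-set
    prediction wherever the score exceeds the dataset-wide threshold."""
    closed_set = logits.argmax(axis=0)            # top branch: arg-max
    score = js_to_uniform(softmax(logits / T))    # bottom branch: OOD map
    return np.where(score > threshold, ood_label, closed_set)

logits = np.random.default_rng(0).normal(size=(K, 8, 8))
assert outlier_aware_segmentation(logits, threshold=0.1).shape == (8, 8)
```

The only overhead relative to the closed-set model is the per-pixel divergence, which keeps inference essentially as fast as the baseline.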

D. Coverage-oriented generation of synthetic negatives
We provide a theoretical argument that our synthetic negatives provide better distribution coverage than their GAN counterparts [26]. Our argument proceeds by analyzing the gradient of the joint loss with respect to the generator of synthetic negatives for both approaches. For brevity, we omit the spatial locations and loss modulation hyperparameters.
Adversarial outlier-aware learning [26] jointly optimizes the zero-sum game between the generator G_ψ and the discriminator D_ϕ, closed-set classification P_θ, and the confidence objective that enforces uncertain classification in the negative data points [26]:

L(ψ, ϕ, θ) = ∫ p*(x) ln D_ϕ(x) dx + ∫ p_ψ(x) ln(1 − D_ϕ(x)) dx − ∫∫ p*(y, x) ln P_θ(y|x) dy dx + ∫ p_ψ(x) F(P_θ, U) dx. (8)

We denote the true data distribution as p*, while F corresponds to the chosen f-divergence. The gradient of the joint loss (8) w.r.t. the generator parameters ψ vanishes in the first and the third term. The remaining terms enforce that the generated samples fool the discriminator and yield high-entropy closed-set predictions:

∇_ψ L = ∇_ψ ∫ p_ψ(x) ln(1 − D_ϕ(x)) dx + ∇_ψ ∫ p_ψ(x) F(P_θ, U) dx. (9)

However, fooling the discriminator does not imply distributional coverage. In fact, the adversarial objective may cause mode collapse [76] which is detrimental to sample variability. Our joint learning objective (5) optimizes the likelihood of inlier samples, the closed-set classification loss, and low confidence in synthetic negatives:

L(γ, θ) = − ∫ p*(x) ln p_γ(x) dx − ∫∫ p*(y, x) ln P_θ(y|x) dy dx + ∫ p_γ(x) F(P_θ, U) dx. (10)

The gradient of the loss (10) w.r.t. the normalizing flow parameters γ vanishes in the second term. The remaining terms enforce that the generated samples cover all modes of p* and, as before, yield high-entropy discriminative predictions:

∇_γ L = −∇_γ ∫ p*(x) ln p_γ(x) dx + ∇_γ ∫ p_γ(x) F(P_θ, U) dx. (11)

The resulting gradient entices the generative model to produce samples along the border of the inlier distribution. Consequently, we say that our synthetic negatives are coverage-oriented. The presented analysis holds for any generative model that optimizes the density of the training data. Experimental evaluations in the following sections provide conclusive empirical confirmation for the advantages of synthetic negatives generated by a normalizing flow (cf. Table VIII).

IV. EXPERIMENTAL SETUP
This section describes our experimental setup for dense out-of-distribution detection. We review the employed datasets, introduce performance metrics, and describe the training details.

A. Benchmarks and Datasets
Benchmarks for dense OOD detection in road-driving scenes have experienced substantial progress in recent years (cf. Figure 5). In parallel, significant effort has been invested into artificial datasets by leveraging simulated environments [58], [77]. Similarly, remote-sensing segmentation datasets have grown both in size and complexity [16]. Early benchmarks crafted anomalous scenes by pasting negative objects at arbitrary image locations. This was improved by carefully choosing pasting locations and postprocessing [12]. Recent work ensures outliers match the environment by selecting real-world scenes [11].
WD-Pascal [78] has been created by pasting Pascal VOC [1] objects into WildDash [9] images. The resulting dataset allows for evaluating outlier detection in demanding conditions. However, the random pasting policy disturbs the scene layout as shown in Figure 5 (left). Consequently, there is a concern that such anomalies may be easier to detect.
Fishyscapes [12] evaluates the model's ability to detect outliers in urban driving scenarios. The benchmark consists of two datasets: FS LostAndFound and FS Static. FS LostAndFound is a small subset of the original LostAndFound [79] which contains small objects on the roadway (e.g. toys, boxes or car parts that could fall off). FS Static contains Cityscapes validation images overlaid with Pascal VOC objects. The objects are positioned according to the perspective and further postprocessed to obtain smoother OOD injection.
SegmentMeIfYouCan (SMIYC) [11] quantifies dense outlier detection performance in multiple setups. The benchmark consists of three datasets: AnomalyTrack, ObstacleTrack and LostAndFound [79]. AnomalyTrack provides large anomalous objects which are fully aligned with the environment. For instance, it features a leopard in the middle of a dirt road as shown in Fig. 5 (right). LostAndFound [79] tests detection of small hazardous objects (e.g. boxes, toys, car parts) in urban scenes. Finally, ObstacleTrack tests detection of small objects on various road types. Inconsistent road surfaces can trick the detector and increase the false positive rate. ObstacleTrack and LostAndFound measure OOD detection performance solely on the driving surface while AnomalyTrack considers the detection across the whole image. Consequently, SMIYC provides a solid notion of the OOD segmentation performance of a model deployed in the wild.
StreetHazards [58] is a synthetic dataset created with the CARLA simulator. The dataset captures a simulated urban environment with carefully inserted anomalous objects (e.g. a horse carriage or a helicopter on the road). Simulating anomalies in virtual environments is appealing due to high flexibility in positioning and appearance, as well as the low cost of data accumulation. Unfortunately, there is a notable quality mismatch between simulated environments and the real world. Still, this approach has great potential for evaluating outlier-aware segmentation due to cheap ground truth with K+1 classes. We use StreetHazards for measuring outlier-aware segmentation performance according to open-mIoU [80].
BSB [16] is a remote sensing dataset with aerial images of Brasilia. It contains 3400 labeled images of 512×512 pixels. The official split designates 3000 train, 200 validation and 200 test images. The labels include 3 stuff classes (street, permeable area, and lake) and 11 thing classes (e.g. swimming pool, vehicle, sports court). We extract boat and harbour into the OOD test set. The resulting BSB-OOD dataset contains 2840 training images with 12 inlier classes, while the OOD test set contains 184 images. This setup is similar to [28], [58], [81], which also select a subset of classes as OOD samples. Note that there are other remote sensing datasets such as Vaihingen and Potsdam from the International Society for Photogrammetry and Remote Sensing (ISPRS). However, these datasets have fewer labels and an order of magnitude fewer images. Also, the So2Sat LCZ42 dataset [82] contains only small-resolution images and image-level labels. Hence, we opt for the larger dataset and better performance estimates.

B. Metrics
We measure OOD segmentation performance using average precision (AP) [1], the false-positive rate at a true-positive rate of 95% (FPR95) [40], and AUROC. AP is well suited for measuring OOD detection performance since it emphasizes the minority class [11], [12], [83]. A perfect OOD detector would have an AP equal to one. Likewise, FPR95 is significant for real-world applications since high false-positive rates would require a large number of human interventions in practical deployments and therefore severely diminish the practical value of an autonomous system. We measure outlier-aware segmentation performance by open-mIoU [80]. Open-mIoU penalizes outliers being recognized as inliers and inliers being wrongly detected as outliers. Compared to mIoU over K+1 classes, open-mIoU does not count true positive outlier predictions and averages over K instead of K+1 classes. The open-mIoU performance of an outlier-aware segmentation model with ideal OOD detection would be equal to the closed-set mIoU of the same model. Hence, the difference between the two metrics quantifies the performance gap caused by the presence of outliers [80].
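FPR95 can be computed by choosing the score threshold that retains 95% of the true outlier pixels and measuring how many inlier pixels exceed it; a sketch on synthetic scores (all values hypothetical):

```python
import numpy as np

def fpr_at_95_tpr(scores_out, scores_in):
    """FPR95: fraction of inlier pixels flagged as anomalous when the
    threshold is lowered until 95% of true outlier pixels are detected."""
    threshold = np.percentile(scores_out, 5)   # keep 95% of outliers above it
    return float(np.mean(scores_in >= threshold))

rng = np.random.default_rng(0)
scores_in = rng.normal(0.0, 1.0, 10_000)    # OOD scores on inlier pixels
scores_out = rng.normal(4.0, 1.0, 1_000)    # OOD scores on outlier pixels
fpr95 = fpr_at_95_tpr(scores_out, scores_in)
assert 0.0 <= fpr95 <= 1.0
```

Well-separated score distributions, as in this toy example, yield a low FPR95; overlapping distributions push it towards one.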

C. Implementation details
All our models are based on Ladder DenseNet-121 (LDN-121) due to memory efficiency and fast experimentation [84]. However, our framework can accommodate any other dense prediction architecture. All our experiments consist of two training stages. In both stages we utilize Cityscapes [85], Vistas [86] and WildDash 2 [9]. These three datasets contain 25 231 images. The images are resized to 1024 pixels (shorter edge), randomly flipped with a probability of 0.5, randomly rescaled by a factor from the interval [0.5, 2], and randomly cropped to 768 × 768 pixels. We optimize our models with Adam. In the first stage we train for 25 epochs without synthetic negatives. We use batch size 16 as validated in previous work [84]. The starting learning rate is set to 10^-4 for the feature extractor and 4 × 10^-4 for the upsampling path. The learning rate is annealed according to a cosine schedule towards the minimal value of 10^-7 which would have been reached in the 50th epoch.
In the second stage, we train for 15 epochs on mixed-content images (cf. Section III-A). In this stage, we use a batch size of 12 due to limited GPU memory.
We did not use gradient accumulation due to the batch normalization layers. Instead, we opted for gradient checkpointing [84], [87], [88]. The initial learning rate is set to 1 × 10^-5 for the upsampling path and 2.5 × 10^-6 for the backbone. Once more, the learning rate is decayed according to the cosine schedule to the value of 10^-7. We set the hyperparameter λ to 3 × 10^-2. This value is chosen so that the closed-set segmentation performance is not reduced.
We generate rectangular synthetic samples with dimensions sampled from U(16, 216) by leveraging DenseFlow-25-6 [74]. The flow is pre-trained on random 64 × 64 crops from Vistas. We train the flow with the Adamax optimizer with the learning rate set to 10^-6. In the case of WD-Pascal, we train our model only on Vistas in order to achieve a fair comparison with the previous work [21]. In the case of StreetHazards, we train on the corresponding train subset for 80 epochs on inlier images and 40 epochs on mixed-content images. In the case of Fishyscapes, we train exclusively on Cityscapes. We train for 150 epochs during stage 1 (inliers) and 50 epochs during stage 2 (mixed content). In the case of the BSB-OOD dataset, we train LDN-121 for 150 epochs with a batch size of 16 on inlier images and then fine-tune on mixed-content images for 40 epochs. We sample synthetic negatives with dimensions from U(16, 64). The flow was pre-trained on 32 × 32 random inlier crops of BSB-OOD images for 2k epochs with a batch size of 256. All other hyperparameters are kept constant across all experiments. Each experiment lasts for approximately 38 hours on a single NVIDIA RTX A5000.

V. EXPERIMENTAL EVALUATION
We evaluate the OOD detection performance of NFlowJS on road-driving scenes and aerial images. Experiments on road-driving images suggest that our synthetic negatives can deliver performance comparable to real negatives (Sec. V-A). Furthermore, our synthetic negatives become the method of choice in setups with a large domain gap towards candidate datasets for sourcing real negative training samples (Sec. V-B).
We compare our performance with respect to contemporary methods which require neither a negative dataset nor image resynthesis. Still, we list all methods in our tables so that we can discuss our method in a broader context. We also analyze the sensitivity of our method with respect to the distance of the OOD object from the camera. Finally, we measure the computational overhead of our method with respect to the baseline and visualize our synthetic samples.
A. Dense out-of-distribution detection in road-driving scenes
Table I presents performance on WD-Pascal averaged over 50 runs [21]. All methods have been trained on the Vistas dataset and achieve similar mIoU performance. Column Aux data indicates whether the method trains on real negative data. We choose ADE20k for this purpose since it offers instance-level ground truth. The bottom section compares our method with early approaches: MC dropout [18], ODIN [32], and max-softmax [40]. These approaches are not competitive with the current state of the art. The top section shows that training with auxiliary negative data can significantly improve performance. However, our method closes the performance gap. It outperforms all other methods in the FPR95 and AUROC metrics while achieving competitive AP. Table II presents performance evaluation on SMIYC [11] and Fishyscapes [12]. Our method outperforms all previous methods on AnomalyTrack and ObstacleTrack as well as on LAF-noKnown. We achieve these results despite refraining from image resynthesis [14], [19], [20], partial image reconstruction [15], and training on real negative images [12]. Our method achieves very low FPR95 (less than 1%) on ObstacleTrack and LostAndFound-noKnown. This is especially important for real-world applications where a high incidence of false positives may make OOD detection useless. Note that ObstacleTrack includes small obstacles in front of a variety of road surfaces, which makes it extremely hard not to misclassify road parts as anomalies. Moreover, this dataset includes low-visibility images captured at dusk and other challenging evaluation setups. Our synthetic negative data also achieve competitive performance on FS LostAndFound. Our method outperforms others in terms of FPR95 while achieving the second-best AP. We slightly underperform only with respect to SynBoost, which trains on real negative data and precludes real-time inference due to image resynthesis. In the case of the FS Static dataset, our method achieves the best FPR95 and the second-best AP among the methods which do not train on auxiliary data.
We have also applied our method to a pre-trained third-party closed-set model and submitted the results to the Fishyscapes benchmark. We have chosen a popular DeepLabV3+ model which achieves high performance due to training on unlabeled video data [91]. This choice promotes fair comparison, since the same model has also been used in several other benchmark submissions [15], [90]. Please note that we use parameters which have not been trained on Cityscapes val in order to allow fair evaluation on FS Static. This result clearly shows that our method can also be applied to third-party models and deliver strong results.
Figure 6 shows qualitative performance on two sequences of images from SMIYC LostAndFound. Road ground truth is designated in grey and the detected obstacles in yellow. The top sequence contains obstacles which change position through time. The bottom sequence contains multiple anomalous objects. Our method succeeds in detecting a toy car and cardboard boxes even though no such objects were present during training. Column 1 contains distant obstacles, so please zoom in for better visibility. Table III shows OOD detection and outlier-aware semantic segmentation on StreetHazards. We produce outlier-aware semantic predictions by correcting closed-set predictions with our dense OOD map (Sec. III-C). We validate the OOD threshold in order to achieve TPR=95% [80] and measure performance according to mIoU over K+1 classes as well as with open-mIoU [80]. To the best of our knowledge, our method outperforms all previous work. In particular, our method is better than methods which utilize auxiliary negative datasets [21], [33], [92] and the method based on image resynthesis [49]. We note that there is still a significant performance degradation in the presence of outliers. Closed-set performance is more than 65% mIoU, while outlier-aware performance peaks at 45%. Future research should strive to close this gap in order to provide safer segmentation in the wild. We implemented [32], [33], [92], [95] in our codebase according to the official implementations. For energy fine-tuning, we have conducted a hyperparameter search as suggested in [92]: m_in ∈ {-15, -23, -27} and m_out ∈ {-5, -7}. The optimal values for the dense setup are m_in = -15 and m_out = -5. We have validated ReAct [95] for c ∈ {0.9, 0.95, 0.99}. The best results are obtained with c = 0.99.
Figure 7 compares the outlier-aware semantic segmentation performance of the proposed method with the max-logit baseline [58] on StreetHazards. Anomalous pixels are designated in cyan. Our method reduces the number of false positives. However, safe and accurate outlier-aware segmentation is still an open problem.

B. Dense out-of-distribution detection in remote sensing
We compare our method with standard baselines for OOD detection [33], [40], [92] as well as with methods specifically developed for OOD detection in remote sensing imagery [17], [25]. Table IV shows the performance on the BSB-aerial-OOD dataset [16]. Some methods train on real negative data (cf. Aux data). The top section presents several OOD detection baselines. We observe that training with real negative samples outperforms the MSP baseline [40] but underperforms with respect to our synthetic samples. This is not surprising

Fig. 7. Outlier-aware segmentation on StreetHazards (RGB input, max-logit, NFlowJS, ground truth). The detected outliers are marked with cyan. Our method reduces the number of false positives over the max-logit baseline. Still, further research is required in order to achieve closed-set performance in the presence of outliers.
since the pasted negative instances involve a different camera perspective than aerial imagery. The middle section presents methods that are explicitly designed for aerial images. Morph-OpenPixel (MOP) [17] erodes the prediction confidence at object boundaries with morphological filtering. Morphological filtering improves FPR95 but impairs AP with respect to the MSP baseline. DPN− [25] achieves runner-up AUROC and FPR95 performance. The bottom part shows the performance of our model. JSDiv is the same as NFlowJS except that it uses negatives from ADE20k instead of synthetic ones. NFlowJS generates dataset-specific negatives along the border between the known and the unknown. NFlowJS outperforms methods which train on real negative data, indicating that synthetic negatives may be the method of choice when an appropriate negative dataset is unavailable.

C. Sensitivity of OOD detection to depth
Self-driving applications challenge us to detect anomalies as early and as far away as possible. However, distant anomalies are harder to detect since they are represented with fewer pixels. We analyze the influence of depth on dense OOD detection on the LostAndFound dataset [79]. The LAF test set consists of 1203 images with the corresponding pixel-level disparity maps and calibration parameters of the stereo rig. Due to limitations of the available disparity, we perform the analysis in the range from 5 to 50 meters. Figure 9 shows histograms of inlier and outlier pixels. More than 60% of anomalous pixels are closer than 15 meters. Hence, the usual metrics (AP and FPR95) are biased towards closer ranges. As we further demonstrate, many methods fail to detect anomalies at larger depths. We compare our method with the max-logit (ML) and max-softmax [40] baselines, ODIN [32], SynBoost [19] and the OOD head [21]. Table V shows that our method produces a low false positive rate even at large distances. For example, at distances greater than 20 meters we outperform the others by a wide margin. This finding is consistent with Fig. 6 which shows accurate detection of anomalies at larger distances.

D. Inference speed
A convenient dense OOD detector should not drastically increase the already heavy computational burden of semantic segmentation. Hence, we measure the computational overhead of our method and compare it with other approaches. We measure the inference speed on an NVIDIA RTX 3090 for 1024 × 2048 inputs. Table VI shows that SynBoost [19] and SynthCP [49] are not applicable for real-time inference due to significant overhead. The baseline model LDN-121 [84] achieves near real-time inference for two-megapixel images (46.5 ms, 21.5 FPS). ODIN [32] requires an additional forward-backward pass in order to recover the gradients of the loss with respect to the image. This results in a 3-fold slow-down with respect to the baseline. Similarly, MC dropout [18] requires K forward passes for prediction with K MC samples. This results in a 45.8 ms overhead when K=2. In contrast, our approach increases the inference time by only 7.8 ms with respect to the baseline while outperforming all previous approaches. The SynthCP measurements are taken from [90].
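As a sanity check, the reported latencies convert to throughput as follows (the 18.4 FPS figure for our method is derived from the reported numbers, not quoted from Table VI):

```python
def fps(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms

baseline_ms = 46.5                  # LDN-121 on 1024 x 2048 inputs
assert round(fps(baseline_ms), 1) == 21.5   # matches the reported 21.5 FPS
ours_ms = baseline_ms + 7.8         # our method adds 7.8 ms of overhead
assert round(fps(ours_ms), 1) == 18.4       # still near real time
```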

E. Visualization of synthetic outliers
Our method is able to generate samples at multiple resolutions with the same normalizing flow. The generated samples have limited variety when compared to a typical negative dataset such as ImageNet or COCO [21], [50]. Still, training with them greatly reduces overconfidence since the model is explicitly trained to produce uncertain predictions in outliers.

F. Synthetic negatives and the separation in the feature space
Up to now, we have considered softmax-activated models. However, softmax can assign arbitrarily large probabilities regardless of the distance from the closest training datum in the feature space [55], which fails to bound the open-space risk [51], [62]. We analyze the usefulness of synthetic negatives for open-set recognition by considering a popular baseline denoted as max-logit [57]-[59]. The max-logit value is proportional to the projection of the feature vector of a given sample onto the closest class prototype vector. This value can be thresholded to bound the open-space risk [62], [96].
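The max-logit score can be illustrated per pixel as the negated maximum logit (a NumPy sketch; the sign convention is chosen so that higher scores indicate anomalies):

```python
import numpy as np

def max_logit_score(logits):
    """Per-pixel OOD score of the max-logit baseline. A low maximum
    logit means the feature projects weakly onto every class
    prototype, so we negate it to make higher = more anomalous.
    `logits` has shape (K, H, W)."""
    return -logits.max(axis=0)

logits = np.array([[[9.0, 0.2]],    # class 0 logits, shape (2, 1, 2)
                   [[1.0, 0.1]]])   # class 1 logits
score = max_logit_score(logits)
assert score[0, 0] < score[0, 1]    # the confident pixel scores lower
```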
The left column of Figure 12 shows histograms of max-logit values for known and unknown pixels on Fishyscapes val. The right column shows the same histograms after fine-tuning with our synthetic negative samples. The figure shows that training with our synthetic negatives increases the separation between known and unknown pixels in the feature space and improves the AUROC score. Similar effects have been reported after training with real negative data [60], [61]. However, as argued before, our approach avoids the bias towards test anomalies that are related to the training data. Furthermore, it offers a strong alternative for non-standard domains, as shown in Table IV. Hence, the proposed method appears to be a promising component of future approaches for dense open-set recognition.

VI. ABLATIONS
We ablate the impact of the loss in negative pixels, the choice of the generative model, the impact of pre-training, as well as the impact of temperature scaling on dense OOD detection.

A. Impacts of the loss function and OOD score
Table VII analyzes the impact of the loss function L_neg and the OOD score s_δ on AnomalyTrack val and ObstacleTrack val. The two chosen datasets feature large and small anomalies, respectively. We separately validate the modulation factor λ for each choice of the negative loss, as well as the temperature parameter. We use T=10 for max-softmax and T=2 for divergence-based scoring functions. We report average performance over the last three epochs. Row 1 shows the standard setting with KL divergence as L_neg and max-softmax as the OOD score [26], [33]. Row 2 uses KL divergence both as the loss function and as the OOD score. Row 3 features the reverse KL divergence. Minimizing the reverse divergence between the uniform distribution and the softmax distribution is equivalent to maximizing the softmax entropy [50]. Rows 4 and 5 feature the JS divergence. We observe that the reversed KL divergence outperforms the standard KL-MSP setup in 3 out of 4 metrics. However, JS divergence substantially outperforms all alternatives, both as the loss function (JSD-MSP vs KL-MSP) and as the OOD score (JSD-JSD vs JSD-MSP and RKL-RKL). We explain this advantage with the robust response in synthetic outliers which resemble inliers, as well as with the improved consistency between training and scoring (cf. sections III-B and III-C).

B. Choice of the generative model
Table VIII compares synthetic negative data generated by a normalizing flow with synthetic negative data generated by a GAN [26] and synthetic negative pre-logit features generated by a GMM [44]. Interestingly, training on synthetic OOD features produced by the GMM achieves better average precision than synthetic negative images generated by the GAN. Still, generating synthetic negatives with a normalizing flow outperforms both GAN images and GMM features. This advocates for the advantages of maximum likelihood over adversarial training for the generation of synthetic negatives, as described in Sec. III-D.
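The shared criterion can be sketched as the per-pixel JS divergence between the temperature-scaled softmax and the uniform distribution. The following NumPy sketch is our illustration of the idea, not the authors' implementation; note that outlier pixels are trained towards uniform predictions, so a low divergence indicates an anomaly:

```python
import numpy as np

def jsd_to_uniform(logits, T=1.0, eps=1e-12):
    """Per-pixel Jensen-Shannon divergence between the
    temperature-scaled softmax distribution and the uniform
    distribution over K classes. Serves as L_neg on negative
    pixels; since the model is trained towards uniform
    predictions in outliers, low values flag anomalies.
    `logits` has shape (K, H, W)."""
    K = logits.shape[0]
    z = logits / T
    z = z - z.max(axis=0, keepdims=True)           # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    u = np.full_like(p, 1.0 / K)
    m = 0.5 * (p + u)
    kl_pm = (p * np.log((p + eps) / (m + eps))).sum(axis=0)
    kl_um = (u * np.log((u + eps) / (m + eps))).sum(axis=0)
    return 0.5 * (kl_pm + kl_um)

# A confident pixel diverges strongly from uniform,
# an uncertain (outlier-like) pixel does not.
logits = np.zeros((4, 1, 2))
logits[0, 0, 0] = 100.0            # peaked softmax at pixel 0
jsd = jsd_to_uniform(logits, T=2.0)
assert jsd[0, 0] > jsd[0, 1]
assert jsd[0, 1] < 1e-6            # uniform pixel: divergence near 0
```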

C. Impact of pre-training
Table IX explores the impact of pre-training on OOD detection performance. Row 1 shows the performance when neither the generative nor the discriminative model is trained prior to the joint training (Section III-A). In this case, we jointly train both models from their random initializations. Row 2 reveals that discriminative pre-training improves OOD detection. Introducing the synthetic negatives after discriminative pre-training improves generalization. Row 3 shows that pre-training both models generalizes even better.

D. Impact of temperature scaling
Table X shows the impact of softmax recalibration on OOD detection. The table explores three different temperatures. We observe that temperature scaling significantly improves Jensen-Shannon scoring. We also note that utilizing RealNVP [27] instead of DenseFlow [74] decreases OOD detection performance.

VII. CONCLUSION
We have presented a novel method for dense OOD detection and outlier-aware semantic segmentation. Our method trains on mixed-content images obtained by pasting synthetic negative patches into training images. We produce synthetic negatives by sampling a generative model which is jointly trained to maximize the likelihood and to give rise to uniform predictions at the far end of the discriminative model. Such collaborative learning leads to conservative outlier-aware predictions which are suitable for OOD detection and outlier-aware semantic segmentation.
We extend previous work with the following consolidated contributions. First, we replace the adversarial generative model (GAN) with a normalizing flow. We believe that the resulting improvement is due to better coverage of the training distribution. Second, we extend the collaborative training setup to dense prediction. Generative flows are especially well suited for this task due to straightforward generation at different resolutions. Third, we improve the performance by pre-training the normalizing flow and the discriminative model prior to joint training. Fourth, we propose to use JS divergence as a robust criterion for training a discriminative model with synthetic negatives. We also show that the same criterion can be used as a principled and competitive replacement for ad-hoc scoring functions such as max-softmax.
We have evaluated the proposed method on standard benchmarks and datasets for dense OOD detection and outlier-aware segmentation. The results indicate a significant advantage with respect to all previous approaches on the majority of the datasets from two different domains. The advantage becomes substantial in the case of non-standard domains with few suitable auxiliary datasets for sampling real negative data. Additionally, we demonstrate the great potential of our method for real-world deployment due to minimal computational overhead. Suitable avenues for future work include extending our method to setups with bounded open-set risk and to other dense prediction tasks.

Fig. 1. The proposed training setup. The normalizing flow generates the synthetic negative patch x− which we paste atop the raw inlier image. The resulting mixed-content image x′ is fed to the dense classifier which is trained to discriminate inlier pixels (L_cls) and to produce uniform predictions in negative pixels (L_neg). This formulation enables gradient flow from L_neg to the normalizing flow while maximizing the likelihood of inlier patches (L_gen).

Fig. 3. Left: f-divergences towards the uniform distribution in a two-class setup. Jensen-Shannon offers the most robust response. Right: histograms of λ·L_neg in synthetic negatives at the beginning of joint fine-tuning. The modulation factors λ have been separately validated for each of the three choices of L_neg. The Jensen-Shannon divergence produces a more uniform learning signal than the other f-divergences and avoids high variance of L_neg.

Figure 4 summarizes inference according to the proposed method for outlier-aware semantic segmentation. The input image is fed into the discriminative model. The produced logits are fed into two branches. The top branch delivers closed-set predictions, while the bottom branch recovers the dense OOD map.

Fig. 4. Dense outlier-aware inference. We infer dense logits with a closed-set model. We recover the dense OOD map according to our divergence-based score (JSD). Closed-set predictions are overridden in the outlier-aware output wherever the OOD score exceeds the threshold δ.

Fig. 5. Development of dense OOD detection in road driving through time. Early work pastes objects at random locations [78]. This was improved by carefully choosing pasting locations and post-processing [12]. Recent work ensures outliers match the environment by selecting real-world scenes [11].

Fig. 6. OOD detection on the LostAndFound dataset. Our method can detect obstacles at different distances from the camera (top) as well as multiple obstacles in one image (bottom). Road ground truth is designated in grey and the predicted OOD pixels in yellow. Please zoom in to see the distant obstacles.

Figure 8 visualizes our performance on the BSB-aerial-OOD dataset. The left column shows the input images. The center column shows the OOD objects: a harbour and boats. The right column shows that NFlowJS delivers a well-aligned OOD score.

Figure 10 shows synthetic outliers generated by our normalizing flow after joint training on aerial images.

Figure 11 shows samples of a normalizing flow after joint training on road-driving scenes. Comparison with Figure 10 reveals that the appearance of our synthetic negative samples strongly depends on the underlying inlier dataset. Samples from Figure 10 resemble lakes and forests, while samples from Figure 11 resemble road, sky, cars and buildings. These observations do not come as a surprise since our normalizing flows are trained to generate data points along the border of the inlier distribution (cf. Fig. 2). In other words, our method patches the open-space risk of a particular segmentation model by adapting the synthetic negative data to the training dataset.

Fig. 11. Samples of DenseFlow-25-6 after joint training on road-driving images as proposed in Section III-A.

Fig. 12. Training on synthetic negative data improves the separation between test inliers and test outliers in the feature space.

TABLE II
DENSE OUT-OF-DISTRIBUTION DETECTION PERFORMANCE ON SEGMENTMEIFYOUCAN AND FISHYSCAPES. Cityscapes val. We do not show these results in Table II in order to keep the same model across all assays.

TABLE IV
PERFORMANCE EVALUATION ON IMAGES FROM BSB-OOD.

TABLE V
ANALYSIS OF FPR95 AT VARIOUS DISTANCES FROM THE CAMERA.

TABLE VI
COMPARISON OF INFERENCE SPEED ON 2 MPIX IMAGES AND RTX 3090.

TABLE VII
VALIDATION OF THE LOSS IN NEGATIVE PIXELS AND THE OOD SCORE.

TABLE VIII
IMPACT OF THE GENERATIVE MODEL ON OOD DETECTION PERFORMANCE.