Unsupervised Class Generation to Expand Semantic Segmentation Datasets
Abstract
1. Introduction
- We present a method that leverages the generative capabilities of Stable Diffusion alongside SAM’s accurate semantic masking to create high-quality synthetic objects with their corresponding segmentation masks (a minimal code sketch of this step follows the list).
- We demonstrate the practical applicability of our approach by successfully expanding existing synthetic datasets with additional classes without requiring architectural modifications to semantic segmentation methods.
- We show that models can effectively learn these newly generated classes within unsupervised domain adaptation (UDA) pipelines, achieving performance comparable to the original classes in the dataset.
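To make the generation step concrete, here is a minimal sketch of how Stable Diffusion and SAM can be chained to obtain an object and its mask. It is illustrative rather than the paper’s exact pipeline: the prompt, the checkpoint names, and the center-point prompting heuristic are assumptions.

```python
# Minimal sketch: generate an object with Stable Diffusion, then mask it with SAM.
# Assumes the `diffusers` and `segment_anything` packages, a CUDA device, and a
# locally downloaded SAM checkpoint (paths and prompt below are illustrative).
import numpy as np
import torch
from diffusers import StableDiffusionPipeline
from segment_anything import SamPredictor, sam_model_registry

# 1) Generate a candidate image of the new class.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe("a photo of a bus, side view, plain background").images[0]
rgb = np.array(image)  # HxWx3 uint8, RGB

# 2) Segment it with SAM, prompting with the image center
#    (generated objects tend to be centered in the frame).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(rgb)
h, w = rgb.shape[:2]
masks, scores, _ = predictor.predict(
    point_coords=np.array([[w // 2, h // 2]]),
    point_labels=np.array([1]),  # 1 marks a foreground point
    multimask_output=True,
)
mask = masks[np.argmax(scores)]  # keep SAM's highest-scoring proposal

# 3) Crop the object and its binary mask for later compositing.
ys, xs = np.where(mask)
object_crop = rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
object_mask = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
```

In the full pipeline, candidate masks are additionally curated (Section 2.2) before the objects are injected into training images.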
1.1. Related Work
1.1.1. Generating Synthetic Data
1.1.2. Exploiting Synthetic Data
1.1.3. Segment Anything Model
2. Methods
2.1. Pipeline Definition
2.2. Mask Curation
2.3. Including Novel Classes in Unsupervised Domain Adaptation Pipelines for Semantic Segmentation
3. Results
3.1. Experimental Setup
3.2. Including New Classes in Datasets
3.3. Ablation Tests
3.3.1. Impact of Appearance Rate
3.3.2. Mask Filtering Evaluation
4. Discussion
Future Work
- Applying the proposed pipeline as an adversarial sample generator for hard classes, i.e., those where segmentation models perform poorly, in order to improve model robustness.
- Extending the method to enable unsupervised generation of complete synthetic datasets by composing multiple class instances into randomized scenes, following the principles of domain randomization [36] (a compositing sketch follows this list).
- Measuring how this technique can help with rare or under-represented classes by creating new samples to increase their intra-class variability and address class imbalance in long-tail distributions.
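As a rough illustration of the compositing mentioned above, the sketch below pastes a generated (crop, mask) pair into a scene at a random position and writes a new class id into the label map, so a downstream UDA pipeline can treat the class as part of the dataset. The function name, the in-place updates, and the example class id are assumptions, not the paper’s exact procedure.

```python
import numpy as np

def paste_object(scene_rgb, label_map, crop, mask, class_id, rng=None):
    """Paste one generated object into a scene and update its label map.

    scene_rgb: HxWx3 uint8 synthetic image.
    label_map: HxW integer array of per-pixel class ids.
    crop, mask: object pixels and boolean mask from the generation step.
    class_id: id assigned to the new class (illustrative).
    """
    rng = rng or np.random.default_rng()
    H, W = label_map.shape
    h, w = mask.shape
    assert h <= H and w <= W, "object must fit inside the scene"
    # Random top-left corner such that the whole object fits in the scene.
    y = int(rng.integers(0, H - h + 1))
    x = int(rng.integers(0, W - w + 1))
    region = (slice(y, y + h), slice(x, x + w))
    scene_rgb[region][mask] = crop[mask]   # copy only the object pixels
    label_map[region][mask] = class_id     # annotate them with the new class
    return scene_rgb, label_map

# Example: insert a "train" instance using the Cityscapes train id (16; illustrative).
# scene, labels = paste_object(scene, labels, object_crop, object_mask, class_id=16)
```

Repeating this for several generated instances per scene would yield the kind of randomized compositions that domain randomization advocates.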
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
UDA | Unsupervised domain adaptation
SAM | Segment Anything Model
References
- Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3212–3223. [Google Scholar] [CrossRef]
- Ros, G.; Sellart, L.; Materzynska, J.; Vazquez, D.; Lopez, A.M. The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3234–3243. [Google Scholar]
- Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for Data: Ground Truth from Computer Games. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 102–118. [Google Scholar] [CrossRef]
- Schwonberg, M.; Niemeijer, J.; Termöhlen, J.A.; Schäfer, J.P.; Schmidt, N.M.; Gottschalk, H.; Fingscheidt, T. Survey on Unsupervised Domain Adaptation for Semantic Segmentation for Visual Perception in Automated Driving. IEEE Access 2023, 11, 54296–54336. [Google Scholar] [CrossRef]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752. [Google Scholar]
- Jia, Y.; Hoyer, L.; Huang, S.; Wang, T.; Van Gool, L.; Schindler, K.; Obukhov, A. DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control. arXiv 2023, arXiv:2312.03048. [Google Scholar]
- Hoyer, L.; Dai, D.; Van Gool, L. DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9924–9935. [Google Scholar]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4015–4026. [Google Scholar]
- Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An Open Urban Driving Simulator. In Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; pp. 1–16. [Google Scholar]
- Rong, G.; Shin, B.H.; Tabatabaee, H.; Lu, Q.; Lemke, S.; Možeiko, M.; Boise, E.; Uhm, G.; Gerow, M.; Mehta, S.; et al. LGSVL Simulator: A High Fidelity Simulator for Autonomous Driving. In Proceedings of the 2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), Rhodes, Greece, 20–23 September 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1–6. [Google Scholar]
- Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Proceedings of the Field and Service Robotics: Results of the 11th International Conference, Zurich, Switzerland, 12–15 September 2017; Springer: Berlin/Heidelberg, Germany, 2018; pp. 621–635. [Google Scholar]
- Xiao, A.; Huang, J.; Guan, D.; Zhan, F.; Lu, S. Transfer Learning from Synthetic to Real LiDAR Point Cloud for Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 22 February–1 March 2022; Volume 36, pp. 2795–2803. [Google Scholar]
- Wu, W.; Zhao, Y.; Chen, H.; Gu, Y.; Zhao, R.; He, Y.; Zhou, H.; Shou, M.Z.; Shen, C. DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models. Adv. Neural Inf. Process. Syst. 2023, 36, 54683–54695. [Google Scholar]
- Hoyer, L.; Dai, D.; Van Gool, L. HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 372–391. [Google Scholar]
- Hoyer, L.; Dai, D.; Wang, H.; Van Gool, L. MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. DACS: Domain Adaptation via Cross-domain Mixed Sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; pp. 1378–1388. [Google Scholar]
- Nam, K.D.; Nguyen, T.M.; Dieu, T.V.; Visani, M.; Nguyen, T.O.; Sang, D.V. A Novel Unsupervised Domain Adaption Method for Depth-Guided Semantic Segmentation Using Coarse-to-Fine Alignment. IEEE Access 2022, 10, 101248–101262. [Google Scholar] [CrossRef]
- Wu, Y.; Hong, M.; Li, A.; Huang, S.; Liu, H.; Ge, Y. Self-Supervised Adversarial Learning for Domain Adaptation of Pavement Distress Classification. IEEE Trans. Intell. Transp. Syst. 2024, 25, 1966–1977. [Google Scholar] [CrossRef]
- Zhang, W.; Wang, J.; Wang, Y.; Wang, F.Y. ParaUDA: Invariant Feature Learning with Auxiliary Synthetic Samples for Unsupervised Domain Adaptation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 20217–20229. [Google Scholar] [CrossRef]
- Gao, L.; Zhang, J.; Zhang, L.; Tao, D. DSP: Dual Soft-Paste for Unsupervised Domain Adaptive Semantic Segmentation. In Proceedings of the ACM International Conference on Multimedia (MM), Chengdu, China, 20–24 October 2021; pp. 2825–2833. [Google Scholar] [CrossRef]
- Yan, H.; Li, Z.; Wang, Q.; Li, P.; Xu, Y.; Zuo, W. Weighted and Class-Specific Maximum Mean Discrepancy for Unsupervised Domain Adaptation. IEEE Trans. Multimed. 2019, 22, 2420–2433. [Google Scholar] [CrossRef]
- Yan, H.; Ding, Y.; Li, P.; Wang, Q.; Xu, Y.; Zuo, W. Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 945–954. [Google Scholar] [CrossRef]
- Fan, Q.; Shen, X.; Ying, S.; Du, S. OTCLDA: Optimal Transport and Contrastive Learning for Domain Adaptive Semantic Segmentation. IEEE Trans. Intell. Transp. Syst. 2024, 1–13. [Google Scholar] [CrossRef]
- Devika, A.K.; Sanodiya, R.K.; Jose, B.R.; Mathew, J. Visual Domain Adaptation through Locality Information. Eng. Appl. Artif. Intell. 2023, 123, 106172. [Google Scholar] [CrossRef]
- Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to Adapt Structured Output Space for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7472–7481. [Google Scholar] [CrossRef]
- Vu, T.H.; Jain, H.; Bucher, M.; Cord, M.; Pérez, P. ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2512–2521. [Google Scholar] [CrossRef]
- Marcos-Manchón, P.; Alcover-Couso, R.; SanMiguel, J.C.; Martínez, J.M. Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 9242–9252. [Google Scholar]
- Tang, R.; Liu, L.; Pandey, A.; Jiang, Z.; Yang, G.; Kumar, K.; Stenetorp, P.; Lin, J.; Ture, F. What the DAAM: Interpreting Stable Diffusion Using Cross Attention. arXiv 2022, arXiv:2210.04885. [Google Scholar]
- Krähenbühl, P.; Koltun, V. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. Adv. Neural Inf. Process. Syst. 2011, 24. [Google Scholar]
- Polsby, D.D.; Popper, R.D. The third criterion: Compactness as a procedural safeguard against partisan gerrymandering. Yale L. Pol’y Rev. 1991, 9, 301. [Google Scholar] [CrossRef]
- Cox, E. A method of assigning numerical and percentage values to the degree of roundness of sand grains. J. Paleontol. 1927, 1, 179–183. [Google Scholar]
- Wang, Z.; Guo, S.; Shang, X.; Ye, X. Pseudo-label Assisted Optimization of Multi-branch Network for Cross-domain Person Re-identification. In Proceedings of the IEEE International Conference on Mechatronics and Automation (ICMA), Harbin, China, 6–9 August 2023; pp. 13–18. [Google Scholar]
- Montalvo, J.; Alcover-Couso, R.; Carballeira, P.; García-Martín, Á.; SanMiguel, J.C.; Escudero-Viñolo, M. Leveraging Contrastive Learning for Semantic Segmentation with Consistent Labels Across Varying Appearances. arXiv 2024, arXiv:2412.16592. [Google Scholar]
- Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
- Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
- Tobin, J.; Fong, R.; Ray, A.; Schneider, J.; Zaremba, W.; Abbeel, P. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 23–30. [Google Scholar]
Dataset | New Class | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Synthia | - | 82.4 | 37.7 | 88.7 | 43.0 | 8.4 | 50.8 | 55.7 | 55.1 | 86.0 | 88.1 | 74.2 | 49.5 | 87.8 | - | 63.2 | - | 54.5 | 62.8 | 54.9
Synthia | Train | 87.5 | 50.2 | 88.4 | 44.4 | 1.6 | 49.2 | 53.1 | 50.7 | 85.3 | 92.8 | 74.5 | 48.2 | 85.9 | - | 70.6 | 29.7 | 53.4 | 60.1 | 57.0
Synthia | Truck | 82.5 | 40.8 | 88.8 | 44.6 | 6.8 | 50.4 | 55.5 | 51.0 | 85.1 | 91.7 | 67.1 | 47.6 | 90.5 | 64.0 | 60.3 | - | 55.8 | 62.2 | 58.0
Synthia | Both | 86.5 | 47.1 | 88.3 | 44.4 | 4.4 | 49.9 | 54.1 | 54.8 | 86.6 | 93.0 | 73.1 | 41.2 | 86.3 | 38.4 | 49.2 | 52.0 | 53.5 | 60.8 | 59.1
4AGT | - | 91.5 | 68.5 | 89.1 | 43.4 | 30.1 | 50.1 | 48.4 | 59.8 | 88.3 | 92.7 | 70.7 | 35.4 | 88.0 | 25.7 | - | - | 49.9 | 61.9 | 55.2
4AGT | Train | 96.1 | 70.1 | 88.9 | 44.4 | 29.9 | 50.2 | 54.3 | 62.3 | 88.1 | 92.7 | 69.9 | 41.9 | 86.9 | 66.1 | - | 42.7 | 54.8 | 58.4 | 61.0
4AGT | Bus | 96.0 | 70.7 | 89.2 | 44.6 | 33.8 | 51.8 | 54.4 | 60.6 | 88.6 | 93.7 | 69.6 | 38.6 | 90.1 | 56.7 | 49.5 | - | 52.0 | 60.3 | 61.1
4AGT | Both | 95.9 | 70.5 | 87.5 | 33.7 | 25.9 | 51.0 | 53.0 | 57.6 | 88.3 | 93.2 | 70.4 | 43.1 | 85.5 | 35.4 | 65.2 | 65.5 | 55.1 | 61.0 | 63.2
Dataset | Class | Filtering | Road | Sidewalk | Building | Wall | Fence | Pole | Traffic Light | Traffic Sign | Vegetation | Sky | Person | Rider | Car | Truck | Bus | Train | Motorcycle | Bicycle | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Synthia | Train | ✗ | 85.9 | 45.6 | 88.5 | 45.7 | 8.3 | 50.1 | 54.3 | 48.8 | 86.8 | 90.6 | 73.1 | 40.2 | 89.4 | - | 57.3 | 0.0 | 49.0 | 54.3 | 53.8
Synthia | Train | ✓ | 87.5 | 50.2 | 88.4 | 44.4 | 1.6 | 49.2 | 53.1 | 50.7 | 85.3 | 92.8 | 74.5 | 48.2 | 85.9 | - | 70.6 | 29.7 | 53.4 | 60.1 | 57.0
Synthia | Truck | ✗ | 85.4 | 43.5 | 89.1 | 47.4 | 9.2 | 49.9 | 54.5 | 56.1 | 86.4 | 87.9 | 69.0 | 43.1 | 88.9 | 61.2 | 51.2 | - | 53.2 | 61.3 | 57.6
Synthia | Truck | ✓ | 82.5 | 40.8 | 88.8 | 44.6 | 6.8 | 50.4 | 55.5 | 51.0 | 85.1 | 91.7 | 67.1 | 47.6 | 90.5 | 64.0 | 60.3 | - | 55.8 | 62.2 | 58.0
4AGT | Train | ✗ | 96.1 | 64.5 | 88.8 | 41.2 | 27.5 | 50.2 | 53.1 | 59.8 | 88.2 | 92.8 | 69.9 | 41.9 | 88.3 | 26.4 | - | 13.2 | 51.2 | 62.1 | 56.4
4AGT | Train | ✓ | 96.1 | 70.1 | 88.9 | 44.4 | 29.9 | 50.2 | 54.3 | 62.3 | 88.1 | 92.7 | 69.9 | 41.9 | 86.9 | 66.1 | - | 42.7 | 54.8 | 58.4 | 61.0
4AGT | Bus | ✗ | 95.4 | 70.3 | 88.7 | 44.2 | 33.2 | 52.3 | 52.1 | 61.3 | 88.0 | 93.4 | 68.3 | 44.4 | 87.5 | 47.0 | 28.3 | - | 49.1 | 63.2 | 59.3
4AGT | Bus | ✓ | 96.0 | 70.7 | 89.2 | 44.6 | 33.8 | 51.8 | 54.4 | 60.6 | 88.6 | 93.7 | 69.6 | 38.6 | 90.1 | 56.7 | 49.5 | - | 52.0 | 60.3 | 61.1