Recent Advances in Machine Learning Methods for Imperfect Large-Scale Data

A special issue of Big Data and Cognitive Computing (ISSN 2504-2289).

Deadline for manuscript submissions: 30 September 2025

Special Issue Editors

Guest Editor
College of Computer Sciences and Technology, Jilin University, Changchun 130000, China
Interests: machine learning; natural language processing

Guest Editor
School of Computer Science and Artificial Intelligence, Liaoning Normal University, Dalian 116081, China
Interests: machine learning; computer vision

Guest Editor
College of Computer Sciences and Technology, Jilin University, Changchun 130000, China
Interests: weakly supervised learning; natural language processing

Special Issue Information

Dear Colleagues,

In recent years, large-scale datasets have become a crucial foundation for machine learning research and applications. However, the data available in real applications are often imperfect, containing noise, missing values, and imbalanced classes, among other issues. Moreover, data continuously accumulate over time, leading to new challenges such as expanding label spaces, shifting statistical properties, and increasing model training costs. Extracting valuable information from such imperfect large-scale data while keeping models efficient has become an active research topic in machine learning.

This Special Issue aims to collect and showcase the latest research achievements in handling and analyzing imperfect large-scale data, with a focus on innovative methods for improving model efficiency. Topics of interest include, but are not limited to, the following:

  1. Machine learning algorithms with noisy data;
  2. Machine learning algorithms with missing data;
  3. Machine learning algorithms with imbalanced data;
  4. Machine learning algorithms with incomplete data;
  5. Training methods for solving catastrophic forgetting;
  6. Machine learning algorithms for solving concept drift;
  7. Efficient training methods for large-scale data;
  8. Privacy preservation and security.

We look forward to receiving your contributions.

Dr. Ximing Li
Dr. Bo Fu
Dr. Changchun Li
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Big Data and Cognitive Computing is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1800 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data cleaning
  • missing data
  • imbalanced data
  • weakly supervised learning
  • incremental learning
  • concept drift
  • lightweight model
  • privacy preservation

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • Reprint: MDPI Books provides the opportunity to republish successful Special Issues in book format, both online and in print.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (2 papers)

Research

25 pages, 3025 KB  
Article
QiGSAN: A Novel Probability-Informed Approach for Small Object Segmentation in the Case of Limited Image Datasets
by Andrey Gorshenin and Anastasia Dostovalova
Big Data Cogn. Comput. 2025, 9(9), 239; https://doi.org/10.3390/bdcc9090239 - 18 Sep 2025
Abstract
The paper presents a novel probability-informed approach to improving the accuracy of small object semantic segmentation in high-resolution imagery datasets with imbalanced classes and a limited volume of samples. Small objects imply having a small pixel footprint on the input image, for example, ships in the ocean. Informing in this context means using mathematical models to represent data in the layers of deep neural networks. Thus, the ensemble Quadtree-informed Graph Self-Attention Networks (QiGSANs) are proposed. New architectural blocks, informed by types of Markov random fields such as quadtrees, have been introduced to capture the interconnections between features in images at different spatial resolutions during the graph convolution of superpixel subregions. It has been analytically proven that quadtree-informed graph convolutional neural networks, a part of QiGSAN, tend to achieve faster loss reduction compared to convolutional architectures. This justifies the effectiveness of probability-informed modifications based on quadtrees. To empirically demonstrate the processing of real small data with imbalanced object classes using QiGSAN, two open datasets of synthetic aperture radar (SAR) imagery (up to 0.5 m per pixel) are used: the High Resolution SAR Images Dataset (HRSID) and the SAR Ship Detection Dataset (SSDD). The results of QiGSAN are compared to those of the transformers SegFormer and LWGANet, which constitute a new state-of-the-art model for UAV (Unmanned Aerial Vehicles) and SAR image processing. They are also compared to convolutional neural networks and several ensemble implementations using other graph neural networks. QiGSAN significantly increases the F1-score values by up to 63.93%, 48.57%, and 9.84% compared to transformers, convolutional neural networks, and other ensemble architectures, respectively. QiGSAN outperformed the base segmentors with the mIOU (mean intersection-over-union) metric too: the highest increase was 35.79%. 
Therefore, our approach to knowledge extraction using mathematical models allows us to significantly improve modern computer vision techniques for imbalanced data.

16 pages, 7105 KB  
Article
A Self-Attention CycleGAN for Unsupervised Image Hazing
by Hongyin Ni and Wanshan Su
Big Data Cogn. Comput. 2025, 9(4), 96; https://doi.org/10.3390/bdcc9040096 - 11 Apr 2025
Abstract
The high cost and difficulty of collecting real-world foggy scene images mean that automatic driving datasets produce limited images in bad weather and lead to deficient training in automatic driving systems, causing unsafe judgments and leading to traffic accidents. Therefore, to effectively promote the safety and robustness of an autonomous driving system, we improved the CycleGAN model to achieve dataset augmentation of foggy images. Firstly, by combining the self-attention mechanism and the residual network architecture, the sense of hierarchy of the fog effect in the synthesized image was significantly refined. Then, LPIPS was employed to adjust the calculation method for cycle consistency loss to make the synthetic picture more similar to the original one in terms of perception. The experimental results showed that the FID index of the foggy image generated by the improved CycleGAN network was reduced by 3.34, the IS index increased by 15.8%, and the SSIM index increased by 0.1%. The modified method enhances the generation of foggy images, while retaining more details of the original image and reducing content distortion.