Data

Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability.
The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.
Quartile Ranking JCR - Q2 (Multidisciplinary Sciences)

All Articles (1,269)

  • Data Descriptor
  • Open Access

Advancements in data storage and data processing technologies have compelled higher education institutions to optimise the use of their data. Many universities globally have begun to implement learning analytics at their institutions to better understand and improve teaching and learning. African higher education institutions have been slow to implement learning analytics despite the continued accumulation of digital data. The research related to this study presents a dataset of Information Systems and Technology (IS&T) students from the University of KwaZulu-Natal, a South African university. The dataset comprises approximately 14,000 registered student records from 10 IS&T courses, primarily consisting of demographic data, academic performance (including past IS&T courses and school records), and Learning Management System (LMS) interaction data. The dataset exhibits an imbalance, characterised by a higher proportion of students who have successfully completed courses compared to those who have not. The dataset will be of interest to researchers engaged in learning analytics application studies, including early pass/fail prediction and grade classification, as well as those who want to test their techniques on a real-world dataset.

19 December 2025

Dataset hierarchy.

Constructed-response items offer rich evidence of writing proficiency, but the linguistic signals they contain vary with grade level. This study presents a cross-sectional analysis of 5638 English Language Arts essays from Grades 6–12 to identify which linguistic features predict proficiency and to characterize how their importance shifts across grade levels. We extracted a suite of lexical, syntactic, and semantic-cohesion features, and evaluated their predictive power using an interpretive dual-model framework combining LASSO and XGBoost algorithms. Feature importance was assessed through LASSO coefficients, XGBoost Gain scores, and SHAP values, and interpreted by isolating both the consensus and the divergences among the three metrics. Results show moderate, generalizable predictive signals in Grades 6–8, but no generalizable predictive power was found in the Grades 9–12 cohort. Across the middle grades, three findings achieved strong consensus: essay length, syntactic density, and global semantic organization served as strong predictors of writing proficiency. Lexical diversity emerged as a key divergent feature: it was a top predictor for XGBoost but ignored by LASSO, suggesting its contribution depends on interactions with other features. These findings inform actionable, grade-sensitive feedback, highlighting stable, diagnostic targets for middle school while cautioning that discourse-level features are necessary to model high-school writing.

21 December 2025

A Real-World Underwater Video Dataset with Labeled Frames and Water-Quality Metadata for Aquaculture Monitoring

  • Osbaldo Aragón-Banderas,
  • Leonardo Trujillo and
  • Yolocuauhtli Salazar
  • + 2 authors

Aquaculture monitoring increasingly relies on computer vision to evaluate fish behavior and welfare under farming conditions. This dataset was collected in a commercial recirculating aquaculture system (RAS) integrated with hydroponics in Queretaro, Mexico, to support the development of robust visual models for Nile tilapia (Oreochromis niloticus). More than ten hours of underwater recordings were curated into 31 clips of 30 s each, a duration selected to balance representativeness of fish activity with a manageable size for annotation and training. Videos were captured using commercial action cameras at multiple resolutions (1920 × 1080 to 5312 × 4648 px), frame rates (24–60 fps), depths, and lighting configurations, reproducing real-world challenges such as turbidity, suspended solids, and variable illumination. For each recording, physicochemical parameters were measured, including temperature, pH, dissolved oxygen and turbidity, and are provided in a structured CSV file. In addition to the raw videos, the dataset includes 3520 extracted frames annotated using a polygon-based JSON format, enabling direct use for training object detection and behavior recognition models. This dual resource of unprocessed clips and annotated images enhances reproducibility, benchmarking, and comparative studies. By combining synchronized environmental data with annotated underwater imagery, the dataset contributes a non-invasive and versatile resource for advancing aquaculture monitoring through computer vision.

18 December 2025

Labels4Rails: A Railway Image Annotation Tool and Associated Reference Dataset

  • Tina Hiebert,
  • Florian Hofstetter and
  • Carsten Thomas
  • + 3 authors

The development of autonomous train systems relies heavily on machine learning (ML) models, which in turn depend on large, high-quality annotated datasets for training and evaluation. The railway domain lacks adequate public datasets and efficient annotation tools. To address this gap, we present Labels4Rails, a tool designed specifically for the annotation of railway scenes. It captures track topology, switch states including switch directions, and informational tags regarding the images’ content and leverages consistent camera perspectives and the fixed track geometries inherent to railways for annotation efficiency. We used Labels4Rails to create the L4R_NLB reference dataset from Norwegian railway footage. The dataset contains 10,253 annotated images across four seasons, including 1415 switch annotations. Both the tool and dataset are publicly available.

16 December 2025

Reprints of Collections

Data Mining and Computational Intelligence for E-learning and Education
Reprint

Editors: Antonio Sarasa Cabezuelo, Ramón González del Campo Rodríguez Barbero
Recent Advances and Applications in Partial Least Squares Structural Equation Modeling (PLS-SEM)
Reprint

Editors: María del Carmen Valls Martínez, José-María Montero, Pedro Antonio Martín Cervantes

Data - ISSN 2306-5729