Data

Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability.
The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.
Quartile Ranking JCR - Q2 (Multidisciplinary Sciences)

All Articles (1,267)

A Real-World Underwater Video Dataset with Labeled Frames and Water-Quality Metadata for Aquaculture Monitoring

  • Osbaldo Aragón-Banderas,
  • Leonardo Trujillo and
  • Yolocuauhtli Salazar
  • + 2 authors

Aquaculture monitoring increasingly relies on computer vision to evaluate fish behavior and welfare under farming conditions. This dataset was collected in a commercial recirculating aquaculture system (RAS) integrated with hydroponics in Queretaro, Mexico, to support the development of robust visual models for Nile tilapia (Oreochromis niloticus). More than ten hours of underwater recordings were curated into 31 clips of 30 s each, a duration selected to balance representativeness of fish activity with a manageable size for annotation and training. Videos were captured using commercial action cameras at multiple resolutions (1920 × 1080 to 5312 × 4648 px), frame rates (24–60 fps), depths, and lighting configurations, reproducing real-world challenges such as turbidity, suspended solids, and variable illumination. For each recording, physicochemical parameters were measured, including temperature, pH, dissolved oxygen and turbidity, and are provided in a structured CSV file. In addition to the raw videos, the dataset includes 3520 extracted frames annotated using a polygon-based JSON format, enabling direct use for training object detection and behavior recognition models. This dual resource of unprocessed clips and annotated images enhances reproducibility, benchmarking, and comparative studies. By combining synchronized environmental data with annotated underwater imagery, the dataset contributes a non-invasive and versatile resource for advancing aquaculture monitoring through computer vision.

18 December 2025

Schematic of the recirculating aquaponic system (RAS) with rearing, solids removal, biofilters, aeration, pumps, and hydroponic grow beds; solid arrows indicate water flow, and dashed arrows indicate air flow.

Labels4Rails: A Railway Image Annotation Tool and Associated Reference Dataset

  • Tina Hiebert,
  • Florian Hofstetter and
  • Carsten Thomas
  • + 3 authors

The development of autonomous train systems relies heavily on machine learning (ML) models, which in turn depend on large, high-quality annotated datasets for training and evaluation. The railway domain lacks adequate public datasets and efficient annotation tools. To address this gap, we present Labels4Rails, a tool designed specifically for the annotation of railway scenes. It captures track topology, switch states including switch directions, and informational tags regarding the images’ content and leverages consistent camera perspectives and the fixed track geometries inherent to railways for annotation efficiency. We used Labels4Rails to create the L4R_NLB reference dataset from Norwegian railway footage. The dataset contains 10,253 annotated images across four seasons, including 1415 switch annotations. Both the tool and dataset are publicly available.

16 December 2025

Labels4Rails user interface with annotated ego track (yellow) and right neighbor track (green).
  • Data Descriptor
  • Open Access

AlimurgITA: A Database of the Italian Alimurgic Flora

  • Piera Di Marzio,
  • Angela Di Iorio and
  • Carmen Giancola
  • + 1 author

The AlimurgITA portal is a user-friendly and effective tool for researching Wild Edible Plants (WEPs). It provides valuable information on alimurgic plant species, aiding conservation and potential applications (agricultural, food, etc.). Users can interact with authors to report errors and contribute to the knowledge base regarding local uses. The authors will update the site every six months to include new data. Currently, the online database contains data on 1116 taxa used in 20 Italian regions: updated scientific name and link to the site Acta Plantarum, family, main synonyms, common name in Italian and regional dialect, chorotype, life form, a map showing the regions where it is known to be used, the part used, how it is used, and the bibliography. From the home page, you can search for taxa by scientific name, and there are pages dedicated to summaries of the entries: scientific name, family, chorotype, life form, method of use, and part used. Additionally, within the FuD WE PIC Project, the AlimurgITA entity list is being integrated with Italian vegetation data from the European Vegetation Archive to model WEPs richness, identify diversity hotspots, and explore the relationship between WEPs diversity and habitat types.

16 December 2025

AlimurgITA online database: a screenshot of the page corresponding to the species Achillea ligustica All.

The development of Natural Language Processing applications tailored for diverse Arabic-speaking users requires specialized Arabic corpora, which are currently lacking in existing Arabic linguistic resources. Therefore, in this study, a multidialectal parallel Arabic corpus is built, focusing on the travel and tourism domain. By leveraging the text generation and dialectal transformation capabilities of Large Language Models, an initial set of approximately 100,000 parallel sentences was generated. Following a rigorous multi-stage deduplication process, 50,010 unique parallel sentences were obtained from Modern Standard Arabic (MSA) and five major Arabic dialects—Saudi, Egyptian, Iraqi, Levantine, and Moroccan. This study presents the detailed methodology of corpus generation and refinement, describes the characteristics of the generated corpus, and provides a comprehensive statistical analysis highlighting the corpus size, lexical diversity, and linguistic overlap between MSA and the five dialects. This corpus represents a valuable resource for researchers and developers in Arabic dialect processing and AI applications that require nuanced contextual understanding.

12 December 2025

Flowchart illustrating the overall process.

Reprints of Collections

Data Mining and Computational Intelligence for E-learning and Education
Reprint

Editors: Antonio Sarasa Cabezuelo, Ramón González del Campo Rodríguez Barbero
Recent Advances and Applications in Partial Least Squares Structural Equation Modeling (PLS-SEM)
Reprint

Editors: María del Carmen Valls Martínez, José-María Montero, Pedro Antonio Martín Cervantes

Data - ISSN 2306-5729