Automatic Information Extraction from Scientific Publications Based on the Use Case of Additive Manufacturing
Abstract
1. Introduction
1.1. Motivation
1.2. Problem Statement
1.3. Challenges
- Heterogeneity of Document Layouts: Scholarly articles are published in a wide array of layouts, varying by journal, conference, or publisher. This diversity necessitates either highly generalized extraction systems or specialized solutions tailored to specific formats. Developing a universal system demands considerable resources for understanding and adapting to new layouts, potentially reducing efficiency and requiring ongoing maintenance.
- Adaptability to Diverse Document Structures: Scientific publications exhibit a wide range of layouts and formatting conventions. The tool must flexibly adapt to these variations, including handling different domain-specific terminologies, to ensure broad usability—especially within additive manufacturing research.
- Complexity of PDF Format and Structure: PDF documents, while widely used for their portability, often pose significant challenges for automated information extraction. Unlike structured formats such as XML, PDFs lack a consistent internal structure, making it difficult to preserve semantic relationships during extraction. Hidden layers within PDFs can contain content that is not visible in standard viewers, potentially resulting in incomplete or fragmented extraction of tokens and sentences. Especially with older PDFs created from scanned images, poor image quality may necessitate manual correction—a problem less prevalent in recent publications. To minimize these issues, this work focuses exclusively on modern, well-structured PDFs and excludes image-based files. Nonetheless, the problem of invisible data in hidden layers remains unresolved and is acknowledged as a limitation.
- Robust Data Preparation: Accurate extraction relies on effective data preparation. The system must be capable of distinguishing between text, tables, and figures, removing irrelevant layout elements, and normalizing text through methods such as stemming and lemmatization.
- Variability of Natural Language and Data Representation: A major obstacle is the diverse ways in which information is presented within scientific publications. Although English dominates as the primary language, data can appear in multiple forms—text, tables, figures, or charts. Both the linguistic choices and the format impact the accuracy and reliability of extraction systems. Effective solutions must be able to identify and adapt to these variations either prior to extraction or through robust post-processing mechanisms.
- Quality and Clarity of Scientific Publications: The overall quality of scientific writing directly influences extraction performance. Publications authored by non-native speakers may include grammatical or spelling errors, while ambiguous phrasing, misleading content, and complex sentence structures further complicate accurate information extraction. While these issues can be addressed, doing so would significantly increase the initial workload and is beyond the scope of the present work.
- Technical Language and Lack of Standardization: The inconsistent use of technical terms, symbols, and units (such as S.I. units) poses a significant challenge. Authors often follow individual conventions, leading to discrepancies that may result in extraction errors or missed parameters. Existing standards like UTF-8 encoding and common scientific symbols can help, but they are not universally applied, making consistent extraction across diverse literature difficult.
- High Precision in Information Extraction: The utility of the system hinges on the precise extraction of relevant information. This requires the implementation of domain-specific, pattern-based algorithms that can reliably account for the nuances of the target field while maintaining high accuracy.
- Necessity for Domain-Specific Expertise: The validation and meaningful interpretation of extracted information demand substantial domain-specific knowledge. In highly specialized fields such as additive manufacturing, materials science, or engineering, the absence of such expertise can hinder the assessment of data accuracy and relevance. Integrating domain knowledge into both the system’s extraction logic and its evaluation process is therefore essential to ensure robust and reliable outcomes.
- Limitations in Contextual Understanding: Automated extraction systems often struggle to interpret the context in which information appears. Extracted data points, when considered in isolation, can lead to misinterpretation or loss of meaning. While restricting extraction through detailed input parameters can partially address this, additional techniques are required to enhance the system’s ability to understand and refine content within its broader textual context.
- User Accessibility for Non-Experts: Many researchers do not possess advanced skills in text mining or information extraction. Therefore, the system must be highly user-friendly and intuitive, functioning as an out-of-the-box solution that requires minimal configuration while maintaining comprehensive capabilities.
- Operation on Standard Hardware: Given that most researchers rely on standard local computers rather than high-performance servers or cloud solutions, the system must be optimized for efficient operation on typical hardware. Achieving this balance between computational efficiency and the demands of document processing is a significant challenge.
- Rapid Processing and Timely Results: Researchers depend on timely access to insights from their data. The system must process selected PDF files swiftly, minimizing delays without sacrificing accuracy or the ability to handle large datasets. Algorithmic optimization is key to meeting these expectations.
1.4. Technological Opportunities
1.5. Main Objective
1.6. Research Questions
- RQ1: Automated Parameter Extraction: To what extent can parameters, such as physical ranges and material properties, be automatically extracted from scientific publications in the engineering domain?
- RQ2: Category-Specific Information Extraction: Is it feasible to automatically extract the required information for various predefined categories within scientific publications? What are the essential prerequisites for achieving this objective effectively?
- RQ3: Reliability of Extracted Information: How reliable is the extracted information, and what measures can be implemented to minimize false positives or irrelevant data? Additionally, how can the system effectively handle and mitigate such inaccuracies?
1.7. Outline
2. State of the Art and Related Work
2.1. Information Retrieval: Methods and Tools
2.2. Information Extraction: Principles and Approaches
2.2.1. Classical Approaches
- Rule-Based and Token-Level Methods: Traditional extraction approaches rely on deterministic rules or regular expressions to identify and extract relevant text segments. Python [10] and R [11] offer a range of libraries (e.g., PyPDF2, PdfMiner, Tabula, PDFQuery, PyMuPDF, Pytesseract) for parsing and extracting text from PDFs. These tools are effective for basic data extraction and pre-processing, such as splitting text into tokens or extracting tables. However, they struggle with complex layouts (e.g., multi-column formats, nested tables, figures with embedded text), and provide little support for semantic interpretation, making them less suitable for nuanced information extraction in scientific literature.
- Metadata and Content Extraction Tools: Automated tools such as CERMINE [12] and Astera Reportminer [13] are designed to extract metadata (e.g., title, author, affiliations) and structured content from scientific PDFs. CERMINE, for instance, uses machine learning algorithms to segment and classify document components, but is limited to single-document processing. Astera Reportminer, a commercial solution, enables batch extraction and exports data in structured formats (e.g., XLSX, XML), but access is restricted by licensing and proprietary constraints. These tools are valuable for building metadata repositories, but their utility for deep, domain-specific extraction is limited.
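As a toy illustration of the rule-based, token-level style described above (a sketch, not any specific library's API), a single regular expression can already pull simple quantities out of plain text extracted from a PDF:

```python
import re

# Minimal rule-based extraction sketch (illustrative pattern, not the
# implementation used in this work): find quantities such as a layer
# thickness expressed as "<number> <unit>" in already-extracted text.
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|µm|um)")

def extract_quantities(text):
    """Return all (value, unit) pairs matching the simple quantity pattern."""
    return [(float(v), u) for v, u in QUANTITY.findall(text)]
```

Such deterministic patterns are fast and transparent, but as noted above they break down on multi-column layouts, split tokens, or spelled-out units such as "thirty micrometres".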
2.2.2. Ontology-Based and Semantic Approaches
2.2.3. Deep Learning, NLP, and AI-Based Approaches
2.3. Summary of Challenges and Gaps
2.4. Conclusion and Outlook
3. Materials and Methods
3.1. Use Case
3.1.1. Description
3.1.2. Data Basis
- Selection Process: The scientific literature on AM has undergone a rigorous selection process to establish the dataset for this study. The articles were manually chosen based on their relevance to the field of AM, with a specific focus on the usage of Ti6Al4V powder. The selection process involved an initial keyword-based search using scientific publication platforms such as ScienceDirect and ResearchGate. This approach resulted in a collection of 18 scientific publications, all published between 2015 and 2018.
- Anonymization: To ensure an unbiased analysis, free from the influence of author or publication reputation, and to prevent any correlation between the results of the applied information extraction methods and specific author names or publishers, all publications were preprocessed. In this step, journal and conference names were replaced with single letters, and author names were omitted. While this anonymization makes it difficult to identify individual publications, it is important to note that, in theory, metadata such as the number of pages and figures could still allow for narrowing down the search, provided sufficient computational resources and time are available.
- Characteristics: The selected publications are provided as PDF files and originate from a variety of domain-specific journals and conference proceedings. Each publication exhibits a unique layout, such as a single- or two-column page format, and varies in length. Additionally, the proportion of tables and figures differs significantly among the publications. All selected articles are accessible online and are classified as open access. Table 1 summarizes the dataset, detailing aspects such as the number of tables and figures in each publication, as well as the number of columns per page. Figure 1 provides an overview of the number of tokens present in the documents. As shown in the right-hand panel of Figure 1, most documents contain approximately 500 tokens per page. However, four documents exhibit a higher density of textual information, which is attributable to their two-column layout.
3.1.3. Search Categories
- Base material production,
- manufacturing process,
- heat treatment,
- tensile test,
- fatigue test.
3.1.4. Reference Values
3.2. Automatic Information Extraction
- User Input Phase: The objective is to define the parameters and criteria for the extraction process. Output is a set of user-defined criteria and patterns guiding the subsequent phases.
- Data Preparation Phase: The objective is to process and structure the raw textual data extracted from PDF files into an analyzable format. Output is a clean, structured dataset ready for information extraction.
- Information Extraction Phase: The objective is to extract and classify relevant information based on the prepared dataset and user-defined criteria. Output is a structured set of extracted information, ready for analysis and application.
3.2.1. User Input
- Definition of Search Categories: Users are required to provide a set of predefined search categories. These categories specify the types of information to be extracted, such as keywords, physical parameters, or other domain-specific data. Clear definition of these categories ensures that the extraction process targets relevant information, enhancing the precision of results.
- Provision of Search Patterns: For each search category, users must supply appropriate search patterns for both keys and values: Keys are strings or patterns representing terms related to the search category (e.g., “Young’s modulus”), and values are strings or patterns representing the possible data values associated with the keys (e.g., numerical ranges). Search patterns can be defined using regular expressions, allowing for flexible and robust pattern matching. Logical operators within these expressions enable complex queries, enhancing the system’s adaptability to diverse data formats.
- Section-Based Search Restriction: Users can optionally restrict the search scope to specific sections of the text, as identified during the data preparation phase. This fine-tuning reduces the algorithm’s runtime by focusing only on relevant sections. However, this feature depends on the accuracy of section detection. Misidentified or undetected sections may result in missed key-value pairs, potentially omitting relevant data.
- Definition of Numerical Bounds: For search categories involving numerical values, users can define upper and lower bounds to restrict the range of acceptable results. This step ensures that only feasible and reasonable values are considered. By narrowing the search range, this configuration reduces false positives and improves the specificity of the results. For instance, reasonable values for Young’s modulus might be set to [50 GPa; 200 GPa], whereas tensile strength might be constrained to [500 MPa; 1500 MPa].
- Search Category: “Layer Thickness”: This denotes the type of information the user aims to extract.
- Search Pattern for Key: “layer”, “thick(ness)”: These patterns specify the terms related to the search category, allowing the system to identify relevant keywords in the text.
- Search Pattern for Value: “<floating-point number> <prefix> m”: This pattern defines the expected format of the values associated with the key, such as numerical measurements with units (e.g., “0.01 mm”).
- Range for Value: The acceptable interval for the numerical values (i.e., a lower and upper bound on the layer thickness), ensuring only relevant and feasible data are extracted.
- Search Sections: Introduction, Setup, Results: These are the sections of the document where the system will focus its search, optimizing runtime and improving accuracy by narrowing the scope to relevant content.
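The user input for the “Layer Thickness” example above could be encoded roughly as follows; the field names, the unit-prefix pattern, and the bounds are illustrative assumptions, not the tool's actual schema:

```python
import re

# Hypothetical encoding of the user input described above; all names and
# the chosen bounds are illustrative assumptions.
search_category = {
    "name": "Layer Thickness",
    "key_patterns": [re.compile(r"layer", re.I), re.compile(r"thick(ness)?", re.I)],
    # "<floating-point number> <prefix> m", e.g. "0.01 mm" or "30 µm"
    "value_pattern": re.compile(r"(\d+(?:\.\d+)?)\s*([mµnu]?)m\b"),
    "value_range_mm": (0.005, 0.5),   # assumed plausible bounds in mm
    "sections": ["Introduction", "Setup", "Results"],
}

def key_matches(sentence, category):
    """True if every key pattern of the category occurs in the sentence."""
    return all(p.search(sentence) for p in category["key_patterns"])
```

Requiring all key patterns to co-occur in a sentence is one possible design choice; a looser `any`-based match would trade precision for recall.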
3.2.2. Data Preparation
- Step 1. Extraction of Textual Data:
- 1.1. Segmentation of PDF Elements: The content of PDF files is first divided into distinct blocks, separating elements such as tables, figures, text, footers, and captions (see Algorithm 1). This segmentation is achieved using layout-based indicators, notably vertical spacing, font size, and token positions, and divides the document into logical units that provide the foundation for subsequent classification and analysis.
- 1.2. Classification of Detected Blocks: Block types are detected based on layout features such as vertical spacing between lines, font sizes, and the order and position of keywords (see Algorithm 2). These indicators help isolate meaningful content from surrounding noise.
- Step 2. Text Cleaning:
- 2.1. Stop Word Removal: Commonly occurring but contextually insignificant words, such as articles and prepositions, are removed to reduce noise and enhance the focus on relevant textual data.
- 2.2. Removal of Layout-Specific Structures: Elements such as headers, footers, and other repetitive layout-specific structures are eliminated to streamline the dataset.
- Step 3. Text Structure Refinement:
- 3.1. Sentence Decomposition: Text sections are broken down into individual sentences, facilitating easier analysis and processing in subsequent steps.
- 3.2. Word Normalization: Techniques such as stemming (reducing words to their root forms) and lemmatization (transforming words into their base forms) are applied to standardize the textual data.
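A minimal sketch of Steps 2 and 3, with a toy stop-word list and a crude suffix-stripping stemmer standing in for what an NLP library such as NLTK or spaCy would provide:

```python
import re

# Toy stand-ins for the cleaning and refinement steps described above;
# a real pipeline would use a full stop-word corpus and a proper
# stemmer/lemmatizer instead of these simplifications.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "was"}

def split_sentences(text):
    """Naive sentence decomposition on sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def clean_tokens(sentence):
    """Tokenize, lowercase, drop stop words, and crudely strip plural 's'."""
    tokens = re.findall(r"[a-zA-Z]+", sentence.lower())
    return [t.rstrip("s") if len(t) > 3 else t for t in tokens if t not in STOP_WORDS]
```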
Algorithm 1 Algorithm for segmenting each document in a collection into logical text blocks by analyzing layout features such as font height, line spacing, and vertical gaps.
Algorithm 2 Algorithm for classifying text blocks in a document collection by analyzing content, font size, and layout features to automatically identify structural elements such as section headings, figure captions, table captions, headers, and footers.
Segmentation of PDF Elements
- Document Analysis (lines 1–2): For each document, the algorithm analyses all tokens across its pages to determine the most common font height and line spacing. These values are used to estimate what constitutes a “normal” line gap in the document.
- Threshold Computation (line 3): The algorithm computes a threshold for vertical gaps between tokens, typically by multiplying the most common line spacing by a constant factor. This threshold distinguishes regular line breaks from the larger vertical gaps that indicate block boundaries.
- Page Processing (lines 4–8): On each page, the algorithm sorts all tokens by their vertical (y) position. It then calculates the vertical distance between consecutive tokens. Whenever a gap exceeds the computed threshold, a new block boundary is defined. The algorithm records the start and end index, as well as the position coordinates, for each block.
- Result Compilation (lines 9–10): For each page, the block information is stored in a structured form (such as a data frame or table). All page block data are collected into a list for the document. After processing all documents, the function returns a list of block information for every document in the collection.
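The four stages above can be sketched as follows for a single page; the token representation and the threshold factor of 1.5 are assumptions, not the paper's tuned values:

```python
from collections import Counter

# Sketch of the gap-based segmentation described above (cf. Algorithm 1).
# Tokens are (text, y) pairs with y increasing down the page; the factor
# of 1.5 is an assumed constant, not the implementation's tuned value.
def segment_blocks(tokens, factor=1.5):
    tokens = sorted(tokens, key=lambda t: t[1])
    gaps = [b[1] - a[1] for a, b in zip(tokens, tokens[1:]) if b[1] > a[1]]
    if not gaps:
        return [tokens]
    normal_gap = Counter(gaps).most_common(1)[0][0]   # most common line spacing
    threshold = normal_gap * factor
    blocks, current = [], [tokens[0]]
    for prev, tok in zip(tokens, tokens[1:]):
        if tok[1] - prev[1] > threshold:              # large gap => new block
            blocks.append(current)
            current = []
        current.append(tok)
    blocks.append(current)
    return blocks
```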
Classification of Detected Blocks
- Preparation (lines 1–2 of Algorithm 2): The algorithm begins by preparing a set of section name patterns (such as “Abstract”, “Introduction”, etc.) to assist in identifying section headings. It also retrieves unique document IDs for reference and verification.
- Document and Page Iteration (lines 3–4): For each document in the collection, and for each page within that document, the algorithm iterates through all detected blocks. Each block is initially assigned the default type ’Text’.
- Feature Extraction (lines 5–6): For each block, the algorithm extracts:
  - the block’s text content (first and last token),
  - font size and vertical position (y-coordinates),
  - block width (difference between leftmost and rightmost x-coordinates),
  - the number of tokens in the block.
- Block Type Detection (lines 7–13): Several rules are applied in order of priority:
  - Header: If the block uses a different font size and is positioned at the top of the page, it is classified as a header.
  - Footer: If the block uses a different font size, is at the bottom of the page, and is separated by a large vertical gap from the previous block, it is classified as a footer.
  - Figure/Table Caption: If the block starts with “Fig” or “Tab” and has an unusual font size, it is classified as a figure or table caption, respectively.
  - Section Heading: If the block starts with a section keyword or a numbered pattern and is relatively narrow (not extending to the end of the column), it is classified as a section heading.
  - Text: If none of the above conditions are met, the block remains classified as regular text.
The order of these checks is crucial: a block may match several criteria, but only the first matching type is assigned.
- Update and Output (line 17): The determined type is assigned to the block in the output structure. After all blocks are processed, the updated classification is returned for the entire document collection.
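The priority-ordered rules can be sketched as a single function; the thresholds, field names, and the simplified footer rule (which omits the vertical-gap check) are illustrative assumptions:

```python
# Sketch of the priority-ordered classification rules above (cf. Algorithm 2).
# Each block is a dict with its text, font size, vertical position, and width;
# all thresholds are assumed values, not the implementation's constants.
SECTION_WORDS = ("abstract", "introduction", "results", "discussion", "conclusion")

def classify_block(block, body_font=10.0, page_height=800, column_width=250):
    text = block["text"].strip().lower()
    odd_font = abs(block["font_size"] - body_font) > 0.5
    if odd_font and block["y"] < 0.05 * page_height:
        return "Header"                      # top of page, deviating font
    if odd_font and block["y"] > 0.95 * page_height:
        return "Footer"                      # bottom of page, deviating font
    if odd_font and text.startswith(("fig", "tab")):
        return "Caption"                     # figure or table caption
    if text.startswith(SECTION_WORDS) and block["width"] < 0.6 * column_width:
        return "Section Heading"             # narrow block starting with keyword
    return "Text"                            # default type
```

Because the checks return on the first match, their order encodes the priority described in the text.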
Document- and Token-Level Data Structures for Scientific Texts
1. Document-Level Metadata Extraction: For each scientific publication, key metadata such as document ID, title, authors, creation and modification dates, and subject areas are extracted. These data points are organized in a dedicated metadata table (see Table 4), enabling easy identification, filtering, and referencing of documents.
2. Text Tokenization and Annotation: The full text of each publication is processed and segmented into logical blocks, sentences, and tokens. Each token is annotated with linguistic features (lemma, part-of-speech, named entity) and positional information. This information is stored in a token-level table (see Table 5), with each entry referencing the corresponding document via a shared Doc ID.
3. Relational Linking: The two tables are linked by the Doc ID, ensuring that every token or sentence can be traced directly to its parent document and associated metadata. This relational structure supports robust querying and cross-referencing.
- Clarity and Reproducibility: The separation of metadata and token-level information provides clear data organization and supports reproducible workflows.
- Scalability: The relational design allows for efficient storage and processing, even with large document collections.
- Analytical Flexibility: Researchers can perform analyses at both document and token level, enabling a wide range of text mining and information extraction tasks.
- Traceability: The Doc ID linkage ensures that all data points remain connected to their original context, supporting transparency and data integrity.
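The two linked tables might look as follows in a minimal in-memory form; the example rows and column names are invented for illustration, with only the shared Doc ID linkage taken from the text:

```python
# Invented example rows illustrating the relational structure described
# above: a metadata table and a token-level table joined via the Doc ID.
metadata = [
    {"doc_id": 1, "title": "Fatigue of Ti6Al4V (example)", "year": 2016},
    {"doc_id": 2, "title": "SLM Process Study (example)", "year": 2017},
]
tokens = [
    {"doc_id": 1, "sentence": 1, "token": "fatigue", "lemma": "fatigue"},
    {"doc_id": 1, "sentence": 1, "token": "tests", "lemma": "test"},
    {"doc_id": 2, "sentence": 1, "token": "laser", "lemma": "laser"},
]

def tokens_of(doc_id):
    """Join: all token rows belonging to one document via the shared Doc ID."""
    return [t for t in tokens if t["doc_id"] == doc_id]
```

In practice the same join would be expressed over two data frames or database tables rather than Python lists.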
Limitations and Requirements for PDF Processing
- Handling Multiple Layers in PDF Files: PDF files often contain multiple content layers, with only one layer typically visible in standard PDF viewers. To address this, the proposed approach requires the flattening of PDF files prior to processing. Flattening ensures that hidden layers are merged and their content becomes accessible for extraction. However, this process is effective only if the PDF files have not already been flattened into a single layer during their creation or prior modifications. If the original layered structure is lost, some information might remain inaccessible.
- Importance of Block Detection and Classification: The processing workflow includes two critical steps: block detection and block type classification. Of these, block detection is particularly crucial, as it establishes the foundation for subsequent analyses. If the detected blocks do not align with the semantic requirements of the system, the subsequent information extraction process, which relies heavily on the accurate segmentation and extraction of sentences, may fail. Ensuring the precise detection of text blocks is therefore essential to maintaining the operability and accuracy of the entire extraction pipeline.
3.2.3. Information Extraction
- Keyword Extraction: The process involves identifying terms and phrases that are specific to the field of AM research. Additionally, it accounts for variations in how keywords are represented, addressing challenges such as typographical errors and inconsistencies in terminology.
- Value Range Classification: The process includes extracting numerical values along with their associated units, such as physical parameters, directly from the text. Furthermore, it ensures the accurate classification of these values, even in the presence of discrepancies in unit representation or formatting errors.
- Pattern Matching: Specific patterns are defined to identify keywords and numerical values within the text. These patterns are designed to accommodate variations in spelling, formatting, and unit representation.
- Category-Specific Processing: For keywords, the algorithm uses a predefined list of domain-specific terms, supplemented by contextual analysis to identify relevant additions. For value ranges, the algorithm identifies numerical data and validates associated units, ensuring consistency and relevance to the AM domain.
- Error Handling and Refinement: Mechanisms are implemented to detect and correct errors in extracted data, such as mismatched units or incomplete numerical entries. The system iteratively refines its output by cross-referencing extracted data with predefined standards and expected formats.
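A condensed sketch of this extraction loop: the key and value patterns, the unit table, and the bounds are illustrative choices, not the patterns used in the study.

```python
import re

# Sketch of pattern matching with numerical bound checking as described
# above; patterns, units, and bounds are illustrative assumptions.
KEY = re.compile(r"layer\s+thickness", re.I)
VALUE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|µm)")
TO_MM = {"mm": 1.0, "µm": 1e-3}

def extract_pairs(sentences, lower=0.005, upper=0.5):
    """Return (sentence index, value in mm) for in-range key-value matches."""
    hits = []
    for i, s in enumerate(sentences):
        if not KEY.search(s):
            continue
        for num, unit in VALUE.findall(s):
            mm = float(num) * TO_MM[unit]
            if lower <= mm <= upper:          # numerical bound check
                hits.append((i, mm))
    return hits
```

Normalizing every hit to a common unit before the bound check is what catches mismatched or implausible values, one of the error-handling mechanisms noted above.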
3.2.4. Algorithm
Algorithm 3 Algorithm for automatic extraction of key-value pairs from prepared scientific text sections using user-defined patterns for keys and values, with optional numerical value range constraints, and storage of results together with sentence identifiers.
3.2.5. Implementation
3.3. Information Extraction Using ChatPDF
Listing 1. Query in ChatPDF used for extracting information from each document of the data basis and each search category.
4. Results
4.1. Automatic Method
4.1.1. Data Preparation
4.1.2. Information Extraction
4.2. ChatPDF Method
5. Discussion
5.1. Comparisons
- True positive (TP): Both the proposed and the reference method successfully identify the information. This corresponds to cases where the entry is filled in both tables.
- False positive (FP): The proposed method identifies information (entry filled in the table created using the proposed method), but the reference method does not (entry empty in the table created using the reference method).
- False negative (FN): The proposed method does not identify information (entry empty in the table created using the proposed method), but the reference method does (entry filled in the table created using the reference method).
- True negative (TN): Neither the proposed method nor the reference method identifies the information. This corresponds to cases where the entry is empty in both tables.
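Encoding each cell of the two result tables as a filled/empty flag, the four counts and the derived metrics follow directly; the example entries below are invented for illustration:

```python
# Confusion counts following the TP/FP/FN/TN definitions above: each
# position holds True if the table entry is filled, False if empty.
def confusion(proposed, reference):
    tp = sum(p and r for p, r in zip(proposed, reference))
    fp = sum(p and not r for p, r in zip(proposed, reference))
    fn = sum(not p and r for p, r in zip(proposed, reference))
    tn = sum(not p and not r for p, r in zip(proposed, reference))
    return tp, fp, fn, tn

def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)
```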
5.1.1. Automatic vs. Manual
- Minimal Impact on Data Quality: A 4.4% FP rate indicates that only a small proportion of irrelevant or incorrect information is included in the systematic literature review results.
- Limited Additional Manual Effort: The need for researchers to review and filter false positives is reduced compared to higher FP rates, preserving the efficiency gains of the automated workflow.
- Lower Risk of Misleading Analyses: The relatively low number of false positives minimizes the potential for inaccuracies in subsequent analyses or meta-studies, supporting the reliability of scientific conclusions.
- Continued Room for Optimization: While the workflow already demonstrates strong extraction performance, further reducing the FP rate could enhance the system’s trustworthiness and practical value even more.
- Noticeable Information Loss: A 19% FN rate indicates that a substantial amount of pertinent data are omitted, which can negatively impact the overall quality of the extracted dataset.
- Impact on Completeness: Such a high rate of missed values may result in significant gaps, thereby compromising the comprehensiveness and reliability of systematic reviews or meta-analyses. Inadvertent exclusion of relevant studies or critical information could introduce bias into the review outcomes.
- Clear Need for Improvement: These findings highlight the inherent limitations of rule-based extraction approaches and underscore the need for integrating more advanced natural language processing techniques to reduce missed values and improve accuracy.
5.1.2. ChatPDF vs. Manual
- The very high precision (0.988) and specificity (0.987) indicate that ChatPDF rarely extracts irrelevant information, making its results highly reliable.
- The accuracy (0.783) and F1 score (0.790) are solid, but reflect the imbalance between the extremely high precision and the comparatively lower recall.
- The recall (0.659) is noticeably lower, meaning that a significant proportion of relevant information is missed (a higher rate of false negatives).
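As a sanity check, the reported F1 score can be reproduced from the published precision and recall; the small deviation stems from rounding of the reported values.

```python
# F1 is the harmonic mean of precision and recall; plugging in the
# reported ChatPDF values reproduces the reported 0.790 up to rounding.
precision, recall = 0.988, 0.659
f1 = 2 * precision * recall / (precision + recall)
```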
5.2. Benefits
- Efficiency: The automatic extraction tool dramatically reduces the time required to process and analyze scientific publications. What previously took hours of manual work can now be completed in seconds per document, enabling researchers to focus on higher-level analysis and synthesis.
- Scalability: The workflow is capable of handling large datasets, making it suitable for systematic reviews or meta-analyses involving hundreds of publications. This scalability facilitates comprehensive literature coverage and supports data-driven research.
- Accessibility: Designed for use on standard hardware and without the need for advanced programming skills, the tool lowers the barrier for adoption among researchers from various backgrounds.
- Consistency and Reproducibility: Automated extraction ensures consistent application of search patterns and rules, reducing human error and subjective bias. This enhances the reproducibility of literature reviews and data extraction processes.
- Higher Recall and Flexibility: The proposed rule-based approach achieves a higher recall than the ChatPDF-based method, meaning it successfully captures a greater proportion of relevant information present in the articles. This reduces the risk of omitting important data and is particularly advantageous for systematic reviews where completeness is critical. The method is well-suited for scenarios where sensitivity and comprehensive data collection are prioritized over absolute precision, ensuring that fewer relevant studies or data points are missed.
- Customizability: Through the use of regular expressions and configurable search patterns, the tool can be adapted to different research domains and evolving information needs.
- Open Science Potential: The planned open-source release and possible web-based implementation foster collaboration, transparency, and continuous improvement within the research community.
- Data Security and Local Processing: The proposed approach does not require specialized hardware and can be executed on a standard local client. Local processing ensures a high level of data security, as sensitive documents do not need to be transferred to third parties. Additionally, the method enables rapid and efficient processing of typical scientific publications.
- User-Friendliness: The intuitive interface and straightforward workflow allow researchers to manage and organize large document collections efficiently, even without technical expertise.
5.3. Limitations
- Data Quality Dependency: The accuracy of extraction is highly dependent on the quality of the text extracted from PDFs. Poor OCR results, corrupted files, or inconsistent PDF structures can lead to incomplete or erroneous data.
- Complex Document Layouts: Multi-column formats, embedded figures, tables, and non-standard layouts present significant challenges for accurate block detection and semantic preservation. This may result in fragmented or misaligned information extraction.
- Lower Precision and Manual Effort Required: Compared to ChatPDF, the rule-based method demonstrates lower precision and specificity, which leads to a higher rate of false positives. This means that more irrelevant or incorrectly classified information is included in the results. The inclusion of irrelevant data requires additional manual review and filtering by researchers, which can partially negate the efficiency gains achieved through automation. While recall is higher than with ChatPDF, the overall balance between precision and recall (as reflected in the F1 score) may be less favorable, indicating a trade-off between completeness and the need for manual post-processing.
- Effort Required for Defining Search Patterns: Implementing the approach necessitates a certain degree of effort, particularly in defining and adjusting suitable regular expressions (“keys”). Developing regular expressions for physical quantities involving floating-point numbers and units is especially complex. While the selection of categories is generally appropriate, distinguishing between categories with similar target entities can be challenging.
- Search Pattern Sensitivity: The effectiveness of extraction relies heavily on the definition and specificity of search patterns. Inflexible or poorly defined patterns may miss relevant information or generate false positives.
- Limited Contextual Understanding: The tool currently lacks sophisticated mechanisms for contextual analysis, which can lead to misinterpretation of ambiguous or domain-specific terms. Human review is still necessary to ensure the relevance and accuracy of extracted data.
- Manual Pre- and Post-Processing: Some manual intervention may still be required for pre-processing (e.g., handling protected PDFs, ensuring file quality) and post-processing (e.g., validation, data cleaning), which can limit the overall automation benefit.
- Domain Adaptation: While the tool is customizable, significant adaptation effort may be needed to apply it to domains with very different terminology, reporting standards, or document structures.
- Handling of Non-Textual Data: Extraction of information from images, complex tables, or graphical elements remains limited, potentially omitting valuable data present in figures.
- Legal and Ethical Considerations: Automated extraction from copyrighted or paywalled documents may raise legal or ethical issues, especially when sharing or publishing extracted data.
- Scalability Constraints: Although scalable for moderate datasets, extremely large-scale deployments may require additional optimization or parallel processing capabilities.
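Several of the limitations above—pattern sensitivity, false positives, and the effort of defining keys—follow from how key/value search patterns are matched per sentence. The sketch below illustrates that matching scheme in Python; the category name and both regular expressions are simplified illustrations, not the tool's actual definitions.

```python
import re

# Hypothetical key/value search patterns for one category: a sentence is a
# hit only when the keyword pattern AND the value pattern both match.
CATEGORIES = {
    "Manufacturing_Layer_Thickness": {
        "keyword": re.compile(r"layer\s?thickness", re.IGNORECASE),
        "value": re.compile(r"(\d+(?:\.\d+)?)\s?(μm|um)"),
    },
}

def extract(sentence: str) -> dict:
    """Return {category: matched value} for every category whose keyword
    and value patterns both occur in the sentence."""
    hits = {}
    for name, pats in CATEGORIES.items():
        if pats["keyword"].search(sentence) and (m := pats["value"].search(sentence)):
            hits[name] = m.group(0)
    return hits

print(extract("A layer thickness of 30 μm was used."))
```

A sentence that matches the keyword but carries no extractable value produces no hit (a source of false negatives), while an overly broad value pattern produces the false positives discussed above.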
5.4. Ethical and Copyright Implications
6. Conclusions
6.1. Summary
- RQ1: Automated Parameter Extraction: A structured methodology was implemented to determine the extent to which parameters—such as physical ranges and material properties—can be automatically extracted from scientific publications. Tailored search patterns and algorithms were developed and applied, as described in Section 3.2.3. The outcomes, including the efficiency and accuracy of parameter extraction, are presented and discussed in Section 4 and Section 5.
- RQ2: Reliability of Extracted Information: The reliability of the automatically extracted information was evaluated by comparison with manually curated data. This assessment, detailed in Section 5, provided insights into the system’s precision and recall, highlighting its strengths and areas for improvement.
- RQ3: Category-Specific Information Extraction: The feasibility of extracting information for predefined categories was explored by designing and applying category-specific search patterns, as outlined in Section 3.1.3. The study demonstrated that, given well-defined search criteria, the system can effectively extract targeted information across multiple categories.
6.2. Key Benefits
- Efficiency: Automated extraction reduces processing time from hours to seconds per document, allowing researchers to quickly process large volumes of literature.
- Accessibility: The tool is designed for use on standard hardware and does not require advanced technical expertise, making it accessible to a broad user base.
- Accuracy: The method delivers a high match rate with manual extraction, ensuring reliable and precise results.
- Scalability: The workflow is suitable for large document collections and can be adapted for use in other scientific domains beyond additive manufacturing.
- Open Science: With a planned open-source release and potential web-based implementation, the tool fosters collaboration and broad accessibility within the research community.
- User-Friendliness: Researchers can import PDF files in a straightforward and intuitive manner, making the tool highly effective for managing and organizing existing collections of documents.
6.3. Key Limitations
- Data Quality: The reliability of the extraction process is highly dependent on the quality of the text obtained from PDF documents. Errors in recognizing or segmenting PDF elements can disrupt semantic coherence, making high-quality input data essential for accurate results.
- Definition of Search Patterns: The effectiveness of information extraction is strongly influenced by the precision of the search patterns used. Poorly defined or overly broad patterns may result in missed or irrelevant information.
- Document Layout: Variations in document formatting—such as multi-column layouts or inconsistent journal structures—present significant challenges for accurate extraction. These layout differences can hinder the correct preservation of text semantics. Enhancing data pre-processing and exploring AI-based methods for block type identification could improve extraction quality.
- Contextual Relevance: While the tool provides a robust foundation for identifying target information, it does not yet reliably determine the contextual relevance of extracted information. At this stage, human evaluation remains necessary to ensure the applicability and accuracy of the results.
6.4. Future Work and Outlook
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chowdhury, G.G. Introduction to Modern Information Retrieval; Facet Publishing: London, UK, 2010.
- Nasar, Z.; Jaffry, S.W.; Malik, M. Information extraction from scientific articles: A survey. Scientometrics 2018, 117, 1931–1990.
- Zhu, R.; Tu, X.; Xiangji Huang, J. Chapter seven—Deep learning on information retrieval and its applications. In Deep Learning for Data Analytics; Das, H., Pradhan, C., Dey, N., Eds.; Academic Press: Cambridge, MA, USA, 2020; pp. 125–153.
- Esteva, A.; Kale, A.; Paulus, R.; Hashimoto, K.; Yin, W.; Radev, D.; Socher, R. COVID-19 information retrieval with deep-learning based semantic search, question answering, and abstractive summarization. npj Digit. Med. 2021, 4, 68.
- Grace, S.; Rosenthal, J. Sourcing and Referencing; Brill: Leiden, The Netherlands, 2009.
- Chen, C. Science Mapping: A Systematic Review of the Literature. J. Data Inf. Sci. 2017, 2, 1–40.
- Gusenbauer, M. Google Scholar to Overshadow Them All? Comparing the Sizes of 12 Academic Search Engines and Bibliographic Databases. Scientometrics 2019, 118, 177–214.
- Marcos-Pablos, S.; García-Peñalvo, F. Information retrieval methodology for aiding scientific database search. Soft Comput. 2020, 24, 5551–5560.
- Wang, Y.; Wang, L.; Rastegar-Mojarad, M.; Moon, S.; Shen, F.; Afzal, N.; Liu, S.; Zeng, Y.; Mehrabi, S.; Sohn, S.; et al. Clinical Information Extraction Applications: A Literature Review. J. Biomed. Inform. 2017, 77, 34–49.
- Welcome to Python.org. Available online: https://www.python.org/ (accessed on 29 June 2025).
- R: The R Project for Statistical Computing. Available online: https://www.r-project.org/ (accessed on 29 June 2025).
- Tkaczyk, D.; Szostek, P.; Fedoryszak, M.; Dendek, P.J.; Bolikowski, L. CERMINE: Automatic extraction of structured metadata from scientific literature. Int. J. Doc. Anal. Recognit. (IJDAR) 2015, 18, 317–335.
- Ahmed, I. Astera ReportMiner. Available online: https://www.astera.com/products/report-miner/ (accessed on 29 June 2025).
- Gwizdka, J.; Hansen, P.; Hauff, C.; He, J.; Kando, N. Search as Learning (SAL) Workshop 2016. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’16), Pisa, Italy, 17–21 July 2016; pp. 1249–1250.
- Müller, H.M.; Kenny, E.; Sternberg, P. Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature. PLoS Biol. 2004, 2, e309.
- Ceci, F.; Pietrobon, R.; Goncalves, A. Turning Text into Research Networks: Information Retrieval and Computational Ontologies in the Creation of Scientific Databases. PLoS ONE 2012, 7, e27499.
- Dragoni, M.; da Costa Pereira, C.; Tettamanzi, A. A Conceptual Representation of Documents and Queries for Information Retrieval Systems by Using Light Ontologies. Expert Syst. Appl. 2012, 39, 10376–10388.
- Lopez, P. GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Proceedings of the 13th European Conference on Research and Advanced Technology for Digital Libraries, Corfu, Greece, 27 September–2 October 2009.
- Pdfminer.six Documentation. Available online: https://pdfminersix.readthedocs.io/en/latest/ (accessed on 29 June 2025).
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT, Minneapolis, MN, USA, 2–7 June 2019.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000–6010.
- Kardas, M.; Borchmann, L.; Bembenek, P.; Lewandowski, M.; Marcinczuk, M.; Gawor, M.; Rybak, P.; Wroblewska, A.; Rychlikowski, P.; Kocon, J. Document Structure Recognition: A Review. arXiv 2020, arXiv:2008.05961.
- Ponte, J.; Croft, W. A Language Modeling Approach to Information Retrieval. ACM SIGIR Forum 2017, 51, 202–208.
- Zhang, W.; Zhao, X.; Zhao, L.; Yin, D.; Yang, G.H.; Beutel, A. Deep Reinforcement Learning for Information Retrieval: Fundamentals and Advances. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’20), Virtual Event, China, 25–30 July 2020; pp. 2468–2471.
- Tshitoyan, V.; Dagdelen, J.; Weston, L.; Dunn, A.; Rong, Z.; Kononova, O.; Persson, K.A.; Ceder, G.; Jain, A. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 2019, 571, 95–98.
- Authors of chatpdf.com. ChatPDF—Chat with any PDF! Available online: https://www.chatpdf.com/ (accessed on 29 June 2025).
- Kovacevic, A.; Ivanovic, D.; Milosavljević, B.; Konjovic, Z.; Surla, D. Automatic extraction of metadata from scientific publications for CRIS systems. Program Electron. Libr. Inf. Syst. 2011, 45, 376–396.
- Dieb, S.; Yoshioka, M.; Hara, S.; Newton, M. Framework for automatic information extraction from research papers on nanocrystal devices. Beilstein J. Nanotechnol. 2015, 6, 1872–1882.
- ScienceBeam: Using Open Technology to Extract Knowledge from Research PDFs. Available online: https://elifesciences.org/labs/743da0fc/sciencebeam-using-open-technology-to-extract-knowledge-from-research-pdfs (accessed on 29 June 2025).
- Raßloff, A.; Feldhoff, K.; Wiemer, H.; Zimmermann, M.; Kästner, M. AMTwin-Datengetriebene Prozess-, Werkstoff- und Strukturanalyse für die additive Fertigung. In Proceedings of the Mobilität der Zukunft–Bauteilzuverlässigkeit im digitalen Zeitalter-DVM-Tag 2023, Berlin, Germany, 29–30 March 2023.
- Li, P.; Warner, D.H.; Fatemi, A.; Phan, N. Critical assessment of the fatigue performance of additively manufactured Ti–6Al–4V and perspective for future research. Int. J. Fatigue 2016, 85, 130–143.
- Liu, S.; Shin, Y.C. Additive manufacturing of Ti6Al4V alloy: A review. Mater. Des. 2019, 164, 107552.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023.
- Crews, K.D. Copyright and the academic author: Legal rights and responsibilities. Portal Libr. Acad. 2012, 12, 297–317.
- Creative Commons. About The Licenses. Available online: https://creativecommons.org/licenses/ (accessed on 6 August 2025).
- International Association of Scientific, Technical and Medical Publishers. STM Text and Data Mining Guidelines. Available online: https://www.stm-assoc.org/intellectual-property/text-and-data-mining/ (accessed on 6 August 2025).
- Committee on Publication Ethics. COPE Guidelines. Available online: https://publicationethics.org/guidance/Guidelines (accessed on 6 August 2025).
Lid (Local ID) | Gid (Global ID) | Type (P = Proceedings, J = Journal) | Publisher | nP (Pages) | nC (Columns) | nW (Words) | pW (Words per Page) | nT (Tables) | nF (Figures) |
---|---|---|---|---|---|---|---|---|---|
1 | 001 | P | A | 6 | 1 | 3705 | 618 | 2 | 4 |
2 | 002 | P | A | 6 | 1 | 2990 | 498 | 2 | 6 |
3 | 003 | P | A | 9 | 1 | 5606 | 623 | 1 | 4 |
4 | 008 | J | B | 14 | 2 | 13,367 | 955 | 3 | 18 |
5 | 011 | J | B | 17 | 2 | 9039 | 532 | 3 | 21 |
6 | 013 | P | C | 7 | 1 | 3733 | 533 | 3 | 5 |
7 | 014 | P | D | 4 | 1 | 2491 | 623 | 0 | 6 |
8 | 016 | J | E | 8 | 1 | 5219 | 652 | 1 | 5 |
9 | 017 | J | B | 9 | 1 | 4162 | 462 | 1 | 8 |
10 | 018 | P | A | 4 | 2 | 3529 | 882 | 3 | 6 |
11 | 019 | J | B | 8 | 1 | 3793 | 474 | 0 | 7 |
12 | 020 | P | F | 6 | 1 | 3806 | 634 | 1 | 4 |
13 | 023 | P | A | 9 | 1 | 4671 | 519 | 3 | 8 |
14 | 024 | P | G | 4 | 2 | 4135 | 1034 | 0 | 5 |
15 | 025 | J | B | 5 | 1 | 3532 | 706 | 1 | 2 |
16 | 037 | P | A | 11 | 2 | 11,109 | 1010 | 6 | 9 |
17 | 038 | P | H | 8 | 1 | 3331 | 417 | 1 | 7 |
18 | 046 | J | I | 11 | 2 | 12,519 | 1138 | 4 | 12 |
Id | Search Category Name | Description | Data Type |
---|---|---|---|
1 | Base_Material_Name | Name of base material | String |
2 | Base_Material_Specification | Specification of base material | String |
3 | Base_Material_Shape | Shape of base material | String |
4 | Base_Material_Grain_Size | Grain size of base material in micrometers | Number |
5 | Base_Material_Supplier_Production | Name of supplier who produced base material | String |
6 | Manufacturing_Process_Name | Name of manufacturing process | String |
7 | Manufacturing_Machine | Name of machine used for manufacturing | String |
8 | Manufacturing_Laser_Power | Laser power in watts used for manufacturing | Number |
9 | Manufacturing_Scan_Strategy | Scan strategy used for manufacturing | String |
10 | Manufacturing_Layer_Thickness | Layer thickness in micrometers used for manufacturing | Number |
11 | Manufacturing_Energy_Density | Energy density in J/mm3 used for manufacturing | Number |
12 | Heat_Treatment_Method | Name of heat treatment method | String |
13 | Heat_Treatment_Temperature | Temperature in degrees Celsius used during heat treatment | Number |
14 | Heat_Treatment_Duration | Duration in hours of heat treatment | Number |
15 | Heat_Treatment_Atmosphere | Atmosphere during heat treatment | String |
16 | Heat_Treatment_Cooling | Name of cooling type during heat treatment | String |
17 | Microstructure_Hardness | Measured Vickers hardness in HV | Number |
18 | Microstructure_Density_Pores | Pore density measured during microstructure test | Number |
19 | Tensile_Test_Samples_Surface_Condition | Surface condition of samples during tensile test | String |
20 | Tensile_Test_Youngs_Modulus | Young’s modulus in GPa during tensile test | Number |
21 | Tensile_Test_Yield_Strength | Measured yield strength in MPa in tensile test | Number |
22 | Tensile_Test_Ultimate_Strength | Measured ultimate strength in MPa in tensile test | Number |
23 | Tensile_Test_Elongation_At_Failure | Measured elongation at failure in tensile test | Number |
24 | Fatigue_Samples_Surface_Condition | Surface condition of samples in fatigue test | String |
25 | Fatigue_Samples_Loading_Direction | Loading direction of test samples in fatigue test | String |
26 | Fatigue_Test_Bench | Test bench used in fatigue test | String |
27 | Fatigue_Test_Standard | Standard of fatigue test | String |
28 | Fatigue_Test_Type_Test | Type of fatigue test | String |
29 | Fatigue_Test_Loading_Direction | Loading direction in fatigue test | String |
30 | Fatigue_Test_Frequency | Test frequency in fatigue test | Number |
31 | Fatigue_Test_R | Stress ratio R in fatigue test | Number |
32 | Fatigue_Test_NG | Maximum number of cycles in fatigue test | Number |
33 | Fatigue_Test_Cause_Failure | Cause of failure of sample in fatigue test | String |
Search Category | Feasible Values for Keyword | Regular Expression for Keyword | Feasible Values for Value | Regular Expression for Value |
---|---|---|---|---|
Shape of base material | build, built, produced, producing, manufacturing, manufactured, fabricated, fabricating | (buil(d|t){1,1}(d|t){0,1}|produc(ed|ing){0,1}|manufactur(ed|ing){0,1}|fabricat(ed|ing){0,1}) | powder, solid, liquid | (powder|solid|liquid) |
Grain size | grain diameter, particle diameter, powder size, particle size, grain size, granulometry | (grain(\s?)diameter|particle(\s?)diameter|powder(\s?)size|particle(\s?)size|size.*particle|grain(\s?)size|granulometry) | “<floating-point number> <prefix> m” | (\d){1,}\s?((.){1,1}\s?(m){1,1})([\s.,]){0,} |
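As a sketch of how the keyword/value pairs in the table are applied in practice, the snippet below uses simplified reconstructions of the grain-size patterns; the exact quantifiers and the single-character unit prefix are assumptions, not the published definitions.

```python
import re

# Simplified reconstructions of the grain-size keyword/value patterns
# from the table above (quantifiers are assumptions).
keyword_re = re.compile(
    r"(grain\s?diameter|particle\s?diameter|powder\s?size"
    r"|particle\s?size|grain\s?size|granulometry)",
    re.IGNORECASE,
)
# "<floating-point number> <prefix> m", e.g. "45 μm"; the single "."
# stands in for an arbitrary unit-prefix character.
value_re = re.compile(r"\d+(?:[.,]\d+)?\s?.\s?m(?=[\s.,]|$)")

sentence = "The powder showed a particle size of 45 μm."
if keyword_re.search(sentence):
    m = value_re.search(sentence)
    print(m.group(0) if m else "keyword only, no value")
```

Permitting any character as the unit prefix keeps the pattern short, but it also matches non-prefix characters—one concrete way loosely defined patterns produce the false positives noted in the limitations.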
Column | Data Type | Description | Example Value |
---|---|---|---|
Doc ID | String | Unique document identifier | 002-AM |
DOI | String | Digital Object Identifier for the document | 10.1038/nature12345 |
Filename | String | Name of the PDF file | 002-AM.pdf |
Title | String | Title of the document | Climate Impact on Urban Areas |
Author | String | Authors of the document | Jane Smith; John Brown |
Pages | Integer | Number of pages in the document | 12 |
Modified | DateTime | Date and time of last modification (YYYY-MM-DD hh:mm:ss) | 2021-05-14 10:23:45 |
Created | DateTime | Date and time of creation (YYYY-MM-DD hh:mm:ss) | 2021-05-10 09:00:00 |
Creator | String | Software or person who created the document | LaTeX (TeX Live 2021) |
Producer | String | Software that produced the PDF | PDFTeX-1.40.21 |
Keywords | String | Keywords related to the document | climate, urban, change |
Subjects | String | Subject areas of the document | Environmental Science |
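The metadata schema above maps naturally onto a typed record; a minimal sketch follows, with field names taken from the table and the table's own example values used for illustration.

```python
from dataclasses import dataclass, asdict
from datetime import datetime

# One row of the document-metadata table, modeled as a typed record.
@dataclass
class DocMetadata:
    doc_id: str
    doi: str
    filename: str
    title: str
    author: str
    pages: int
    modified: datetime
    created: datetime
    creator: str
    producer: str
    keywords: str
    subjects: str

# Populate the record with the example values from the table.
row = DocMetadata(
    doc_id="002-AM", doi="10.1038/nature12345", filename="002-AM.pdf",
    title="Climate Impact on Urban Areas", author="Jane Smith; John Brown",
    pages=12,
    modified=datetime(2021, 5, 14, 10, 23, 45),
    created=datetime(2021, 5, 10, 9, 0, 0),
    creator="LaTeX (TeX Live 2021)", producer="PDFTeX-1.40.21",
    keywords="climate, urban, change", subjects="Environmental Science",
)
print(asdict(row)["doc_id"])
```

A typed record of this shape makes downstream validation (e.g., rejecting rows with missing DOIs) straightforward.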
Column | Data Type | Description | Example Value |
---|---|---|---|
Doc ID | String | Foreign key linking to the document in the metadata table | 002-AM |
Block ID | Integer | Identifier for a logical text block within the document | 1 |
Sentence ID | Integer | Identifier for a sentence within a block | 1 |
Token ID | Integer | Identifier for a token (word or punctuation) within a sentence | 1 |
Token | String | Surface form of the token as it appears in the text | Available |
Lemma | String | Lemma or base form of the token | available |
POS | String | Part-of-speech tag (e.g., NOUN, VERB, ADJ) | ADJ |
Entity | String | Named entity tag (if present) or empty if not applicable | |
Section ID | String | Identifier for the section of the document (e.g., Introduction, Methods) | Prolog |
TypeBlock | String | Type of text block (e.g., paragraph, title, table) | 1-Text |
Page | Integer | Page number in the PDF where the token is located | 1 |
IdxGlobalStart | Integer | Global character offset for the start of the token in the document | 1 |
IdxGlobalEnd | Integer | Global character offset for the end of the token in the document | 10 |
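To illustrate how rows of the token table above can be produced, the sketch below tokenizes one text block with naive regex-based sentence and token splitting; lemma, POS, and entity tagging would come from an NLP library and are omitted. Offsets here are zero-based (the table's example uses one-based indexing), and the simple `index()` lookup assumes sentences are unique within the block.

```python
import re

def tokenize_block(doc_id: str, block_id: int, text: str) -> list[dict]:
    """Split a text block into sentences and tokens, emitting one row per
    token with global character offsets (IdxGlobalStart/IdxGlobalEnd)."""
    rows = []
    for sent_id, sent in enumerate(re.split(r"(?<=[.!?])\s+", text), start=1):
        sent_offset = text.index(sent)  # assumes unique sentences
        for tok_id, m in enumerate(re.finditer(r"\w+|[^\w\s]", sent), start=1):
            start = sent_offset + m.start()
            rows.append({
                "Doc ID": doc_id, "Block ID": block_id,
                "Sentence ID": sent_id, "Token ID": tok_id,
                "Token": m.group(0),
                "IdxGlobalStart": start,
                "IdxGlobalEnd": start + len(m.group(0)),
            })
    return rows

rows = tokenize_block("002-AM", 1, "Layer thickness was 30 μm. Tests followed.")
print(rows[0]["Token"], rows[0]["IdxGlobalStart"], rows[0]["IdxGlobalEnd"])
```

The global offsets are what allow an extracted value to be traced back to its exact position in the source PDF text.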
Local ID | Global ID | Values | Keys | Sections | Sentence IDs |
---|---|---|---|---|---|
1 | 001 | NA | | | |
2 | 002 | 30 μm | layer, thickness | Setup | 65 |
3 | 003 | 50 μm | layer, thickness | Approach | 120 |
4 | 008 | NA | | | |
5 | 011 | NA | | | |
6 | 013 | NA | | | |
7 | 014 | NA | | | |
8 | 016 | NA | | | |
9 | 017 | 60 μm | layer, thickness | Setup | 89 |
10 | 018 | NA | | | |
11 | 019 | 60 μm | layer, thickness | Setup | 55 |
12 | 020 | 60 μm | layer, thickness | Approach | 79 |
13 | 023 | NA | | | |
14 | 024 | 100 μm | thickness | Approach | 128 |
15 | 025 | 60 μm | layer, thickness | Setup | 85 |
16 | 037 | NA | | | |
17 | 038 | NA | | | |
18 | 046 | 50 μm | layer, thickness | Nomenclature | 129 |
Id | Search Category Name | Value |
---|---|---|
1 | Name of base material | Ti-6Al-4V |
2 | Specification of base material | String |
3 | Shape of base material | Tubular |
4 | Grain size in micrometer of base material | Not specified |
5 | Name of supplier who produced base material | String |
6 | Name of manufacturing process | Laser Powder Bed Fusion |
7 | Name of machine used for manufacturing | Renishaw AM 250 |
8 | Laser power in watts used for manufacturing | 400 W |
9 | Scan strategy used for manufacturing | Not specified |
10 | Layer thickness in micrometer used for manufacturing | 50 μm |
11 | Energy density in J/mm3 used for manufacturing | Not specified |
14 | Duration in hours of heat treatment | Not specified |
15 | Atmosphere during heat treatment | Argon |
16 | Name of cooling type during heat treatment | Not specified |
17 | Measured Vickers hardness in HV | Not specified |
18 | Pore density measured during microstructure test | Not specified |
26 | Test bench used in fatigue test | Not specified |
27 | Standard of fatigue test | ASTM Standard |
28 | Type of fatigue test | Axial-torsion |
29 | Loading direction in fatigue test | In-phase axial-torsion |
30 | Test frequency in fatigue test | 0.25 to 12 Hz |
31 | Stress ratio R in fatigue test | R = −1 |
32 | Maximum number of cycles in fatigue test | Not specified |
33 | Cause of failure of the sample in the fatigue test | Not specified |
Measure | Definition/Formula | Value |
---|---|---|
Accuracy A | (TP + TN)/(TP + FN + FP + TN) | 0.766 |
Precision P | TP/(TP + FP) | 0.908 |
Recall R | TP/(TP + FN) | 0.693 |
Specificity S | TN/(TN + FP) | 0.885 |
F1 measure | 2 × (P × R)/(P + R) | 0.786 |
Measure | Definition/Formula | Value |
---|---|---|
Accuracy A | (TP + TN)/(TP + FN + FP + TN) | 0.783 |
Precision P | TP/(TP + FP) | 0.988 |
Recall R | TP/(TP + FN) | 0.659 |
Specificity S | TN/(TN + FP) | 0.987 |
F1 measure | 2 × (P × R)/(P + R) | 0.790 |
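The measures in both tables follow directly from confusion-matrix counts; a small helper makes the definitions explicit. The counts in the example call are illustrative only, not those of the study's evaluation.

```python
# Evaluation measures computed from confusion-matrix counts, matching the
# definitions in the tables above.
def metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Illustrative counts only (not the study's actual confusion matrix).
print(metrics(tp=90, fp=10, fn=40, tn=60))
```

The F1 measure is the harmonic mean of precision and recall, which is why the rule-based method's higher recall does not fully compensate for its lower precision in the comparison above.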
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Feldhoff, K.; Wiemer, H.; Träger, P.; Kühne, R.; Zimmermann, M.; Ihlenfeldt, S. Automatic Information Extraction from Scientific Publications Based on the Use Case of Additive Manufacturing. Appl. Sci. 2025, 15, 9331. https://doi.org/10.3390/app15179331