Layout Aware Semantic Element Extraction for Sustainable Science & Technology Decision Support

New scientific and technological (S&T) knowledge is being introduced rapidly, and hence the effort required to understand and analyze newly published S&T documents is increasing daily. Automated text mining and vision recognition techniques alleviate the burden somewhat, but the variety of document layout formats and knowledge content granularities across the S&T field makes this challenging. Therefore, this paper proposes LA-SEE (LAME and Vi-SEE), a knowledge graph construction framework that simultaneously extracts meta-information and useful image objects from S&T documents in various layout formats. We adopt Layout-aware Metadata Extraction (LAME), which can accurately extract metadata from various layout formats, and implement a transformer-based instance segmentation model (i.e., Vision-based Semantic Elements Extraction (Vi-SEE)) to maximize vision-based semantic element recognition. Moreover, to construct a scientific knowledge graph consisting of multiple S&T documents, we newly define an extensible Semantic Elements Knowledge Graph (SEKG) structure. So far, we have extracted about 6 million semantic elements from 49,649 PDFs. In addition, to illustrate the potential of our SEKG, we provide two promising application scenarios: a scientific knowledge guide across multiple S&T documents and question answering over scientific tables.


Introduction
Decision support systems or specific methods for science and technology (S&T) problems or social issues can be employed effectively across various domain user types related to policymaking, research topic search, research method survey, comparing experimental results, emerging technology trend analyses, etc.
Junior researchers (or novice users) may have difficulty collecting target information because they lack domain knowledge. However, even domain experts usually feel burdened by the vast and rapidly growing body of scientific literature, expert blogs, commercial technical reports, and patents. Search engines are common tools for information seeking, allowing users to access related documents or paragraphs containing search queries on the premise that full document texts have been indexed. An alternative way to find information relevant to a research topic or job is to visit online Q&A communities, such as Knowledge iN (of Naver), Reddit, and/or Quora. However, although most deliver substantial information, they can sometimes contain prejudiced opinions or commercial references that are untrustworthy.
Suppose these S&T documents were well separated and re-organized as reusable knowledge. Then, users could selectively access only relevant knowledge and utilize it in decision-making processes. Unfortunately, although the requirement is becoming critical, few decision support systems are available due to many technical implementation

Metadata Extraction from Articles
Research on document structure and information extraction has been steadily ongoing. Primary research directions for metadata extraction can be categorized into rule-based, textual-feature-based machine learning, and vision-based object detection approaches. For example, SVM [14], CNN [15], and CRF [16,17] algorithms are popular techniques used with textual features. Pre-training approaches based on large-scale text corpora have recently shown significant successes in several NLP tasks, including text classification and sequential labeling [12,18-22].
Sufficient high-quality training datasets annotated with target labels are essential to implement a modest metadata extraction model. Each dataset may have a different annotation level, depending on the research purpose. For example, reference [14] had sentence-level metadata annotations. Reference [16] applied BIO tagging for tokenized words to train a Bi-LSTM-CRF model for metadata extraction, and reference [23] used paragraph-level (or clustered text) annotations. Other studies considered font, font size, and location information to re-organize text chunks to detect layout and extract metadata [24,25]. In contrast, reference [26] automatically annotated document layout elements (i.e., text, titles, lists, tables, and figures) to apply object detection techniques [27,28] for document layout analysis, which is related to metadata extraction.

Vision-Based Document Analysis
Several studies introduced transformers into object detection tasks, motivated by the recent successes of transformers in NLP [29]. The detection transformer (DETR) [30] reconstructed complex object detection components by employing a simple transformer encoder and decoder architecture, providing a neck component to bridge the CNN body for feature extraction and a detector head for prediction. However, although DETR achieved high detection performance, it suffered from slow convergence, e.g., DETR required 500 epochs, whereas conventional Faster R-CNN [27] training required fewer than 50 epochs [31]. Recent studies have confirmed the great potential of end-to-end object detection [30,32,33], and bipartite matching cost has become an essential component for achieving it. For example, in contrast to [34,35], which explored end-to-end mechanisms for segmentation with recurrent neural networks, the end-to-end ISTR [36] used a similarity metric for mask embeddings as the bipartite matching cost for masks and incorporated transformers [29] to improve end-to-end instance segmentation. We use ISTR in the proposed vision-based semantic element detection task because it showed SOTA-level performance even with approximation-based suboptimal embeddings.
Document layout analysis is an essential task in automatic document understanding. Its main goal is to identify regions of interest in unstructured documents and recognize the role of each region. However, the task is non-trivial due to document layout diversity and complexity. Many deep learning models have been proposed for this task in the computer vision (CV) and NLP fields. Most consider either only visual features [26], only textual features [12], or both modalities [11]. Visual features can identify some regions (e.g., figures, tables), whereas textual features are critical to discriminate visually similar regions (e.g., keywords, abstract, affiliation, author names, etc.).
However, single-modality models have insufficient capability for layout modeling, hence multi-modal approaches have recently become more popular [9,10,37]. However, the available datasets typically contain only hundreds of labeled pages due to the prohibitive cost of annotating many layout objects per page, which is insufficient to train and evaluate deep learning based models [27]. Although some multi-modal approaches use automatic data construction methods [10,11,38], they are not interoperable because they employ fundamentally different layout object types and training data formats.

Scientific Knowledge Extraction
The NLP community includes considerable research on extracting information or knowledge from the scientific literature. Earlier studies focused on identifying citation contexts [39] and extracting key concepts [40] or phrases [41,42]. Most approaches attempted to construct knowledge bases by defining scientific entities and extracting semantic relationships between the entities [2][3][4]. More recently, reference [8] constructed task-dataset-metric triples from NLP papers by extracting entities and their relationships within and across different sentences/documents.

Document Modeling
Ronzano and Saggion [43] proposed a platform to extract vast amounts of structural and semantic information from scientific publications, represented as Resource Description Framework (RDF) datasets. Yang et al. [44] designed a weakly-supervised text-to-graph neural network to provide concise, structured representations for documents, by generating concept maps connecting important concepts and interaction links. Zheng et al. [45] introduced four granularity levels for document modeling: documents, paragraphs, sentences, and tokens, reflecting the natural hierarchical document structure. More recently, reference [9] defined a document structure tree model to organize knowledge element extraction from documents and determine their relationships, such as juxtaposition and inclusive, between sections at different levels.
The above works motivated us to extract key semantic elements within a document and derive critical links across multiple documents using the proposed document network structure. Unlike existing knowledge graph construction research, S&T documents are at the center of reusable knowledge extraction in this study. Therefore, the general metadata of documents and their figures, tables, and references are considered the semantic elements of knowledge construction. Section 3.3 defines the semantic element knowledge graph (SEKG), because a large number of documents can be interconnected to build vast S&T knowledge. The SEKG can be linked to knowledge graphs based on triples (e.g., relation-entity1-entity2), but we focus on extracting and connecting the document's metadata with the figures and tables of a detailed section or page within the document. There is currently no readily available multi-modal training data for semantic elements at different levels. Therefore, a text-feature-based model (i.e., LAME) is in charge of metadata extraction, and a vision-based object detection model (i.e., Vi-SEE) is responsible for the remaining semantic elements. Moreover, our post-processing delineates the boundaries of ambiguous semantic elements for more accurate semantic element identification.

LA-SEE Framework
This study proposes the LA-SEE framework to extract meta-information, text, sub-titles, references, figures, tables, and captions from scientific PDFs. Figure 1 shows that the proposed LA-SEE framework comprises three major components.

LAME
We adopted our prior work, the LAME framework [13], to discriminate metadata elements on the first page of a document, considering text block characteristics in heterogeneous meta-information layouts. Figure 2 shows that the LAME framework comprises three major components: automatic layout analysis, layout-aware training data construction, and metadata extraction. Stage 1 analyzes the PDF's first page using PDFMiner; because the PDFMiner parsing results are incomplete, the output is then subjected to reconstruction, refinement, and adjustment procedures to identify the various metadata on the first page. Stage 2 builds the training datasets used in Stage 3 by matching the metadata identified in Stage 1 against previously verified metadata values. However, the compared textual content does not always match exactly. Therefore, we accepted as training data only those layout text elements from Stage 1 whose fields were almost identical (i.e., highly similar) to the verified values. For efficient computation, we used a mixed textual-similarity measure based on the Levenshtein distance and the bilingual evaluation understudy (BLEU) score.
The generated dataset has no gold-standard answer set against which results can be compared, and manual comparison would require considerable time and resources. Thus, we evaluated the quality of the training data generated in Stage 2 indirectly, through the metadata extraction performance in Stage 3. Finally, a novel metadata extractor is defined by pretraining the Layout-MetaBERT model with the Stage 2 training data and fine-tuning it for the target corpus. Referring to the experimental results for the LAME framework [13], we chose a fine-tuned Layout-MetaBERT (base), which showed robust metadata extraction performance (F1 = 94.6%) even for unseen journals with diverse layouts.
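The Stage 2 filtering step can be illustrated with a minimal sketch. The text only states that the measure mixes Levenshtein distance and BLEU; the equal weighting, the 0.9 threshold, the simplified BLEU (up to bigrams, whitespace tokens), and all function names below are assumptions made for illustration, not the settings used in LAME [13].

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1], where 1 means identical."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    multiplied by a brevity penalty (assumed variant, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    orders = min(max_n, len(cand), len(ref))
    log_sum = 0.0
    for n in range(1, orders + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(count, r_ngrams[gram]) for gram, count in c_ngrams.items())
        if clipped == 0:
            return 0.0
        log_sum += math.log(clipped / sum(c_ngrams.values())) / orders
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_sum)


def mixed_similarity(extracted: str, verified: str, weight: float = 0.5) -> float:
    """Weighted mix of the two measures; the weight is an assumption."""
    return (weight * levenshtein_similarity(extracted, verified)
            + (1 - weight) * simple_bleu(extracted, verified))


def accept_as_training_example(extracted: str, verified: str,
                               threshold: float = 0.9) -> bool:
    """Keep only near-identical matches, mirroring the Stage 2 filter."""
    return mixed_similarity(extracted, verified) >= threshold
```

With this sketch, a text block extracted from the first page would be kept as a labeled training example only when it scores above the threshold against the verified metadata value for the same field.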

Vi-SEE
Figure 3 describes the proposed Vi-SEE model, which utilizes ISTR [36] to detect objects on every page of the PDF document except the first. Page images from the PDF pass through the ISTR-based detection model to identify candidate bounding boxes (BBoxes) for text, titles, lists, figures, and tables. The input image passes through a convolutional neural network with a ResNet backbone, producing a feature pyramid. RoI (Region of Interest) features and image features are separated from the feature pyramid, and the image features are concatenated with position features. A transformer encoder with dynamic attention then fuses the image + position features with the RoI features for the prediction head. Each detected area is converted into actual data through a set of post-processing procedures that use the detected BBoxes and their corresponding categorical labels: (1) text extraction for text, lists, and titles, (2) figure/table extraction, and (3) caption extraction. Previous studies performed only area-level detection, whereas the proposed modules include detailed techniques to extract precise regions for semantic element areas and related texts.

ISTR Selection
Before selecting ISTR [36], we compared three popular object detection models, Mask R-CNN [28], DETR [30], and ISTR, for accurate semantic element extraction from documents. They all use the ResNet [46] backbone. Mask R-CNN is an instance segmentation model derived from Faster R-CNN [27]. It has a similar structure to Faster R-CNN except for its object mask branch, RoI alignment, and the decoupling of mask prediction from class prediction. However, Mask R-CNN suffers from low detection speed due to the non-maximum suppression (NMS) stage in its detection pipeline.
On the other hand, DETR omits NMS from the detection pipeline while reaching a speed similar to that of Faster R-CNN. It predicts all objects at once with a simple pipeline that requires neither NMS nor anchors, and it is good at finding large objects but fails to find small and medium-sized objects. The ISTR algorithm provides end-to-end instance segmentation by regressing low-dimensional embeddings rather than raw masks, which enables training to be conducted effectively with a small number of matched samples. Regressing the embeddings allows a recurrent refinement strategy that processes detection and segmentation concurrently, boosting performance: it updates the query boxes and refines the prediction sets. We chose ISTR as the main Vi-SEE algorithm because there are many medium and large objects in our target documents.
The primary training method of ISTR follows DETR [30]. A key point in ISTR learning is its refinement stage. ISTR builds on the self-attention of the transformer [29]; within multi-head attention, a dynamic attention module is added so that the RoI and image features are fused effectively. Furthermore, the refinement stages can improve the predicted bounding boxes, classes, and masks by updating the query boxes.
When a document page enters the model as input, object detection is performed by the ISTR model. The object detection task is to detect instances of objects of a certain class within an image by considering the bounding box area, segmentation area, and candidate labels.
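For reference, the standard scaled dot-product and multi-head attention of the transformer [29], on which transformer-based detectors build, can be written as follows. This is the generic formulation only; the ISTR-specific dynamic attention module [36] is not reproduced here, and the symbols ($Q$, $K$, $V$, $d_k$, $W$) are the conventional ones rather than notation taken from this paper.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
```

Here $d_k$ is the key dimension and $h$ the number of heads; in self-attention, $Q$, $K$, and $V$ are all projections of the same input features.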

• Text extraction: BBox areas detected by the ISTR-based model for text, lists, and titles are converted into text using PDFMiner [47] parsing results. PDFMiner returns parsed text with position information for the PDF document, and we extract the text lying between the left-top and right-bottom positions of each detected area. The extracted semantic elements are references, paragraphs, and section titles.
• Caption extraction: The midpoint of a figure (or table) BBox is expressed by Equation (1), and that of a text BBox as
$T_{mid}(x_1, y_1) = (|x_2 - x_1|, |y_2 - y_1|)$ (2)
The distance between each pair of midpoints is expressed by Equation (3), together with the midpoint for a caption. Caption text is extracted in the same way as for normal text extraction. Figure 4 shows examples of objects extracted through Vi-SEE as well as the semantic elements and their labels.
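The two post-processing ideas above (collecting parsed text inside a detected BBox, and pairing a figure or table with the closest text block as its caption) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the geometric center of each box and Euclidean distance, assumes image-style coordinates with y increasing downward, and represents parser output as plain (box, text) pairs rather than calling PDFMiner. All names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    x1: float  # left
    y1: float  # top (assumed image-style coordinates, y grows downward)
    x2: float  # right
    y2: float  # bottom

    def center(self) -> tuple:
        """Geometric center of the box (an assumed midpoint definition)."""
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)

    def contains(self, other: "BBox") -> bool:
        """True when `other` lies fully inside this box."""
        return (self.x1 <= other.x1 and self.y1 <= other.y1
                and self.x2 >= other.x2 and self.y2 >= other.y2)


def texts_in_region(region: BBox, parsed_items: list) -> str:
    """Join parsed text whose boxes fall inside a detected region,
    reading top-to-bottom and then left-to-right."""
    inside = [(box, text) for box, text in parsed_items if region.contains(box)]
    inside.sort(key=lambda item: (item[0].y1, item[0].x1))
    return " ".join(text for _, text in inside)


def nearest_text_box(obj: BBox, text_boxes: list) -> BBox:
    """Pick the text box whose center is closest to the object's center."""
    return min(text_boxes, key=lambda tb: math.dist(obj.center(), tb.center()))


# Hypothetical page: one figure, a caption just below it, a distant paragraph.
figure = BBox(100, 100, 400, 300)
caption = BBox(100, 310, 400, 330)
paragraph = BBox(100, 500, 400, 700)
caption_candidate = nearest_text_box(figure, [paragraph, caption])
```

A real pipeline would also need to convert between PDF and image coordinate systems and to restrict candidates to text blocks that look like captions; both are omitted here.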

Organizing Knowledge with SEKG for Multiple Documents
Many applications that require analyzing a large amount of knowledge from various angles become possible once the knowledge relationships in S&T documents are identified and knowledge from different documents is interconnected. Suppose the semantic elements representing knowledge across a considerable number of documents are well organized. Then, researchers (or policymakers) can expedite their decision-making by streamlining information/knowledge collection and analysis. For example, reference [48] performed a behavioral study on citations, reference [3] extracted tasks, datasets, metrics, and scores from NLP papers to automatically construct a leaderboard, and reference [9] suggested a metaknowledge construction framework and document structure tree model to reduce gaps between human knowledge perception and entity-relationship triplets.
Influenced by those studies, we defined an SEKG for multiple document structures that can connect multiple semantic elements in a single document, or across multiple documents, as shown in Figure 5. Relationships are identified between 11 semantic element types extracted from documents using the proposed LAME and Vi-SEE modules, and mapped under the SEKG structure. The first page of most documents includes significant metadata, including author name(s) and affiliation(s), publisher, abstract, and introduction. We regarded these metadata separately from the document's contents that were not included in a specific page or section. These metadata elements provide an essential reasoning link when several documents are linked, as shown in Figure 6.
Semantic elements extracted from a document have a hierarchical structure from the main section to the sub-section. However, it is essential to consider that figures (or tables) can be cited more than once from different pages. Therefore, the proposed SEKG structure maps extracted semantic elements to network nodes rather than to a strict hierarchy, considering the various relationships connecting figures, tables, and references.
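To make the node-based design concrete, the following minimal sketch stores semantic elements as nodes and typed relations as edges, so that one figure can be cited from several sections and two documents can be linked through shared metadata. The class, node kinds, relation names, and identifiers are hypothetical choices for illustration; the paper does not specify the SEKG storage format.

```python
from collections import defaultdict


class SEKG:
    """Tiny in-memory graph: nodes carry a kind, edges carry a relation."""

    def __init__(self):
        self.nodes = {}                 # node_id -> attribute dict
        self.edges = defaultdict(set)   # (source_id, relation) -> target ids

    def add_node(self, node_id: str, kind: str, **attrs) -> None:
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges[(source, relation)].add(target)

    def targets(self, source: str, relation: str) -> set:
        """Nodes reached from `source` via `relation`."""
        return set(self.edges.get((source, relation), set()))

    def sources(self, relation: str, target: str) -> set:
        """Nodes that point to `target` via `relation`."""
        return {s for (s, r), ts in self.edges.items()
                if r == relation and target in ts}


kg = SEKG()
kg.add_node("aff/lab-a", "affiliation")          # hypothetical shared metadata
for doc in ("doc1", "doc2"):
    kg.add_node(doc, "document")
    kg.add_edge(doc, "has_affiliation", "aff/lab-a")

kg.add_node("doc1/fig1", "figure", caption="Model overview")
for sec in ("doc1/sec2", "doc1/sec4"):
    kg.add_node(sec, "section")
    kg.add_edge(sec, "cites", "doc1/fig1")       # same figure cited twice

citing_sections = kg.sources("cites", "doc1/fig1")
linked_documents = kg.sources("has_affiliation", "aff/lab-a")
```

A strict tree could attach the figure to only one parent section, whereas the graph keeps both citing sections and the cross-document affiliation link queryable.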

Data for Metadata Extraction
We use the first pages of 65,007 PDF documents from 70 S&T journals to reflect various document layout formats for the metadata extraction task. It is the same dataset used in our prior work [13]. We extracted major metadata elements, such as titles, author names, author affiliations, keywords, and abstracts, in Korean and English based on the automatic layout analysis in Section 3.1. Among the 70 journals, two were only in Korean, 23 were only in English, and 45 were in both Korean and English. Automatic labeling with ten labels was applied to each layout element on the first page, and layout elements that did not correspond to any target metadata were labeled O. Table 1 summarizes the automatically generated training data.
Large, high-quality annotated training datasets are essential to creating a robust object detection model. However, accurately detecting target semantic elements from PDF documents is still not guaranteed even if similar datasets exist [10,26,49], due to varying layout formats across journals. Therefore, we constructed the training dataset for the proposed Vi-SEE module with the following steps.
(1) Five major semantic elements (i.e., section title, paragraph, reference, table, and figure) were pseudo-labeled for the 70 scientific journals using the Mask R-CNN [28] model trained with the PubLayNet [26] dataset, following the COCO data format [50].
(2) The coco-annotator API was used to modify the mask parts that were not properly labeled, as shown in Figure 7. Five paid annotators cross-checked each other's work to guarantee annotation quality. The revised pages, which focused on error-prone cases, amounted to 20,079 and are summarized in Table 2. The data were randomly divided into training (i.e., fine-tuning) and testing sets at 80:20, respectively.
Table 3 shows the device information and the CUDA version used in the experiments. We used an i9-10900 CPU and two Tesla V100 GPUs to fine-tune the comparison targets along with the LAME and Vi-SEE models. Table 4 shows that the proposed LAME model effectively extracted metadata, achieving F1-score ≥ 90% for all extractions and average F1-score = 93%, confirming that pretraining the layout units with BERT schemes is feasible. Similarly, the proposed Vi-SEE model effectively detected semantic elements using vision, achieving average mAP = 85%.
Before building the Vi-SEE module, we performed transfer learning on the constructed data, based on three pre-trained models (as shown in Table 5): (1) a Mask R-CNN model pre-trained with PubLayNet data, (2) a DETR model pre-trained with ImageNet data, and (3) an ISTR model pre-trained with ImageNet data. We used the Mask R-CNN model trained with PubLayNet data based on the Detectron2 framework for our fine-tuning task. Both DETR and ISTR used the pre-trained ResNet-101 model [52,53] as the backbone in their fine-tuning stage. We followed the default configurations of each model.
All fine-tuned models performed reasonably on AP50, and the ISTR-based model achieved the highest mAP on AP50. Semantic elements in the documents were primarily large and medium scale, rather than small scale, when Common Objects in Context (COCO) metrics were applied [54]. The ISTR-based model detects medium and large objects well, achieving superior results to DETR, whose strength lies in detecting large objects. As Table 5 shows, ISTR is about 23% higher than Mask R-CNN and about 6% higher than DETR in average precision for medium objects (APm). In average precision for large objects (APl), it is about 5% higher than Mask R-CNN and about 3% higher than DETR, showing the best performance. Therefore, it detects the areas of semantic elements better than the other models.
Table 6 shows the statistics for 6,782,685 semantic elements extracted from 49,649 PDF documents using the proposed LA-SEE framework. Although semantic element counts for each type differ, this statistic is useful for estimating the number of knowledge instances acquired relative to the number of input documents.
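The AP50, APm, and APl figures discussed above rest on two simple building blocks from the COCO evaluation protocol: intersection over union (IoU) with a 0.5 threshold for AP50, and object-size buckets with area thresholds of 32² and 96² pixels. The sketch below illustrates only these building blocks, under the assumption that box area stands in for the segmentation-mask area COCO uses; it does not compute average precision itself, and the function names are illustrative.

```python
def box_area(box: tuple) -> float:
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(a: tuple, b: tuple) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_box = (max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(inter_box)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred: tuple, gt: tuple) -> bool:
    """AP50 counts a prediction as correct when IoU with the ground truth >= 0.5."""
    return iou(pred, gt) >= 0.5


def coco_size_bucket(box: tuple) -> str:
    """COCO size categories: small < 32^2, medium < 96^2, large otherwise."""
    area = box_area(box)
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"
```

Under these thresholds, page-level elements such as paragraphs, tables, and figures rendered at typical resolutions mostly fall into the medium and large buckets, which is consistent with the emphasis on APm and APl in the comparison.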

Decision Support Applications in Science and Technology Domain
When users search for desired information using search engines, such as Google or Naver, they enter relevant keywords and check the search results (title, snippets, summary, and document) one by one to determine whether they are relevant to their information needs. However, users often repeat the searching and checking process many times before they reach sufficient suitable information. The proposed SEKG framework provides relational information access that supports quick decision-making while reducing laborious information searches. For example, users can perform numerous reasoning types over the relationships among research data, relationships between semantic elements across multiple documents, related keywords through directly (or indirectly) linked documents, and a large KG comprising triple sentences.
The SEKG can be applied to several real-world S&T applications in various fields, including, but not limited to, science knowledge guides, question answering over a large number of figures and tables, and generating textual explanations for scientific issues. We describe how the SEKG satisfies scientific requirements with two applications below.

Scientific Knowledge Guide
Researchers commonly find and compare academic documents and research reports, and the time cost of information seeking is rapidly increasing because the number of documents grows continuously.
The proposed SEKG framework offers an elegant solution for the problem, providing relevant figures, tables, and captions that satisfy user requirements. A semantic query is sent to the SEKG for knowledge discovery, and the SEKG delivers a group of figures that meet the query conditions. Figure 8 shows that SEKG results differ significantly from general search engine results. For example, suppose an NLP beginner examines pre-trained models published in recent studies with the query, "Text Pretrained Model". The SEKG enables easy and quick access to figures of pretrained models (e.g., BERT, TinyBERT, BART, ELECTRA, and DialogBERT) mentioned in various research papers, the affiliations of the authors who developed these models, and related paper titles. New knowledge, such as research trends for major research institutions, can also be summarized as required by modifying user queries over the SEKG.

Questions and Answering over Tables
Tables in papers and reports deliver condensed information, commonly employing numerical values to represent actual experimental performance or statistical results. Therefore, accessing the values provides many benefits to researchers. Suppose a table search is performed to satisfy the user's information request. A conventional search returns tables whose captions and descriptions match the user's keywords, but the user must still select among them. In this case, the SEKG can directly access exact values in the tables while minimizing the selection process, or infer new values based on these values.
For example, suppose a user is interested in water pollution content in the environmental field and wants to know the mean annual pH for 2012-2014 water measurements. The SEKG will select the table containing pH values from 2012, 2013, and 2014 among several water pollution documents, as shown in Figure 9.
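As a rough illustration of the kind of lookup and aggregation behind the pH question, the sketch below filters rows of an already extracted table by year and averages one column. It is plain deterministic code, not the table QA module [55] or the mathematical reasoning injection [56]; the row layout, column names, and all numeric values are invented placeholders rather than data from the paper or its figures.

```python
from statistics import mean

# Hypothetical extracted table; the values are placeholders for illustration only.
ph_table = [
    {"year": 2011, "ph": 7.0},
    {"year": 2012, "ph": 7.2},
    {"year": 2013, "ph": 7.4},
    {"year": 2014, "ph": 7.6},
    {"year": 2015, "ph": 7.9},
]


def mean_over_years(rows: list, column: str, start: int, end: int) -> float:
    """Average `column` over rows whose year lies in [start, end]."""
    values = [row[column] for row in rows if start <= row["year"] <= end]
    if not values:
        raise ValueError("no rows in the requested year range")
    return mean(values)


mean_ph_2012_2014 = mean_over_years(ph_table, "ph", 2012, 2014)
```

In the scenario described in the text, selecting the right table and interpreting the natural-language question are the hard parts; once the cells are identified, the aggregation itself is this simple.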

Employing a table QA (Question-Answer) module [55] and mathematical reasoning injection [56] allows the user to obtain the average pH between 2012 and 2014. Rather than simply providing the information, the average value is computed using the language model's mathematical reasoning capability. Furthermore, several tables extracted from two or more document types in the same field can be processed and provided to suit the user's requirements.
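To make the pH example concrete, the following minimal sketch shows the aggregation that the answer corresponds to. It does not reproduce the table QA module [55] or the mathematical reasoning injection [56]; it replaces them with an explicit average over a table that has already been selected. The row layout, the function name `mean_value`, and all pH values are assumptions made purely for illustration.

```python
from statistics import mean

# Hypothetical rows of one extracted table. The years follow the example in
# the text; the pH values are made up and are not taken from Figure 9.
ph_table = [
    {"year": 2011, "ph": 7.0},
    {"year": 2012, "ph": 7.0},
    {"year": 2013, "ph": 7.5},
    {"year": 2014, "ph": 8.0},
]


def mean_value(rows, column, first_year, last_year):
    """Average `column` over rows whose year lies in [first_year, last_year]."""
    selected = [r[column] for r in rows if first_year <= r["year"] <= last_year]
    if not selected:
        raise ValueError("no rows in the requested year range")
    return mean(selected)


# Mean annual pH for 2012-2014, computed from the illustrative rows above.
mean_ph = mean_value(ph_table, "ph", 2012, 2014)
print(mean_ph)
```

In the framework described here, the language model is reported to perform this computation itself; the explicit function only illustrates which cells of the selected table contribute to the answer.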
For example, suppose a researcher wants to know Busan's BOD (the degree of contamination by organic substances) in 2003 and wonders if the distribution of pollutants affects BOD. In this case, SEKG finds a table of BOD content in water pollution-related documents, as shown in Figure 10, Table 2, which contains the content for Busan in 2003. For more complex questions, SEKG may search the pollutant distribution table (Figure 10, Table 1) and provide pollutant distribution for Busan (Figure 10, Table 2). Thus, various information comprising tables, figures, and statistics can be provided to suit user requirements regardless of specific fields and data quantities, by analyzing, processing, and combining data beyond simple information provisioning.
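The Busan example amounts to combining two tables from different documents through a shared key. The sketch below illustrates this idea under stated assumptions: the two tables are represented as lists of rows, the city name is the join key, and the column names, the function `bod_with_pollutants`, and every number are hypothetical placeholders rather than values from Figure 10.

```python
# Hypothetical rows from two tables extracted from different documents.
# All numbers are placeholders, not values from Figure 10.
bod_table = [
    {"city": "Busan", "year": 2003, "bod_mg_per_l": 2.0},
    {"city": "Seoul", "year": 2003, "bod_mg_per_l": 3.0},
]
pollutant_table = [
    {"city": "Busan", "pollutant": "domestic sewage", "share_pct": 60.0},
    {"city": "Busan", "pollutant": "industrial wastewater", "share_pct": 40.0},
    {"city": "Seoul", "pollutant": "domestic sewage", "share_pct": 70.0},
]


def bod_with_pollutants(city, year):
    """Look up BOD for (city, year) and attach that city's pollutant rows."""
    bod = next(
        (r["bod_mg_per_l"] for r in bod_table
         if r["city"] == city and r["year"] == year),
        None,  # no matching BOD row for this city and year
    )
    pollutants = {
        r["pollutant"]: r["share_pct"]
        for r in pollutant_table
        if r["city"] == city
    }
    return {"city": city, "year": year,
            "bod_mg_per_l": bod, "pollutants": pollutants}


answer = bod_with_pollutants("Busan", 2003)
print(answer)
```

Joining on the city name is the simplest possible choice; identifying which columns of real extracted tables are actually joinable is a separate problem that this sketch does not address.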

Conclusions and Future Work
This paper proposed the LA-SEE framework to build a reusable SEKG from various documents. In particular, 11 semantic element types were defined and extracted from various S&T journals using LAME and Vi-SEE. LA-SEE uses BERT-based metadata extraction with textual features and ISTR-based object detection with image features to achieve SOTA performance. As a result, we established a large-scale SEKG comprising about 6 million semantic elements using LAME and Vi-SEE and discussed two usage scenarios (i.e., scientific knowledge guide and QA over tables) to highlight the applicability and extensibility of the proposed SEKG framework. In the first scenario, SEKG made it possible to find and present figures of similar architectures belonging to semantically similar topics in several different documents. In the second scenario, we showed that it is possible to present values that satisfy user needs by accessing the values of joinable tables in different documents.
A limitation of this study is that the training data for the 11 semantic elements of SEKG are not consistent. The LAME and Vi-SEE training data are annotated at different levels (i.e., text or vision), and multi-modal features are not yet considered. Therefore, our future research needs to construct a dataset and apply an advanced training algorithm that take multi-modality into account. In addition, although the currently constructed datasets cover about 40 journals in different formats, there is still a limit to accurately processing S&T documents in various subject domains. When an S&T document from a new subject domain is given as input, extracting semantic elements may be challenging, so applying the framework to other domains requires generating a new dataset and training a new model.
Moreover, further work remains to better handle the various exceptions and errors that naturally occur due to formatting and related differences among documents. For example, LAME does not always correctly identify target elements, and Vi-SEE fails to distinguish figure regions comprising complex images. We plan to employ multi-modal transformer techniques, rather than single-modal approaches, to address these issues. This will require high-quality Optical Character Recognition (OCR) module(s) to convert document data into multi-modal training sets containing massive numbers of documents.
We will also investigate accurately extracting related figure and table descriptions and adding them as new semantic elements to SEKG. Although figures and tables are primary information in S&T documents, their corresponding descriptions are not currently considered. If SEKG were empowered with explanatory texts for figures and tables, it would be possible to build new scientific conversational AI applications by enabling table-to-text (or figure-to-text) functionality.