Automatic Taxonomy Classification by Pretrained Language Model

In recent years, automatic ontology generation has received significant attention in information science as a means of systemizing vast amounts of online data. As an initial attempt at ontology generation with a neural network, we previously proposed a recurrent neural network-based method. However, developments in natural language processing (NLP) now make it possible to update that architecture. In particular, the transfer learning of language models trained on large, unlabeled corpora has yielded a breakthrough in NLP. Inspired by these achievements, we propose a novel workflow for ontology generation comprising two-stage learning. Our results showed that our best method improved accuracy by over 12.5%. As an application example, we applied our model to the Stanford Question Answering Dataset to demonstrate ontology generation on real-world data. The results showed that our model can generate a good ontology, with some exceptions on real data, indicating future research directions for improving quality.


Introduction
In recent years, the Internet has yielded various technological evolutions [1], and users have needed to collect and select information according to their purposes. In such a situation, data structures called ontologies, which organize knowledge through a hierarchical structure as shown in Figure 1, have received considerable attention, as they are needed to systemize the vast amount of data on the Internet. In information science, ontologies are used for geographic information [2] and for establishing a general situation awareness framework [3]. However, when constructed manually, an ontology requires a large amount of time and deep knowledge of the target field. Therefore, there is a need for a mechanism that supports or substitutes for manual construction by creating ontologies from unstructured text.

Automation by machine learning is indispensable with respect to meeting ontology demand, yet it is difficult to generate an ontology directly from text via a single methodology. An ontology is expressed in unique and sophisticated formats such as the Web Ontology Language, which are quite different from typical language-model outputs. Additionally, the required configuration and scale of the ontology depend on its purpose. Therefore, we divided the complex ontology-generation task into the following three subtasks (Figure 2):
1. Extracting key phrases from a target corpus;
2. Generating a taxonomic structure consisting of hypernym-hyponym relationships from the extracted phrase set;
3. Creating detailed relationships between phrases according to the intended use of the ontology.
In this research, we focus on the automation of (2) and some cases of (3). Task (1) is a general problem in natural language processing (NLP), and sophisticated methods already exist [4,5]; we can apply the ideas behind those methods to our task. However, (2) and (3) are specific to this theme, and there is much room for research. Furthermore, they can be recast as classification problems of the kind frequently used in machine learning.
As a classification method for phrase-pair relationships, a model combining a recurrent neural network (RNN) and word embedding was proposed in previous research [6]. The overall architecture is based on a traditional sequential NLP model, with its input part remodeled into twin text-reader units to handle pairs of phrases. This method efficiently classifies phrase-pair relationships, especially from the perspective of memory usage and calculation cost. Given the rapid development of language models, however, the method can be improved. We describe two major problems with the current method below.
The first problem is handling out-of-vocabulary (OOV) words. In the existing method [7,8], each word is encoded into a feature vector by a pretrained vector representation before being entered into the model. If a phrase contains OOV words, they are assigned the zero vector. For this reason, a large vocabulary and its associated vectors are required to improve model versatility; however, as the vocabulary grows, more memory is required. In particular, in tasks focused on specialized corpora, such as ontology generation, rare domain-specific words appear frequently, and the OOV problem is directly linked to model performance. Beyond the pretrained model itself, we also apply our model to generate an example ontology for a question-answering (QA) dataset, the Stanford Question Answering Dataset (SQuAD v2.0).
The second problem is updating the text-processing architecture. Context information here means the additional information generated by word order and word combination. The previous method [6] uses an RNN to handle context information; however, the transformer architecture [9,10] has since become standard in NLP. Many papers using the transformer architecture have shown significant progress, and we expect that adopting it will also yield improvements in ontology generation. In summary, our subjects are handling OOV words and updating the architecture.
Our contributions can be summarized as follows. To overcome the limitations of our previous approach, we propose applying a pretrained language model (PLM), such as BERT [9], to the taxonomic-structure generation task. Recently, with the advent of BERT, transfer learning has become a popular trend for language models, enabling extremely high performance by simply finetuning on the target downstream tasks. Although various language-processing techniques are used in this method, we focus on byte pair encoding (BPE) and the transformer architecture. BPE was originally proposed as a simple data-compression technique and is also effective for subword tokenization in NLP [11]. Because BERT posted state-of-the-art scores on various tasks using transformers [12] as its core structure, many language models after BERT have adopted the transformer architecture at their base, also posting state-of-the-art results. Additionally, the processing units are expected to have high processing power for context information. From these features, we expect significant improvements both experimentally and functionally. We also illustrate a real example of applying our PLM-based ontology classifier to a QA dataset.

Lexico-Syntactic-Based Ontology Generation
The lexico-syntactic approach to extracting the taxonomic relations of words is a well-known traditional method of ontology generation [13]. The method uses the positions of words and specific fixed phrase patterns in text, analyzes patterns scattered across sentences, and explores the taxonomic relationships available to the ontology. Combined with machine learning, it can help handle exceptional syntax patterns and improve accuracy.
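The fixed phrase patterns described above (often called Hearst patterns) can be sketched with plain regular expressions. The two patterns below are illustrative assumptions, not the full pattern inventory used in [13]:

```python
import re

# A few Hearst-style patterns; each regex captures (hypernym, hyponym).
# The pattern set here is illustrative, not the full inventory from the literature.
PATTERNS = [
    re.compile(r"(\w+(?: \w+)?) such as (\w+(?: \w+)?)"),      # "animals such as dogs"
    re.compile(r"(\w+(?: \w+)?),? including (\w+(?: \w+)?)"),  # "fruits, including apples"
]

def extract_taxonomic_pairs(text):
    """Return (hypernym, hyponym) pairs found by fixed syntactic patterns."""
    pairs = []
    for pattern in PATTERNS:
        for hyper, hypo in pattern.findall(text):
            pairs.append((hyper, hypo))
    return pairs
```

For example, `extract_taxonomic_pairs("animals such as dogs")` yields the pair `("animals", "dogs")`; machine learning can then filter out false matches that such rigid patterns inevitably produce.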

Word Vectorization-Based Ontology Generation
As another method, the information content and vector representations of words have been used for taxonomic-relation classification in ontologies. This method addresses word pairs whose super/sub relationships cannot be found by a lexico-syntactic approach. It represents words numerically, with representations trained from word positions in sentences, and the relationships between them are determined by a machine-learning model that takes a vector as input. The representative example of this approach is a combination of word embedding and a support-vector machine [14].
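As a sketch of this approach, a phrase pair can be turned into a single feature vector before being passed to a classifier such as an SVM. The concatenation-plus-difference feature scheme and the embedding values below are illustrative assumptions; the actual features in [14] may differ:

```python
def pair_features(v1, v2):
    """Feature vector for a phrase pair: [v1; v2; v1 - v2].
    One common scheme for feeding a word pair to a vector-input classifier."""
    diff = [a - b for a, b in zip(v1, v2)]
    return list(v1) + list(v2) + diff

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
emb = {"dog": [0.9, 0.1, 0.0], "animal": [0.8, 0.2, 0.1]}
features = pair_features(emb["dog"], emb["animal"])  # 9-dimensional vector fed to an SVM
```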

RNN-Based Ontology Generation
RNN-based models were the first to solve the task of classifying relations between phrase pairs using grammatical and semantic interpretation [6]. Before this model, approaches combining word embedding with a simple classifier were mainstream for ontology generation, with the disadvantage that complex terms consisting of multiple words could not be compared. Using a neural network with an RNN-based recurrent structure made it possible to classify compound words. The model follows the structure of a traditional sequential language model and is characterized by two input layers for handling the two input phrases. It is roughly divided into four parts (embedding layer, RNN cells, a concatenation process, and classification layers), computed in order.

Hypernym and Synonym Matching for BERT
Hypernym and synonym matching using FinBERT [15] and SBERT [16], finetuned with data augmentation of financial terms, has shown improved performance.
Our system pretrains bidirectional encoder representations from transformers (BERT) [12] and A Lite BERT (ALBERT) [17] from scratch on WordNet data and shows far better performance.

Ontology Generation Using the Framework
Building a knowledge graph (KG) from scratch requires training machine-learning models on huge datasets, in addition to powerful NLP techniques and inference capabilities. The authors of [18] addressed this problem by using a third-party framework. In addition, they performed data cleaning on the unstructured corpus and error checking on the KGs generated using YAGO [19] and DOLCE [20].

Relationship Classification between Phrase Pairs
To generate an ontology, it is necessary to automate the task of classifying relationships between phrases. The central element of an ontology is the hierarchical structure of concepts, but it is difficult to build the hierarchical structure all at once, and the problem must be broken down into tasks that machine learning can easily handle. Therefore, in ontology generation, relationships between concepts (such as synonyms and hypernyms) are identified using a classification model, and a hierarchical structure is then built by organizing the acquired relationships. If a concept is interpreted as a set of phrases consisting of multiple words, a phrase can substitute for a concept. To summarize, ontology generation can be reduced to a classification problem that takes two phrases as input and outputs the relationship between them.
In classifying relationships between concepts, four factors are essential for processing the two input phrases and producing an output relation label:

• Word embedding: By converting words into corresponding vectors, they are put into a format that neural networks can easily handle. Additionally, because ontology generation often needs to deal with rare, task-specific words, learning these in advance on a large corpus is preferable.
• Acquisition of context information: Phrases used in ontology generation often consist of a small number of words. However, the connections between those words are stronger than in general sentences, and contextual information must be processed.
• Concatenation: Because the input data for this task comprise two independent phrases, the information must be combined at some stage in the process.
• Classifier: The information obtained in the preceding steps is fed to the classification model, which calculates the final output label.
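The four factors above can be sketched as a toy pipeline with pure-Python stand-ins: lookup-table embeddings, mean pooling for context, list concatenation, and a linear-plus-softmax classifier. All components and values are hypothetical simplifications of the real model:

```python
import math

# Toy components for the four factors above (all values hypothetical).
EMBED = {"dog": [1.0, 0.0], "animal": [0.8, 0.2], "cat": [0.9, 0.1]}
LABELS = ["synonym", "hypernym-hyponym", "hyponym-hypernym", "unrelated"]

def embed_phrase(phrase):                    # 1. word embedding
    return [EMBED.get(w, [0.0, 0.0]) for w in phrase.split()]

def encode(vectors):                         # 2. context information (here: mean pooling)
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concatenate(h1, h2):                     # 3. combine the two phrase encodings
    return h1 + h2

def classify(features, weights):             # 4. linear classifier + softmax
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]         # one probability per label in LABELS
```

A real system replaces each stand-in with a learned module (for example, pretrained embeddings for step 1 and an RNN or transformer for step 2), but the data flow is the same.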

Byte Pair Encoding
As mentioned in the previous section, BPE was originally proposed as a data-compression technique and is recognized for its versatility in the field of NLP [21][22][23]. The method is a variable-length encoding that treats text as a series of symbols and repeatedly merges the most frequent symbol pair into a new symbol. Compared with regular word-based methods, the number of words and associated vectors in the dictionary can be significantly reduced while maintaining expressiveness, and words that do not exist in the dictionary can be expressed as subwords.
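The merge loop described above can be sketched as follows: starting from words split into characters, the most frequent adjacent symbol pair is repeatedly merged into a new symbol. The toy vocabulary and the number of merges are arbitrary choices for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over a corpus of symbol sequences."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy vocabulary: word -> frequency, each word as a tuple of characters.
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
for _ in range(3):                           # learn three merges
    words = merge_pair(words, most_frequent_pair(words))
```

After three merges on this toy corpus, "lower" is represented by the two subwords "lo" and "wer", while the unseen-looking "lowest" still decomposes into known subwords rather than becoming OOV.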

BERT
BERT [12] is an architecture of multiple stacked transformer layers that trains deep, bidirectional representations from unlabeled text. It is pretrained on a large unsupervised text corpus, such as Wikipedia, using two objectives: masked word prediction and next-sentence prediction. Unlike traditional sequential or recurrent models, the attention architecture processes the whole input sequence at once, enabling all input tokens to be processed in parallel. The pretrained BERT model can then be finetuned with just one additional layer to obtain state-of-the-art results on a wide range of NLP tasks.

ALBERT
ALBERT [17] is a next-generation BERT-based architecture with an optimized configuration. The basic structure of ALBERT is the same as that of BERT; however, ALBERT reduces the parameters of the embedding layer and shares parameters across the transformer layers, improving memory usage and calculation time. Given the same computational resources, ALBERT can use more layers and a larger hidden size, allowing for a more complex and expressive model than BERT.

Noun Phrase Extraction Syntactic Parsing
Parsing is an NLP task that extracts the structure of a sentence by analyzing its grammar and word relationships. Unlike programming languages, natural language grammars can yield more than one syntax tree for a single sentence, so it is necessary to choose the best of multiple syntax trees. There are two types of parsing: rule-based and deep-learning-based. With rule-based analysis, the syntax is checked in the order in which rules are defined, meaning the result is often not optimal. In contrast, dependency parsing with a deep-learning model compares several patterns for a sentence and chooses the best one, and can be expected to provide better results than rule-based analysis. For this purpose, we use the state-of-the-art model [24] on the Penn Treebank [25].

Ontology Classifier Using PLM

Learning Procedure
The proposed PLM-based method uses fully divided two-stage learning: general learning and task-specific learning. In general learning, the PLM is trained on tasks that require only an unlabeled corpus, making it possible to learn from a large-scale corpus such as Wikipedia; the PLM thereby acquires general and broad linguistic knowledge from a large amount of sentence information. In task-specific learning, additional layers are appended to the end of the model to convert the output format, and the model is then finetuned with labeled task-specific data. This research follows this learning procedure, and we finetune a PLM on the taxonomic-relation classification task.
This research utilizes the Hugging Face Transformers library [26], implemented in Python. The library provides implementations of various language models, along with model weights that have acquired prior knowledge through the general learning process. It facilitates model comparison by collecting the latest models and generic pretrained models under a unified API. In this paper, we use this library for pretraining and for the general-learning models published by Google Research.

Architecture
We use a simple architecture that appends a preprocessing unit, dropout regularization, and a softmax classification layer to the ends of the language model. This is the standard configuration of PLMs applied to classification tasks, indicating that the proposed method does not require a task-specific architecture. Figure 3 shows the overall architecture of our model in the task-specific ontology-generation stage; it consists of three main stages. In this section, we describe the details and the intention of each part individually.


Preprocess
The first stage of our architecture is preprocessing. In most cases, PLMs require a specific input-sentence format to implement subword expressions and accept inputs for various tasks. Furthermore, relationship-classification tasks require a pair of phrases as input data, and we perform the following preprocessing steps on the phrase pairs before entering them into the model (an example of these steps is shown in Table 1).

1. Concatenation and special token insertion: First, the input phrase pair is concatenated into one sentence. At this time, a classifier token ([CLS]) is inserted at the front of the first phrase, and separator tokens ([SEP]) [12] are appended between the two phrases and at the end of the second phrase.
2. Tokenization: The concatenated phrases are divided into subwords by the tokenizer corresponding to each language model. The number of subwords is equal to or greater than the number of words in the phrase.
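The two preprocessing steps can be sketched as follows, with a toy subword vocabulary and a greedy longest-match split standing in for a real PLM tokenizer (the vocabulary entries and the [UNK] fallback are illustrative assumptions):

```python
# Toy subword vocabulary (hypothetical; a real PLM tokenizer has ~30k entries).
VOCAB = {"[CLS]", "[SEP]", "drink", "beverage", "fruit", "juice", "##s"}

def tokenize_word(word, vocab):
    """Greedy longest-match-first subword split, in the spirit of WordPiece."""
    pieces, rest = [], word
    while rest:
        for end in range(len(rest), 0, -1):
            cand = rest[:end] if not pieces else "##" + rest[:end]
            if cand in vocab:
                pieces.append(cand)
                rest = rest[end:]
                break
        else:                                # no subword matched at all
            return ["[UNK]"]
    return pieces

def preprocess(phrase1, phrase2, vocab=VOCAB):
    """Step 1: concatenate with [CLS]/[SEP]; step 2: tokenize into subwords."""
    tokens = ["[CLS]"]
    for word in phrase1.split():
        tokens += tokenize_word(word, vocab)
    tokens.append("[SEP]")
    for word in phrase2.split():
        tokens += tokenize_word(word, vocab)
    tokens.append("[SEP]")
    return tokens
```

For example, `preprocess("fruit juice", "drinks")` produces `["[CLS]", "fruit", "juice", "[SEP]", "drink", "##s", "[SEP]"]`, showing both the special-token layout and a word split into subwords.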

Pretrained Language Model
In this part, the calculations of the PLM for task-specific learning and for actual ontology generation are performed (Figure 4). The number of PLM outputs corresponds to the length of the input sequence. Because this study addresses a classification problem, we extract only the head vector, corresponding to the [CLS] token inserted during preprocessing.




Classification Layers
This part classifies the hidden vector (containing the two-phrase relationship information) into four types of relationship labels. First, a single linear layer translates the vector into as many outputs as there are class labels. Then, dropout with a probability factor of 0.1 is applied to regularize and prevent overfitting; the dropout is applied only in the training phase, not in the inference phase. Finally, the softmax classification layer outputs the probabilities of the input text belonging to each class label, such that the probabilities sum to 1. The softmax layer is simply a fully connected neural network layer with the softmax activation function.
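The two operations described above can be sketched in plain Python. The rescaling-by-1/(1-p) ("inverted dropout") convention is one common implementation choice; it keeps the expected activation unchanged so inference needs no extra scaling:

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax: output probabilities sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dropout(vector, p=0.1, training=True):
    """Inverted dropout: zero each unit with probability p during training
    and rescale survivors by 1/(1-p); at inference the input passes through."""
    if not training:
        return list(vector)
    return [0.0 if random.random() < p else x / (1.0 - p) for x in vector]
```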

Phrase-Pair Relationship Datasets
As described in Section 2.1, ontology generation can be realized by relationship classification between phrases. The proposed method using PLMs performs pretraining in the general learning process, so fewer data are required for task-specific learning than in previous methods, though a large-scale dataset is still required for general ontology generation. Because the ultimate goal is to build an ontology, it is ideal to obtain the data from a real ontology or something close to it. Therefore, in this study, we acquired data from WordNet [27], a large-scale concept dictionary and database with a structure similar to that of an ontology. Then, as a test of extracting ontologies from real data, we extracted noun phrases contained in sentences from the SQuAD data. The rest of this chapter provides an overview of the datasets and how the phrase-relationship datasets were acquired.

Overview of WordNet
In WordNet, to combine concepts that can be expressed by different notations with the same meaning into one object, a group of phrases called a synset is used as the smallest unit. All synsets in WordNet form a hierarchical structure; if synset A is a more abstract version of synset B, synset A is defined as a hypernym of synset B. Given this structure, WordNet can be viewed as a huge, multipurpose ontology covering the entire language, and it can be used as supervision data in ontology generation.

Dataset Extraction from WordNet
In this research, we extracted pairs of phrases with four kinds of relationships (synonym, hypernym-hyponym, hyponym-hypernym, and unrelated) from the taxonomy structure of WordNet (Figure 5). The details of the datasets are shown in Table 2. First, for all noun synsets, all pairs of terms registered in one synset are labeled as synonym pairs. Next, we extract the hyponyms of all target noun synsets and register the pairs as hypernym-hyponym relations; pairs of phrases in the reversed order are labeled as hyponym-hypernym relations. Finally, we randomly generated many pairs of phrases that have no special relationship and labeled them as unrelated pairs.
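The extraction procedure can be sketched on a toy taxonomy. The synsets below are hypothetical stand-ins for WordNet data, and the random sampling of unrelated pairs is omitted for brevity:

```python
import itertools

# Toy taxonomy in the style of WordNet synsets:
# synset id -> (synonym terms, hyponym synset ids). Values are hypothetical.
TAXONOMY = {
    "beverage": (["beverage", "drink"], ["juice", "coffee"]),
    "juice":    (["juice"], []),
    "coffee":   (["coffee"], []),
}

def extract_pairs(taxonomy):
    """Emit (term_a, term_b, label) triples from the taxonomy structure."""
    pairs = []
    for synset, (synonyms, hyponyms) in taxonomy.items():
        for a, b in itertools.permutations(synonyms, 2):       # synonym pairs
            pairs.append((a, b, "synonym"))
        for hypo in hyponyms:                                  # hypernym/hyponym pairs
            for hyper_term in synonyms:
                for hypo_term in taxonomy[hypo][0]:
                    pairs.append((hyper_term, hypo_term, "hypernym-hyponym"))
                    pairs.append((hypo_term, hyper_term, "hyponym-hypernym"))
    return pairs
```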

Dataset of SQuAD V2.0
The SQuAD v2.0 dataset provides a context and multiple questions. The answer to a question is returned as the positions of its first and last characters or, if the answer is not in the context, as an indication that the question cannot be answered [28]. This yields a more realistic dataset for confirming the accuracy of a QA system, designed to be closer to actual QA situations. A QA system relies on ontology relationships when questions and answers are created; therefore, it should be possible to extract an ontology from the QA dataset with our model.

Extraction of Nouns
In this research, we use a part-of-speech (POS) tagger [29] to extract the nouns before and after particular words (is, are, was, were, 's, 're, such as). First, we look for sentences that contain a particular word. For each sentence in each context, we use the POS tagger to extract the nouns from the sentence. Then, we divide the nouns into two groups: those before the particular word and those after it. All combinations of nouns across the two groups are considered as candidates.
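The grouping-and-pairing step can be sketched as follows. Here the noun list stands in for POS-tagger output, and only the first matching copula word is used; using the first occurrence of each noun to decide its side is an illustrative simplification:

```python
import itertools

COPULAS = ("is", "are", "was", "were", "'s", "'re", "such as")

def candidate_pairs(sentence, nouns):
    """Pair every noun before the first copula with every noun after it.
    `nouns` stands in for the POS tagger's output for this sentence."""
    for cop in COPULAS:
        idx = sentence.find(" " + cop + " ")
        if idx != -1:
            before = [n for n in nouns if sentence.index(n) < idx]
            after = [n for n in nouns if sentence.index(n) > idx]
            return list(itertools.product(before, after))
    return []
```

For example, `candidate_pairs("A dog is an animal", ["dog", "animal"])` returns the single candidate `("dog", "animal")`, which the trained classifier then labels.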

Extraction of Noun Phrases
In this study, we use an encoder-decoder model with label attention and improved self-attention [24] to analyze sentences and extract noun phrases. We apply the model to each sentence in each context, allowing us to better recognize the sentence structure and extract the noun phrases from it. Then, all patterns of noun phrases in each context are considered as candidates.

Training and Validation
In the experiment, we quantitatively compared the RNN-based and PLM-based methods by observing their classification accuracy. To compare BPE with the RNN method using the existing Word2vec, we also investigated the combination of the BERT-based model's embedding layer with an RNN. Using the dataset acquired from WordNet and described in Table 3, we trained the models and measured classification performance after task-specific learning. The WordNet dataset was randomly divided into three parts: training, validation, and test data. The validation and test sets each contain 10,000 examples, and all remaining data are used for training.
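The random split described above can be sketched as follows (the fixed seed is an assumption added for reproducibility; the paper does not state one):

```python
import random

def split_dataset(data, n_valid=10_000, n_test=10_000, seed=0):
    """Shuffle and carve out fixed-size validation and test sets;
    everything else is used for training, as in the WordNet experiment."""
    data = list(data)
    random.Random(seed).shuffle(data)
    valid = data[:n_valid]
    test = data[n_valid:n_valid + n_test]
    train = data[n_valid + n_test:]
    return train, valid, test
```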

Model Setup
To confirm the validity of PLMs for the relationship-classification task, it is necessary to perform validation on many models. Additionally, these language models have configurable parameters, such as the number of layers and the hidden size. As with neural network models in other domains, verifying the effects of these parameters is also important for PLMs. For these reasons, we compared the major PLM architectures; their configurations are summarized in Table 4, and the vocabulary sizes of the word-embedding methods are given in Table 5.

Ontology Generation for the Real Text
As an ontology-generation test, we use the model trained on WordNet to classify the relationships between nouns and noun phrases in SQuAD v2.0 contexts. In this test, we use the model that performed best on the WordNet classification task, determine the relationships in each SQuAD v2.0 context, and examine the results obtained.

Evaluation

Comparison of Accuracy
In this experiment, we compared the accuracy of the previous model, the proposed model, and its variations. These models were trained for a number of epochs on the phrase-pair relationship datasets extracted from WordNet; the learning process was stopped based on validation results after empirical measurement. Finally, classification performance was evaluated on the test datasets. The accuracy results are summarized in Table 6, along with the calculation times required to learn one epoch. Figures 6 and 7 show that the PLM-based method achieves better performance than the RNN, even with few epochs. This experiment shows that the proposed PLM-based method has significantly better accuracy than existing methods. In particular, BERT-large reaches the highest accuracy, 98.6%, among the compared models. By contrast, the calculation time is much longer than for the existing methods; even the minimum-configuration ALBERT-base takes four times as long as the RNN-based method. Regarding the embedding method, the BERT-embed + RNN method achieves better accuracy with a smaller vocabulary than the existing Word2vec + RNN method, indicating that BPE is an effective method for ontology generation.


Comparison of Batch Size with BERT-Base
In this section, we performed additional experiments regarding batch size with the BERT-based method (Figure 8). Typically, a PLM has a large number of stacked transformer layers to acquire linguistic knowledge that can support a wide variety of tasks through general learning; the bulk of such a model comes at the expense of memory usage and computation time, and many PLMs are more expensive than regular language models, as shown in the previous experiments. As the model grows, the memory occupied by each piece of data increases and the usable batch size tends to be limited; in particular, large models need to be trained with a rather small batch. In the previous experiments, we used the maximum batch size for each architecture because of the memory limit of our machine. However, the effect of batch size on calculation speed was unknown, so here we show the effect of changes in batch size on calculation time. This experiment examines the dependence of the proposed method on the learning environment by measuring the effect of batch size on the model.
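The memory constraint described above can be made concrete with a back-of-the-envelope sketch. All memory figures below are hypothetical illustrations, not measurements from the paper: the point is only that a larger model leaves less free memory and costs more per sample, so the usable batch size shrinks.

```python
def max_batch_size(gpu_memory_mb, model_memory_mb, per_sample_mb):
    """Largest batch that fits: memory left after the model's own footprint
    is divided by the per-sample cost (activations and gradients)."""
    free = gpu_memory_mb - model_memory_mb
    if free <= 0:
        return 0
    return free // per_sample_mb

# Hypothetical numbers on a 16 GB device: the larger model both occupies
# more memory itself and costs more per sample, limiting the batch size.
base_batch = max_batch_size(16000, 4000, 90)     # smaller-model footprint
large_batch = max_batch_size(16000, 12000, 250)  # larger-model footprint
```

Under these illustrative figures the smaller configuration fits a batch of 133 while the larger one fits only 16, which mirrors why the large models in the experiments had to be trained with a rather small batch.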

Ontology-Generation Experiment Results
In this section, we carried out the process of generating an ontology for SQuAD v2.0, shown in Figure 9. We show the results of extracting taxonomical relationships from actual data taken from the SQuAD dataset. As seen in Table 7, for noun pairs the model reduced the number of phrases from over 1 million to 250,000, and for noun phrase pairs from nearly 3 million to 560,000. The model was trained from scratch using sentences of the SQuAD v2.0 dataset, and terms or term sequences for taxonomy classification were selected by a sentence parser; the parser also removes unnecessary terms during term filtering. As there are several hundred topic categories in the SQuAD dataset, we show the generation results for a small set of topics, such as art, music, and iPad. Firstly, by human evaluation, 47 of the 50 generated noun pairs were correct taxonomical (super-sub/sub-super) classifications (Table 8). The classifications are reasonable and reflect real-life facts or conceptual hierarchy. Examples of misclassified pairs include 'department'-'store' (which may come from 'department store'-'store'), 'passages'-'passages' (classified as super-sub, but it is a synonym pair), and 'environment'-'reputation' (which may come from 'environment reputation'-'reputation'). Secondly, of the 36 noun phrase cases (Table 9), two cannot be counted as correct taxonomical relations: 'official posts'-'official posts' (a synonym pair) and 'succession important posts'-'hereditary positions' (close in meaning but difficult to accept as a complete hypernym relation).
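The term-filtering and pair-generation steps above can be sketched in miniature. This is an illustrative toy, not the authors' parser: the stopword list and the filtering rules (alphabetic words longer than one character) are assumptions standing in for the actual parser-based filtering.

```python
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def filter_terms(candidates):
    """Keep candidate terms that look like content-bearing noun phrases:
    non-empty after stopword removal, alphabetic, and not single letters."""
    kept = []
    for term in candidates:
        words = term.lower().split()
        content = [w for w in words if w not in STOPWORDS]
        if content and all(w.isalpha() and len(w) > 1 for w in content):
            kept.append(" ".join(content))
    return kept

def candidate_pairs(terms):
    """Form unordered term pairs for taxonomy classification, skipping
    identical pairs (which could only be synonyms, not super-sub)."""
    return [(a, b) for i, a in enumerate(terms)
            for b in terms[i + 1:] if a != b]
```

Filtering shrinks the candidate set before classification, which is what reduces the millions of raw pairs to the counts reported in Table 7; the surviving pairs are then passed to the trained classifier.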

Conclusions and Future Work
In this paper, we applied transfer learning and PLM technologies to ontology generation to improve the performance of relationship classification between phrase pairs. As a result, a dramatic improvement of +12.4% was achieved in terms of accuracy without a sophisticated architecture. The results also show that larger model configurations provide better accuracy in PLM-based approaches. However, in terms of computational resources, the results show that even ALBERT, a model with improved memory usage, fell short of the existing methods. Combining knowledge-distillation approaches such as DistilBERT [30] and TinyBERT [31] with optimized models such as ALBERT to address these weaknesses is an example of future work. Although the PLM-based method succeeded in reproducing the hierarchical structure of general-purpose ontologies such as WordNet, it does not yet support domain-specific ontologies. In fact, we were able to extract an ontology from SQuAD v2.0 for noun phrases and nouns and found that our model mostly works. However, as some terms are misclassified, further investigation is needed to develop a method for a better ontology. Therefore, creating an ontology that depends on purpose, quality, and uniqueness is also one of the issues to be addressed in the future. The code is available at https://github.com/atshb/Text2ontology (accessed on 25 October 2021).

Figure 1 .
Figure 1. Example of a hierarchical structure of an ontology in which vehicle is the top class.
Copyright: © 2021 by the authors.Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses /by/4.0/).

Figure 3 .
Figure 3. The learning procedure of the proposed pretrained language model-based method.

Figure 4 .
Figure 4. The architecture of the proposed pretraining-based method for taxonomic structure generation and the calculation process flow.

Figure 5 .
Figure 5. An example of the data extraction process from WordNet's taxonomic structure.

Figure 6 .
Figure 6. Comparison of validation loss transitions.

Figure 8 .
Figure 8. Effect of train batch size on calculation time with a BERT-based model.

Table 1 .
The architecture of the proposed pretraining-based method for taxonomic structure generation and the calculation process flow.

Table 2 .
Details of dataset extracted from WordNet.

Table 3 .
Details of distribution of the dataset.

Table 4 .
Configuration of pretrained language models used for comparison and learning parameters.

Table 5 .
Vocabulary size of word-embedding methods.

Table 6 .
Comparison of relationship-classification accuracies between phrases, with the calculation time for learning expressed as a ratio when the learning time of RNN is set to 1.0.

Table 7 .
The total number of noun pairs and noun phrases extracted from SQuAD v2.0 and the total number of ontology relations generated by the ontology generator.

Table 9 .
Result of applying the ontology generator to the extracted noun phrases.