Similar Text Fragments Extraction for Identifying Common Wikipedia Communities

Similar text fragments extraction from weakly formalized data is a task of natural language processing and intelligent data analysis and is used for solving the problem of automatic identification of connected knowledge fields. To search for such common communities in Wikipedia, we propose to use, as an additional stage, a logical-algebraic model for similar collocations extraction. With the Stanford Part-Of-Speech tagger and the Stanford Universal Dependencies parser, we identify the grammatical characteristics of collocation words. With WordNet synsets, we choose their synonyms. Our dataset includes Wikipedia articles from different portals and projects. The experimental results show the frequencies of synonymous text fragments in Wikipedia articles that form common information spaces. The number of highly frequent synonymous collocations can give an indication of key common up-to-date Wikipedia communities.


Introduction
Wikipedia, the largest and most popular free Web-based encyclopedia, covers various fields of knowledge. Thanks to Wikipedia authors, the number of Wikiprojects that represent different directions of scientific research is growing exponentially. Therefore, the task of identifying common information spaces in Wikipedia is becoming more important.
Because the information community changes constantly, the heterogeneity of information spaces is complemented by constant dynamism. Consequently, for the adequate identification of common information spaces of Wikipedia communities, it is necessary to increase the level of text processing, including the semantic processing of sources. In contrast to individual words, short text fragments (i.e., collocations) carry more specific semantic information about certain Wikiprojects. Therefore, extracting similar text fragments using Natural Language Processing approaches makes it possible to identify common Wikipedia communities.
It should be noted that in general the Wikipedia community is defined as "the community of contributors to the online encyclopedia Wikipedia" [1] that can create and edit articles of Wikipedia projects in different languages and topics. However, in this study, by the term "Wikipedia community" we refer to the unity of information contained in short text fragments of dynamic Wikipedia resources of varying research directions.
In our study, we propose the information technology for identifying the semantic proximity of short text fragments in Wikipedia articles which will allow the formation of common information spaces, thereby providing relevant search and access to Wikipedia articles written on related topics.

Related Work
Traditionally, representing research fronts [2] and denoting a community of scientific directions and sources, information spaces are identified on the basis of such explicit criteria as citation, co-citation, prospective links, keywords, etc.
One of the main approaches to the formation of common information spaces is the analysis of document citation. According to the co-citation approach [3,4], jointly cited documents reflect the main directions of modern research and create the "core" of a specialty or branch of science.
A similar analysis of relationships is found in the method of prospective connections. In [5], "closeness of documents" was evaluated as the number of sources that cite these documents simultaneously.
In [6,7], the authors defined statistical methods of research front identification such as counting the number of publications and the citation index method. In the formation of information spaces, the statistical method uses the number of publications, links and keywords, as well as the number of scientists, journals, discoveries, etc. Measuring the number of articles in scientific areas gives an idea of the relative level of development of individual branches of science in the formation of information spaces.
In [8-10], a hybrid measure of publication proximity was also used to identify research fronts. According to these approaches, the measure was calculated on the basis of three components: proximity by thematic similarity of texts, by common citation and by common authors.
Generally, the number of highly cited articles and the sum of citation frequencies show the size of the research front.
However, due to continuous information changes, the use of explicit criteria is not enough to adequately form the information spaces of scientific communities.
To solve this problem, it is necessary to increase the level of natural language processing by identifying fragments of texts or phrases that are close in meaning.
The most well-developed methods for determining the semantic similarity of short text fragments are the following: the method for determining synonymous collocations based on mutual information features [11]; the method for identifying rephrases using the similarity of fragments of phrases [12]; the method for determining context similarity based on the analysis of parallel corpora [13,14]. Similar studies on semantic proximity are monolingual sentence alignment algorithms [15,16]. In [17,18], the authors applied this method to study unsimplified and simplified texts in the English and Spanish languages.
All the listed approaches either work on texts of rather narrow subject areas or rely on statistical approaches that yield rather low precision of similar text fragments extraction.

Mathematical Model
To identify information-linguistic entities, in particular, collocations with language-specific flexibility and ambiguity, we use intellectual means for the processing of natural-language texts.
As a formal apparatus for constructing a model for extracting a discrete, finite set of similar text fragments in Wikipedia articles, we exploit the apparatus of algebra of finite predicates.
According to previous studies [19,20], the model formalizes semantically similar text fragments by means of grammatical and semantic characteristics of words in collocations. These characteristics distinguish the role of words in substantive, attributive and verbal collocations (the main word x and the dependent word y).
To define a set of grammatical and semantic characteristics of collocation words, we use $q_i$ that formalizes the values of subject variables $a_i$ and $c_i$ (Table 1).

Collocation type | Word | Grammatical characteristic | Characteristic indices
Substantive | x | NSub/NSubOf | $q_1$ to $q_6$
Substantive | x | NObjOf | $q_7$ to $q_{11}$
Substantive | y | NObj | $q_{12}$ to $q_{16}$
Attributive | y | AAtt | $q_{17}$
Attributive | y | APr | $q_{18}$
Attributive | x | NSub/NSubOf | $q_{19}$ to $q_{24}$
Attributive | x | NObjOf/NObj | -, $q_{25}$ to $q_{29}$
Verbal | x | VTr | $q_{30}$
Verbal | x | VIntr | $q_{31}$
Verbal | y | NObjOf/NObj | -, $q_{32}$ to $q_{36}$

The subject variable $a_i$ denotes grammatical characteristics of adjacent words in collocations, where $i$ signifies the following values:
(1) N: a noun functioning as one of the components of a clause, represented as follows: NSub (Noun, Subject): a syntactic role of a noun in the sentence or the main word in the substantive collocation; NSubOf (Noun, Subject with the preposition "of"): the preposition "of" is used after the main word in the substantive collocation; NObj (Noun, Object): a syntactic role of a noun in the sentence or the dependent word in the substantive collocation; NObjOf (Noun, Object with the preposition "of"): the main or dependent word is used with the preposition "of" in the substantive collocation.
(2) A: an adjective. The position of adjectives is considered: AAtt (Adjective, Attributive): an adjective used as an attribute before a noun in the sentence; APr (Adjective, Predicative): an adjective used as a nominal part of the predicate in the sentence.
(3) V: a verb. The category of transitivity is described: VTr (Verb, Transitive): a verb without a preposition that can have a direct object; VIntr (Verb, Intransitive): a verb that does not have a direct object.
The subject variable $c_i$ denotes semantic roles of nouns in collocations. Semantic roles link words to syntactically dependent ones and correspond to variables in the interpretation of lexical meaning.
The semantic characteristics are defined as follows: Ag (Agent): an active participant in the situation or an initiator and controller of an action; Att (Attribute): a link between an object and its attribute; Pac (Patient): a passive participant in the situation or an object of an action; Adr (Addressee): a recipient of a message; Ins (Instrument): a participant with the help of whom an action is carried out or an action instrument used by one of the participants; M (Location): the location of one of the participants in the situation.
The formal indices $q \in \{1, \dots, 36\}$ denote the possible values of grammatical and semantic characteristics of collocation words. We redefine the variable q using predicates as follows.
In substantive, attributive and verbal collocations, the set of possible semantic and grammatical characteristics for the main collocation word is defined by the predicate P(x), Equation (1). Thus, P(x) = 1 if the main word of a collocation carries certain semantic-grammatical information. The set of possible semantic and grammatical characteristics for the dependent collocation word is defined by the predicate P(y), Equation (2).
Using Equations (1) and (2), we define the predicate of semantic equivalence between collocations consisting of pairwise synonymous words. Using the algebra of finite predicates, we then define the value of this predicate for the three main types of collocations: substantive, attributive and verbal.
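As an illustration of how the predicates P(x) and P(y) relate to Table 1, the following minimal sketch groups the characteristic indices by the word they describe. Modelling each predicate as a membership test over an index set is an assumption made here for illustration, and the names `MAIN_WORD_Q`, `DEPENDENT_WORD_Q`, `p_main` and `p_dependent` are hypothetical; the sketch does not reproduce the authors' equations.

```python
# Characteristic indices q read from Table 1, grouped by the word they describe.
# Treating P(x) and P(y) as membership tests over these index sets is an
# assumption of this sketch, not the authors' published equations.
MAIN_WORD_Q = frozenset(range(1, 12)) | frozenset(range(19, 32))        # q1..q11, q19..q31
DEPENDENT_WORD_Q = frozenset(range(12, 19)) | frozenset(range(32, 37))  # q12..q18, q32..q36


def p_main(q: int) -> int:
    """Return 1 if q is a characteristic that Table 1 assigns to the main word x."""
    return 1 if q in MAIN_WORD_Q else 0


def p_dependent(q: int) -> int:
    """Return 1 if q is a characteristic that Table 1 assigns to the dependent word y."""
    return 1 if q in DEPENDENT_WORD_Q else 0
```

Under this reading, every index from 1 to 36 belongs to exactly one of the two sets, which mirrors the split of Table 1 into rows for x and rows for y.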

Example Description
Two-word collocations formed in pairs by semantically close collocates can be either semantically close or not. An example of semantically similar phrases is shown in Figure 1. Two collocations can be identified as semantically similar if the main word $x_1$ is synonymous with the main word $x_2$, the dependent word $y_1$ is synonymous with the dependent word $y_2$, and their grammatical and semantic characteristics satisfy the predicate of semantic equivalence (4). An example of semantically dissimilar collocations composed of synonymous words is shown in Figure 2.
As a result, the proposed logical-linguistic model makes it possible to determine the semantic equivalence of two-word phrases from the semantic-grammatical characteristics of the main and dependent collocates in substantive, attributive and verbal collocations.
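The condition described in this section (pairwise synonymous words plus compatible characteristics) can be sketched as follows. This is a simplified illustration under stated assumptions: the synonym pairs are a toy stand-in for a real lexical resource, the indices in the usage note are hypothetical, and requiring identical characteristic indices is a simplification that stands in for the predicate of semantic equivalence, whose exact form is not reproduced here.

```python
from typing import NamedTuple


class Collocation(NamedTuple):
    main: str          # main word x
    dependent: str     # dependent word y
    q_main: int        # characteristic index of x (hypothetical values in the usage note)
    q_dependent: int   # characteristic index of y


# Toy synonym pairs; a stand-in for a real lexical resource (assumption).
TOY_SYNONYMS = {frozenset({"city", "town"}), frozenset({"big", "large"})}


def synonymous(a: str, b: str) -> bool:
    """Two words count as synonymous if they are equal or listed as a toy pair."""
    return a == b or frozenset({a, b}) in TOY_SYNONYMS


def semantically_equivalent(c1: Collocation, c2: Collocation) -> bool:
    """Simplified check: pairwise synonymy and identical characteristic indices.

    Requiring identical indices is an assumption of this sketch; the model in
    the text uses a predicate over the characteristics instead.
    """
    return (
        synonymous(c1.main, c2.main)
        and synonymous(c1.dependent, c2.dependent)
        and c1.q_main == c2.q_main
        and c1.q_dependent == c2.q_dependent
    )
```

For example, `semantically_equivalent(Collocation("city", "big", 19, 17), Collocation("town", "large", 19, 17))` holds in this sketch, while changing one of the indices makes the check fail even though the words remain synonymous.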

Technology Design
Developing an information technology for identifying the semantic proximity of text fragments in Wikipedia articles of related categories to define a single information space or common fronts of scientific research, we propose using logical equations for similar collocations extraction.These equations are based on grammatical and semantic characteristics of collocation words.
The proposed technology for automatic identification of the information space of semantically connected Wikipedia data (Figure 3) includes:
1. the extraction of semantic-grammatical characteristics of words that can potentially be elements of substantive, attributive and verbal collocations;
2. the identification of collocations, i.e., phrases formed by two adjacent word forms. To identify the grammatical characteristics, we use the Stanford Part-Of-Speech (POS) tagger and the Stanford Universal Dependencies (UD) parser. The tagger identifies morphological features of words, and the UD parser determines syntactic links between the words in a sentence;
3. the discovery of synonymous collocation words using WordNet synsets;
4. the identification of semantic equivalence of two-word collocations, i.e., word combinations that have common elements of meaning.
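Step 2 of the list above can be illustrated with a minimal sketch that scans adjacent word forms in an already tagged sentence. It assumes Penn Treebank style tags (NN*, JJ*, VB*) as input and uses adjacency alone, without the dependency links that the UD parser provides in the described technology; the function name `extract_collocations` is hypothetical.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]  # (word form, POS tag), Penn Treebank style tags assumed


def extract_collocations(tagged: List[TaggedToken]) -> List[Tuple[str, str, str]]:
    """Return (type, first word, second word) for adjacent two-word candidates.

    Words are kept in surface order. Only the tag patterns noun+noun,
    adjective+noun and verb+noun are recognised in this sketch.
    """
    found = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if t1.startswith("NN") and t2.startswith("NN"):
            found.append(("substantive", w1, w2))
        elif t1.startswith("JJ") and t2.startswith("NN"):
            found.append(("attributive", w1, w2))
        elif t1.startswith("VB") and t2.startswith("NN"):
            found.append(("verbal", w1, w2))
    return found
```

A real pipeline would take the tags from the POS tagger and confirm each candidate pair against the syntactic links returned by the parser before looking up synonyms.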
Data 2018, 3, x FOR PEER REVIEW

Data Description
In our approach, we use Wikipedia articles from different Wikiprojects [21-24] created by the communities of these projects. Our dataset includes more than half a million (502,274) articles.

Experimental Evaluation
To evaluate our technology, we extract similar collocations from different projects of the same portal as well as from different projects of two different portals.
We examined the distribution of synonymous collocations by three types:
• substantive collocations, which are represented by two connected nouns;
• attributive collocations, where a noun is the main word and an adjective is the dependent word;
• verbal collocations, which are represented by a verb (the main word) and a noun (the dependent word).
The results indicate the number of synonymous collocations in articles belonging to two portals (Table 3) and to the same portal (Table 4). The tables show that synonymous collocations occur more frequently in the articles of one portal than in the articles of two different portals. According to these results, the articles of one Wikiportal are closer in subject than the articles of two different Wikiportals, which supports the correctness of our model.
In addition, the proposed technology has identified the common information space of different Wikiprojects (Film and Science and academia) from different Wikiportals (Art and Biography), including articles on similar topics that have a high frequency of synonymous collocations and thereby form a common Wikipedia community.

Results Analysis
Wikipedia articles cover various subject areas represented in Wikipedia projects. Our results support the hypothesis that many synonymous collocations from texts, especially those related to similar topics, can form common information spaces in Wikipedia communities.
In our experiments, we use precision to assess the reliability of our approach for three types of collocations. To obtain the number of correctly extracted similar collocations, we use a random sample of 1000 pairs of extracted text fragments identified as synonymous collocations and calculate the ratio of the number of pairs correctly identified as similar, according to an expert opinion, to the size of this representative sample.
The average precision of our approach is 0.781 for substantive collocations, 0.644 for attributive collocations and 0.627 for verbal collocations. The relatively low results might be due to mistakes of the POS tagger and the UD parser. As our model identifies a set of possible grammatical and semantic characteristics of collocation words, it depends considerably on the result of parsing. These mistakes are not determined by the chosen parser but stem from morphological and/or syntactic ambiguity, which is unavoidable and affects the precision of the final result.
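The precision measure described above, the share of sampled pairs that an expert confirms as similar, can be sketched as follows. The function name and the toy judgements are hypothetical and serve only to show the ratio; they are not data from the experiments.

```python
from typing import Sequence


def precision(expert_judgements: Sequence[bool]) -> float:
    """Share of sampled pairs that the expert confirmed as similar collocations."""
    if not expert_judgements:
        raise ValueError("the sample must contain at least one pair")
    confirmed = sum(1 for is_correct in expert_judgements if is_correct)
    return confirmed / len(expert_judgements)


# Toy sample of four pairs, three of them confirmed by the expert (hypothetical).
toy_sample = [True, True, False, True]
toy_precision = precision(toy_sample)
```

In the evaluation described in the text, the sample would contain 1000 judgements per run rather than four.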

Conclusions and Further Work
This research presents a technology for analyzing the semantic similarity of Wikipedia articles on various topics and thereby identifying common Wikipedia communities. Based on the algebra of finite predicates, the developed model makes it possible to define semantically similar text fragments in Wikipedia articles from different projects. The experimental results support the reliability of the proposed model.
The proposed technology is beneficial for retrieving more relevant documents on the Internet, in particular articles from a common information space in Wikipedia, as well as for simplifying the search engine optimization (SEO) of content. Our model is a linguistic tool that, together with other approaches, can be helpful in the formation of electronic catalogues of semantically connected texts in scientometric, library, and abstract systems.
Our further work will be directed at integrating our technology into systems for the automatic generation of Wikipedia communities. We will focus on extracting paraphrases from bilingual Wikipedia articles. Our future work will also extend to other types of collocations, such as Verb-Adverb, Adverb-Adjective, etc., which broaden the scope of the research on information spaces and can lead to more precise results.

Figure 3. Scheme of identifying information spaces of common semantic fragments of Wikipedia articles.

Table 1. A set of grammatical and semantic features of collocations.

Table 3. Relative frequencies of synonymous collocations that occur in two different projects of two different portals.

Table 4. Relative frequencies of synonymous collocations that occur in two different projects of the same portal.