1. Introduction
Wikipedia, the largest and most popular free Web-based encyclopedia, covers various fields of knowledge. Thanks to the work of Wikipedia authors, the number of Wikiprojects that represent different directions of scientific research is growing exponentially. Therefore, the task of identifying common information spaces in Wikipedia is becoming more important.
Because the information community changes constantly, information spaces are not only heterogeneous but also dynamic. Consequently, adequate identification of the common information spaces of Wikipedia communities requires a higher level of text processing, including semantic processing of sources. In contrast to individual words, short text fragments (i.e., collocations) carry more specific semantic information about particular Wikiprojects. Therefore, detecting the similarity of text fragments with Natural Language Processing approaches makes it possible to identify common Wikipedia communities.
It should be noted that the Wikipedia community is generally defined as “the community of contributors to the online encyclopedia Wikipedia” [1], who can create and edit articles of Wikipedia projects in different languages and on different topics. However, in this study, the term “Wikipedia community” refers to the unity of information contained in short text fragments of dynamic Wikipedia resources of varying research directions.
In our study, we propose an information technology for identifying the semantic proximity of short text fragments in Wikipedia articles. It allows the formation of common information spaces, thereby providing relevant search and access to Wikipedia articles written on related topics.
2. Related Work
Traditionally, information spaces, which represent research fronts [2] and denote a community of scientific directions and sources, are identified on the basis of explicit criteria such as citation, co-citation, prospective links, keywords, etc.
One of the main approaches to the formation of common information spaces is the analysis of document citation. According to the co-citation approach [3,4], jointly cited documents reflect the main directions of modern research and create the “core” of a specialty or branch of science.
A similar analysis of relationships is found in the method of prospective connections. In [5], “closeness of documents” was evaluated as the number of sources that cite these documents simultaneously.
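The closeness measure described above can be illustrated with a minimal sketch. It assumes that citation data is available as a mapping from each citing source to the documents it cites; the function name `cocitation_counts` and the toy data are hypothetical and are not taken from [5].

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citing):
    """For every pair of documents, count the sources that cite both."""
    counts = Counter()
    for cited in citing.values():
        # Deduplicate and sort so each unordered pair has one canonical key.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: citing source -> documents it cites.
citing = {
    "s1": ["A", "B", "C"],
    "s2": ["A", "B"],
    "s3": ["B", "C"],
}

counts = cocitation_counts(citing)
print(counts[("A", "B")])  # 2: sources s1 and s2 cite both A and B
```

Pairs with a higher count are treated as closer under this measure; a pair that is never cited together simply has a count of zero.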
In [6,7], the authors defined statistical methods of research front identification, such as counting the number of publications and the citation index method. In the formation of information spaces, the statistical method uses the number of publications, links and keywords, as well as the number of scientists, journals, discoveries, etc. Measuring the number of articles in scientific areas gives an idea of the relative level of development of individual branches of science in the formation of information spaces.
In [8,9,10], a hybrid measure of publication proximity was also used to identify research fronts. In these approaches, the measure was calculated from three components: thematic similarity of texts, common citations and common authors.
Generally, the number of highly cited articles and the sum of citation frequencies show the size of the research front.
However, due to continuous information changes, the use of explicit criteria is not enough to adequately form the information spaces of scientific communities.
To solve this problem, it is necessary to increase the level of natural language processing by identifying fragments of texts or phrases that are close in meaning.
The most well-developed methods for determining the semantic similarity of short text fragments are the following: the method for determining synonymous collocations based on mutual information features [11]; the method for identifying rephrases using the similarity of fragments of phrases [12]; the method for determining context similarity based on the analysis of parallel corpora [13,14]. Similar studies on semantic proximity are monolingual sentence alignment algorithms [15,16]. In [17,18], the authors applied this method to study unsimplified and simplified texts in the English and Spanish languages.
All of the listed approaches either work only on texts from rather narrow subject areas or rely on statistical techniques that yield rather low precision when extracting similar text fragments.
3. Mathematical Model
To identify information-linguistic entities, in particular collocations with language-specific flexibility and ambiguity, we use intelligent methods for processing natural-language texts.
As the formal apparatus for constructing a model that extracts a discrete, finite set of similar text fragments from Wikipedia articles, we use the algebra of finite predicates.
According to previous studies [19,20], the model formalizes semantically similar text fragments by means of grammatical and semantic characteristics of words in collocations. These characteristics distinguish the role of words in substantive, attributive and verbal collocations (the main word x and the dependent word y).
To define a set of grammatical and semantic characteristics of collocation words, we use qi, which formalizes the values of the subject variables ai and ci (Table 1).
The subject variable ai denotes grammatical characteristics of adjacent words in collocations where i signifies the following values:
(1) N—a noun functioning as one of the components of a clause is represented as follows:
NSub—Noun, Subject—a syntactic role of a noun in the sentence or the main word in the substantive collocation;
NSubOf—Noun, Subject with the preposition “of” (using the preposition “of” after the main word in the substantive collocation);
NObj—Noun, Object—a syntactic role of a noun in the sentence or the dependent word in the substantive collocation;
NObjOf—Noun, Object with the preposition “of” (using the main or dependent word with the preposition “of” in the substantive collocation).
(2) A—an adjective. The position of adjectives is considered:
AAtt—Adjective, Attributive—an adjective used as an attribute before a noun in the sentence;
APr—Adjective, Predicative—an adjective used as a nominal part of the predicate in the sentence.
(3) V—a verb. The category of transitivity is described:
VTr—Verb, Transitive—a verb without a preposition that can have a direct object;
VIntr—Verb, Intransitive—a verb that does not have a direct object.
The subject variable ci denotes semantic roles of nouns in collocations. Semantic roles link words to syntactically dependent ones and correspond to variables in the interpretation of lexical meaning.
The semantic characteristics are defined as follows:
Ag—Agent—an active participant in the situation or an initiator and controller of an action;
Att—Attribute—a link between an object and its attribute;
Pac—Patient—a passive participant in the situation or an object of an action;
Adr—Addressee—a recipient of a message;
Ins—Instrument—a participant with the help of whom an action is carried out or an action instrument used by one of the participants;
M—Location—the location of one of the participants in the situation.
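The two inventories listed above can be written down as a small data structure. The following is only an illustrative sketch: the enum member names and the `Collocate` record are hypothetical, and the annotated example word is invented rather than taken from Table 1. We also assume that a semantic role may be absent for words that are not nouns.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Gram(Enum):
    """Grammatical characteristics (values of the subject variable ai)."""
    N_SUB = "NSub"
    N_SUB_OF = "NSubOf"
    N_OBJ = "NObj"
    N_OBJ_OF = "NObjOf"
    A_ATT = "AAtt"
    A_PR = "APr"
    V_TR = "VTr"
    V_INTR = "VIntr"


class Role(Enum):
    """Semantic roles (values of the subject variable ci)."""
    AG = "Ag"
    ATT = "Att"
    PAC = "Pac"
    ADR = "Adr"
    INS = "Ins"
    M = "M"


@dataclass(frozen=True)
class Collocate:
    """One word of a collocation with its characteristics."""
    lemma: str
    gram: Gram
    role: Optional[Role] = None  # assumption: roles are assigned to nouns only


# Invented example annotation, for illustration only.
word = Collocate("director", Gram.N_SUB, Role.AG)
print(word.gram.value, word.role.value)  # NSub Ag
```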
Formal numbers q = {1,36} denote the possible values of grammatical and semantic characteristics of collocation words. We redefine the variable q using the predicate as follows.
In substantive, attributive and verbal collocations, a set of possible semantic and grammatical characteristics for the main collocation word is defined by the predicate P(x). Therefore, P(x) = 1 if the main word of a collocation has certain semantic-grammatical information:
A set of possible semantic and grammatical characteristics for the dependent collocation word is defined by the predicate P(y):
Using the set of Equations (1) and (2), the predicate of semantic equivalence between collocations consisting of pairwise synonymous words is defined as follows:
Using the algebra of finite predicates, we define the value of the predicate of semantic equivalence for three main types of collocations:
For substantive collocations: .
For attributive collocations: .
For verbal collocations: .
Example Description
Two-word collocations formed from pairwise semantically close collocates can be either semantically close or not. An example of semantically similar phrases is shown in Figure 1.
An example of semantically dissimilar collocations composed of synonymous words is shown in Figure 2.
Hence, two collocations can be identified as semantically similar if the main word x1 is synonymous with the main word x2, the dependent word y1 is synonymous with the dependent word y2, and the grammatical and semantic characteristics of these words satisfy the predicate of semantic equivalence (4).
As a result, the proposed logical-linguistic model makes it possible to determine the semantic equivalence of two-word phrases from the semantic-grammatical characteristics of the main and dependent collocates in substantive, attributive and verbal collocations.
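The similarity condition stated above can be sketched in code. This is not the predicate of semantic equivalence itself, which is defined over the characteristics in Table 1. As a simplifying stand-in, the sketch assumes that the corresponding words of the two collocations must carry identical grammatical and semantic labels. The synonym set, the example words and their labels are all hypothetical.

```python
from typing import FrozenSet, NamedTuple, Set


class Word(NamedTuple):
    lemma: str
    gram: str  # grammatical characteristic, e.g. "NSub", "AAtt", "VTr"
    role: str  # semantic role, e.g. "Ag", "Pac"; "" when not applicable


class Collocation(NamedTuple):
    main: Word       # x
    dependent: Word  # y


# Hypothetical synonym pairs; a real system would query a thesaurus.
SYNONYMS: Set[FrozenSet[str]] = {
    frozenset({"film", "movie"}),
    frozenset({"director", "filmmaker"}),
}


def synonymous(a: str, b: str) -> bool:
    return a == b or frozenset({a, b}) in SYNONYMS


def compatible(w1: Word, w2: Word) -> bool:
    # Assumption: stand-in for the predicate, requiring identical labels.
    return w1.gram == w2.gram and w1.role == w2.role


def semantically_equivalent(c1: Collocation, c2: Collocation) -> bool:
    return (
        synonymous(c1.main.lemma, c2.main.lemma)
        and synonymous(c1.dependent.lemma, c2.dependent.lemma)
        and compatible(c1.main, c2.main)
        and compatible(c1.dependent, c2.dependent)
    )


# Invented annotations, for illustration only.
c1 = Collocation(Word("director", "NSub", "Ag"), Word("film", "NObj", "Pac"))
c2 = Collocation(Word("filmmaker", "NSub", "Ag"), Word("movie", "NObj", "Pac"))
c3 = Collocation(Word("filmmaker", "NSub", "Ag"), Word("movie", "NObj", "Ins"))

print(semantically_equivalent(c1, c2))  # True
print(semantically_equivalent(c1, c3))  # False
```

The third collocation consists of the same synonymous words as the second, but one label differs, so the pair is rejected. This mirrors the case of dissimilar collocations composed of synonymous words.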
6. Experimental Evaluation
To evaluate our technology, we extract similar collocations from different projects of the same portal as well as from different projects of two different portals.
We focused on the distribution of synonymous collocations across three types:
substantive collocations that are presented by two connected nouns;
attributive collocations where a noun is the main word and an adjective is the dependent word;
verbal collocations that are represented by a verb (the main word) and a noun (the dependent word).
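The three types listed above can be told apart from the parts of speech of the main and the dependent word. The sketch below assumes that each word is tagged with a Universal Dependencies part-of-speech tag (NOUN, ADJ, VERB); the function name and the lookup table are hypothetical.

```python
from typing import Optional

# (main word tag, dependent word tag) -> collocation type
COLLOCATION_TYPES = {
    ("NOUN", "NOUN"): "substantive",
    ("NOUN", "ADJ"): "attributive",
    ("VERB", "NOUN"): "verbal",
}


def collocation_type(main_upos: str, dependent_upos: str) -> Optional[str]:
    """Return the collocation type, or None for any other tag combination."""
    return COLLOCATION_TYPES.get((main_upos, dependent_upos))


print(collocation_type("NOUN", "ADJ"))   # attributive
print(collocation_type("VERB", "NOUN"))  # verbal
print(collocation_type("ADJ", "VERB"))   # None
```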
The results indicate the number of synonymous collocations in articles belonging to two portals (Table 3) and to the same portal (Table 4).
The tables show that synonymous collocations occur more frequently in the articles of one portal than in the articles of two different portals. According to these results, the articles of one Wikiportal are closer in subject than the articles of two different Wikiportals, which supports the correctness of our model.
In addition, the proposed technology has identified the common information space of different Wikiprojects (Film and Science and academia) from different Wikiportals (Art and Biography). This space includes articles on similar topics that have a high frequency of synonymous collocations and thereby form a common Wikipedia community.
Results Analysis
Wikipedia articles cover various subject areas represented in Wikipedia projects. Our results support the hypothesis that a large number of synonymous collocations in texts, especially texts on similar topics, can form common information spaces in Wikipedia communities.
In our experiments, we use precision to assess the reliability of our approach for the three types of collocations. To obtain the number of correctly extracted similar collocations, we take a random sample of 1000 pairs of extracted text fragments identified as synonymous collocations. Precision is the ratio of the number of pairs that an expert judged to be correctly identified as similar to the size of this representative sample.
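The calculation described above amounts to dividing the number of expert-confirmed pairs by the sample size. A minimal sketch follows; the function name is hypothetical, and the ten judgements are toy values, not the 1000-pair sample used in the experiments.

```python
def precision(expert_judgements):
    """Share of sampled pairs that the expert confirmed as similar."""
    judgements = list(expert_judgements)
    if not judgements:
        raise ValueError("the sample must not be empty")
    # True counts as 1 and False as 0 when summed.
    return sum(judgements) / len(judgements)


# Toy sample: 7 confirmed pairs out of 10.
toy_sample = [True] * 7 + [False] * 3
print(precision(toy_sample))  # 0.7
```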
The average precision of our approach is 0.781 for substantive collocations, 0.644 for attributive collocations, and 0.627 for verbal collocations. These relatively low results might be due to mistakes of the POS tagger and the UD parser. Because our model identifies a set of possible grammatical and semantic characteristics of collocation words, it depends considerably on the result of parsing. These mistakes are caused not by the chosen parser but by morphological and/or syntactic ambiguity, which is unavoidable and affects the precision of the final result.