  • Article
  • Open Access

13 May 2020

Modeling Popularity and Reliability of Sources in Multilingual Wikipedia

Department of Information Systems, Poznań University of Economics and Business, 61-875 Poznań, Poland
This article belongs to the Special Issue Quality of Open Data

Abstract

One of the most important factors impacting the quality of content in Wikipedia is the presence of reliable sources. By following references, readers can verify facts or find more details about the described topic. A Wikipedia article can be edited independently in any of over 300 languages, even by anonymous users; therefore, information about the same topic may be inconsistent. This also applies to the use of references in different language versions of a particular article, so the same statement can be supported by different sources. In this paper we analyzed over 40 million articles from the 55 most developed language versions of Wikipedia to extract information about over 200 million references and find the most popular and reliable sources. We present 10 models for assessing the popularity and reliability of sources based on the analysis of meta information about the references in Wikipedia articles, page views and authors of the articles. Using DBpedia and Wikidata we automatically identified the alignment of the sources to specific domains. Additionally, we analyzed the changes of popularity and reliability over time and identified growth leaders in each of the considered months. The results can be used for quality improvement of the content in different language versions of Wikipedia.

1. Introduction

Collaborative wiki services are becoming an increasingly popular source of knowledge in different countries. One of the most prominent examples of such free knowledge bases is Wikipedia. Nowadays this encyclopedia contains over 52 million articles in over 300 language versions []. Articles in each language version can be created and edited even by anonymous (unregistered) users. Moreover, due to the relative independence of contributors in each language, we can often encounter differences between articles about the same topic in various language versions of Wikipedia.
One of the most important elements that significantly affect the quality of information in Wikipedia is the availability of a sufficient number of references to sources. These references can confirm the facts provided in the articles. Therefore, the community of Wikipedians (editors who write and edit articles) attaches great importance to the reliability of sources. However, each language version can provide its own rules and criteria of reliability, as well as its own list of perennial sources whose use on Wikipedia is frequently discussed []. Moreover, these reliability criteria and lists of reliable sources can change over time.
According to English Wikipedia content guidelines, information in the encyclopedia articles should be based on reliable, published sources. The word “source” in this case can have three interpretations []: the piece of work (e.g., a book, article, research study), the creator of the work (e.g., a scientist, writer, journalist), or the publisher of the work (e.g., MDPI or Springer). The term “published” is often associated with text materials in printed format or online. Information in other formats (e.g., audio, video) can also be considered a reliable source if it was recorded or distributed by a reputable party.
The reliability of a source in Wikipedia articles depends on context. Academic and peer-reviewed publications as well as textbooks are usually the most reliable sources in Wikipedia. At the same time, not all scholarly materials meet the reliability criteria: some works may be outdated, in competition with other research in the field, or even controversial in light of other theories. Another popular source of Wikipedia information is well-established press agencies. News reporting from such sources is generally considered reliable for statements of fact []. However, precautions must be taken with breaking news, as it can contain serious inaccuracies.
Despite the fact that Wikipedia articles must present a neutral point of view, referenced sources are not required to be neutral, unbiased, or objective. However, websites whose content is largely user-generated are generally unacceptable. Such sites include personal or group blogs, content farms, forums, social media (e.g., Facebook, Reddit, Twitter), IMDb, most wikis (including Wikipedia itself) and others. Additionally, some sources can be deprecated or blacklisted on Wikipedia.
Given that there are more than 1.5 billion websites on the World Wide Web [], assessing the reliability of all of them is a challenging task. Additionally, reliability is a subjective concept related to information quality [,,], and each source can be assessed differently depending on the topic and language community of Wikipedia. It should also be taken into account that the reputation of a newspaper or website can change over time, so periodic re-assessment may be necessary.
According to the English Wikipedia content guideline []: “in general, the more people engaged in checking facts, analyzing legal issues and scrutinizing the writing, the more reliable the publication.” The related work described in Section 2 shows that there is room for improving approaches to source assessment based on publicly available Wikipedia data using different measures of Wikipedia articles. Therefore, we decided to extract measures related to the demand for information and the quality of articles and to use them to build 10 models for assessing the popularity and reliability of sources in different language versions over various periods. The simplest model is based on frequency of occurrence, which is commonly used in other related works [,,,]. The other nine novel models use various combinations of measures related to the quality and popularity of Wikipedia articles. The models are described in Section 3.
In order to extract sources from references of Wikipedia articles in different languages, we designed and implemented our own algorithms in Python. In Section 4 we describe basic and complex methods of extracting references from Wikipedia articles. Based on the data extracted from references in each Wikipedia article, we added different measures related to the popularity and quality of Wikipedia articles (such as pageviews, number of references, article length, number of authors) to assess sources. Based on the results, we built rankings of the most popular and reliable sources in different language editions of Wikipedia. Additionally, we compare the positions of selected sources in the reliability rankings of different language versions of Wikipedia in Section 5. We also assess the similarity of the rankings of the most reliable sources obtained by different models in Section 6.
We also designed our own algorithms leveraging data from semantic databases (Wikidata and DBpedia) to extract additional metadata about the sources, unify them and classify them in order to find the most reliable sources in specific domains. In Section 7 we show the results of analyzing sources based on selected parameters from citation templates (such as “publisher” and “journal”), and separately we analyze the topics of sources based on semantic databases.
Using different periods, we compare the results of the popularity and reliability assessment of the sources in Section 8. By comparing the obtained results we were able to find growth leaders, described in Section 9. We also present an assessment of the effectiveness of the different models in Section 10.1. Additionally, we provide information about the limitations of the study in Section 10.2.

2. Recent Work

Because source reliability is important for the quality assessment of Wikipedia articles, there is a wide range of works covering the analysis of references in this encyclopedia.
Some studies used reference counts in models for automatic quality assessment of Wikipedia articles. One of the first works in this direction used reference count as a structural feature to predict the quality of Wikipedia articles [,]. Based on the references, users can assess the trustworthiness of Wikipedia articles; therefore, we consider the source of information an important factor [].
References often contain an external link to the source page (URL) where the cited information is located. Therefore, including the number of external links in Wikipedia articles in such models can also help to assess information quality [,].
In addition to the analysis of quantity, there are studies analyzing qualitative characteristics and metadata related to references. One of these works used special identifiers (such as DOI and ISBN) to unify the references and find the similarity of sources between language versions of Wikipedia []. Another recent study analyzed engagement with citations in Wikipedia articles and found that references are consulted more often when readers cannot find enough information in the selected Wikipedia article []. There are also works showing that many citations in Wikipedia articles refer to scientific publications [,], especially open-access ones [], and that Wikipedia authors prefer to cite recently published journal articles []. Thus, Wikipedia is especially valuable due to the potential of direct linking to primary sources. Another popular type of source in Wikipedia is news websites, and there is a method for automatically suggesting news sources for selected statements in articles [].
Reference analysis can be important for the quality assessment of Wikipedia articles. At the same time, articles of higher quality should have more proven and reliable sources. Therefore, in order to assess the reliability of a specific source, we can analyze the Wikipedia articles in which the related references are placed.
The relevance of article length and number of references for quality assessment of Wikipedia content has been supported by many publications [,,,,,,,]. Particularly interesting is the combination of these indicators (e.g., the ratio of references to article length), as it can be more actionable in quality prediction than each of them separately [].
The information quality of Wikipedia also depends on the authors who contributed to the article. High-quality articles are often jointly created by a large number of different Wikipedia users [,]. Therefore, we can use the number of unique authors as one of the measures of the quality of Wikipedia articles [,,]. Additionally, we can take into account information about the experience of Wikipedians [].
One of the recent studies showed that after loading a page, readers click on an external reference 0.2% of the time, click on an external link 0.6% of the time, and hover over a reference 0.8% of the time []. Therefore, popularity can play an important role not only for estimating the quality of information in a specific language version of Wikipedia [] but also for checking the reliability of its sources. A larger number of readers of a Wikipedia article may allow for faster correction of incorrect or outdated information []. The popularity of an article can be measured based on the number of visits [].
Taking into account different studies related to reference analysis and quality assessment of Wikipedia articles, we created 10 models for source assessment. Unlike other studies, we used more complex methods of reference extraction and included more language versions of Wikipedia. Additionally, we used a semantic layer to identify source types and metadata in order to create rankings of sources in specific domains. We also took into account different periods to compare the reliability indicators of a source in various months and to find the growth leaders. Moreover, the models were used to assess references based on publicly available data (Wikimedia Downloads []), so anybody can use them for different purposes.

3. Popularity and Reliability Models of the Wikipedia Sources

In this section we describe ten models related to the popularity and reliability of sources. In most cases a source means the domain (or subdomain) of the URL in references. The models are identified with the following abbreviations:
  • F model—based on frequency (F) of source usage.
  • P model—based on cumulative pageviews (P) of the article in which source appears.
  • PR model—based on cumulative pageviews (P) of the article in which source appears divided by number of the references (R) in this article.
  • PL model—based on cumulative pageviews (P) of the article in which source appears divided by article length (L).
  • Pm model—based on daily pageviews median (Pm) of the article in which source appears.
  • PmR model—based on daily pageviews median (Pm) of the article in which source appears divided by number of the references (R) in this article.
  • PmL model—based on daily pageviews median (Pm) of the article in which source appears divided by article length (L).
  • A model—based on number of authors (A) of the article in which source appears.
  • AR model—based on number of authors (A) of the article in which source appears divided by number of the references (R) in this article.
  • AL model—based on number of authors (A) of the article in which source appears divided by article length (L).
The frequency of source usage in the F model means how many references contain the analyzed domain in their URL. This method was commonly used in related works [,,,]. Here we take into account the total number of appearances of such references, i.e., if the same source is cited 3 times, we count the frequency as 3. Equation (1) shows the calculation for the F model.
F(s) = \sum_{i=1}^{n} C_s(i),
where s is the source, n is the number of considered Wikipedia articles, and C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i.
Pageviews, i.e., the number of times a Wikipedia article was displayed, is correlated with its quality []. We can expect that articles read by many people are more likely to have verified and reliable sources of information: the more people read the article, the more people can notice an inappropriate source and the faster one of the readers decides to make changes.
The P model includes, in addition to the frequency of the source, the cumulative pageviews of the articles in which this source appears. Therefore, a source that was mentioned in a reference in one popular article can have a bigger value than a source that was mentioned in several less popular articles. Equation (2) presents the calculation of the measure using the P model.
P(s) = \sum_{i=1}^{n} C_s(i) \cdot V(i),
where s is the source, n is the number of considered Wikipedia articles, C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i, and V(i) is the cumulative pageviews value of article i.
The PR model uses cumulative pageviews divided by the total number of references in the considered article. Unlike the previous model, here we take into account the visibility of the references using the analyzed source. We assume that, in general, the more references in the article, the less visible a specific reference is. Equation (3) shows the calculation of the measure using the PR model.
PR(s) = \sum_{i=1}^{n} \frac{V(i)}{C(i)} \cdot C_s(i),
where s is the source, n is the number of considered Wikipedia articles, C(i) is the total number of references in article i, C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i, and V(i) is the cumulative pageviews value of article i.
Another important aspect of the visibility of each reference is the length of the entire article. Therefore, we provide an additional PL model that operates on the principles described in Equation (4).
PL(s) = \sum_{i=1}^{n} \frac{V(i)}{T(i)} \cdot C_s(i),
where s is the source, n is the number of considered Wikipedia articles, T(i) is the length of the source code (wiki text) of article i, C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i, and V(i) is the cumulative pageviews value of article i.
The popularity of an article can be measured in different ways. As proposed in [], we decided to also measure pageviews as the daily pageviews median (Pm) of individual articles. Thereby we provide the additional models Pm, PmR and PmL, which are modified versions of models P, PR and PL, respectively. The modification consists in replacing cumulative pageviews with the daily pageviews median.
As the pageviews value of an article relates to readers, we also propose measures addressing popularity among authors, i.e., the number of users who decided to add content or make changes to the article. Given the assumptions of the previous models, we propose analogous models related to authors: models A, AR and AL are described in Equations (5)–(7), respectively.
A(s) = \sum_{i=1}^{n} C_s(i) \cdot E(i),
where s is the source, n is the number of considered Wikipedia articles, C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i, and E(i) is the total number of authors of article i.
AR(s) = \sum_{i=1}^{n} \frac{E(i)}{C(i)} \cdot C_s(i),
where s is the source, n is the number of considered Wikipedia articles, C(i) is the total number of references in article i, C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i, and E(i) is the total number of authors of article i.
AL(s) = \sum_{i=1}^{n} \frac{E(i)}{T(i)} \cdot C_s(i),
where s is the source, n is the number of considered Wikipedia articles, T(i) is the length of the source code (wiki text) of article i, C_s(i) is the number of references using source s (e.g., the domain in a URL) in article i, and E(i) is the total number of authors of article i.
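Putting Equations (1)–(7) together, all ten scores can be accumulated in a single pass over per-article statistics. The following Python sketch illustrates this; the input field names (`refs`, `views`, `daily_views`, `length`, `authors`) are illustrative assumptions, not the paper's actual data schema:

```python
from collections import defaultdict
from statistics import median

def score_sources(articles):
    """Compute the ten model scores for every source.

    `articles` is an iterable of dicts with illustrative keys:
      refs        - list of source domains cited (one entry per reference)
      views       - cumulative pageviews V(i) for the period
      daily_views - list of daily pageview counts (for the median Pm)
      length      - length of the wiki text T(i)
      authors     - number of unique authors E(i)
    """
    scores = defaultdict(lambda: defaultdict(float))
    for a in articles:
        c_total = len(a["refs"])        # C(i)
        pm = median(a["daily_views"])   # daily pageviews median Pm(i)
        counts = defaultdict(int)       # C_s(i) for each source s
        for s in a["refs"]:
            counts[s] += 1
        for s, c_s in counts.items():
            scores[s]["F"] += c_s
            scores[s]["P"] += c_s * a["views"]
            scores[s]["PR"] += a["views"] / c_total * c_s
            scores[s]["PL"] += a["views"] / a["length"] * c_s
            scores[s]["Pm"] += c_s * pm
            scores[s]["PmR"] += pm / c_total * c_s
            scores[s]["PmL"] += pm / a["length"] * c_s
            scores[s]["A"] += c_s * a["authors"]
            scores[s]["AR"] += a["authors"] / c_total * c_s
            scores[s]["AL"] += a["authors"] / a["length"] * c_s
    return scores
```

Each inner addition corresponds term-by-term to one summand in the equations above; ranking a model then amounts to sorting sources by the chosen score.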
It is important to note that for pageview measures connected with sources extracted at the end of the assessed period, we use data for the whole period (month). For example, if references were extracted based on dumps as of 1 March 2020, then we considered pageviews of the articles for the whole of February 2020.

4. Extraction of Wikipedia References

The Wikimedia Foundation backs up each language version of Wikipedia at least once a month and stores the backups on a dedicated server as “Database backup dumps”. Each file contains different data related to Wikipedia articles. Some of them contain the source code of Wikipedia pages in wiki markup; others describe individual elements of articles: headers, category links, images, external or internal links, page information and more. There are even files that contain the whole edit history of each Wikipedia page.
The variety of dump files makes it possible to extract the necessary data in different ways. Some of them allow obtaining results in a relatively short time using a simple parser. However, other important information may be missing from such files. Therefore, in this section we describe two methods of extracting data about references in Wikipedia.

4.1. Basic Extraction

References often contain links to different external sources (websites). For each language version of Wikipedia we used the dump file with external URL link records in order to extract the URLs from the rendered versions of Wikipedia articles. For instance, for English Wikipedia we used the dump file from March 2020: “enwiki-20200301-externallinks.sql.gz”. This file contains data about external links placed on all pages of the selected language version of Wikipedia; therefore, we took into account only links placed in the article namespace (ns0). We extracted over 280 million external links from the 55 considered language versions of Wikipedia. Table 1 shows the extraction statistics based on dumps from March 2020: the total number of articles, the number of articles with a certain number of external links (URLs), and the total and unique numbers of external links in different language versions of Wikipedia.
Table 1. Total number of articles, number of articles with a certain number of external links (URLs), and total and unique numbers of external links in different language versions of Wikipedia. Source: own calculations based on Wikimedia dumps in March 2020 using basic extraction of external links.
Analysis of the external links showed that the largest share of articles with at least one link is found in Swedish Wikipedia: 96%. English Wikipedia has a slightly lower value of this indicator: about 91% of articles have at least 1 external link. However, English Wikipedia has the largest share of articles with at least 100 external links: 1% of all articles in this language. The highest total number of external links per article is found in the Catalan (12.7), English (11.5) and Russian (10.1) Wikipedias.
Based on the extraction of external links, we can find which of the domains (or subdomains) are often used in Wikipedia articles. Figure 1 shows the most popular domains (and subdomains) in over 280 million external links from 55 language versions of Wikipedia.
Figure 1. The most popular domains in over 280 million external links from 55 language versions of Wikipedia. Source: own calculations based on Wikimedia Dumps as of March 2020 using basic extraction method. The most popular domains in external links in other language versions are available on the web page: http://data.lewoniewski.info/sources/basic.
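Counting the most popular domains reduces to extracting the domain (netloc) from each external link. A minimal sketch using the Python standard library (stripping the “www.” prefix is one possible normalization; the paper does not specify its exact rule):

```python
from urllib.parse import urlparse
from collections import Counter

def domain_counts(urls):
    """Count how often each domain (or subdomain) appears in a list of URLs."""
    counts = Counter()
    for url in urls:
        netloc = urlparse(url).netloc.lower()
        if netloc.startswith("www."):
            netloc = netloc[4:]  # treat www.example.com and example.com alike
        if netloc:
            counts[netloc] += 1
    return counts

links = [
    "https://web.archive.org/web/2020/http://example.com",
    "http://www.imdb.com/title/tt0111161/",
    "https://imdb.com/name/nm0000151/",
]
print(domain_counts(links).most_common(2))
# [('imdb.com', 2), ('web.archive.org', 1)]
```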
It is important to note that although imdb.com (the Internet Movie Database) is included in the list of sites which are generally unacceptable in English Wikipedia [], this resource ranks second in the list of the most commonly used websites in Wikipedia articles. The top 10 most commonly used websites also contains: web.archive.org (Wayback Machine), viaf.org (Virtual International Authority File), int.soccerway.com (Soccerway, a website on football), tvbythenumbers.zap2it.com (TV by the Numbers), animaldiversity.org (Animal Diversity Web), deadline.com (Deadline Hollywood), variety.com (Variety, an American weekly entertainment magazine), webcitation.org (WebCite, an on-demand archiving service), and officialcharts.com (The Official UK Charts Company).
The obtained results can be used for further analysis. However, the basic extraction method, despite its relative simplicity, has some disadvantages. For example, we can extract all external links from an article using the basic extraction method, but we will miss information about the placement of each link in the article (e.g., whether it was placed in a reference). Another problem is excluding irrelevant links, such as archived copies of a source (when the original is present and available), links generated automatically when the source has special identifiers or templates, links to other pages of Wikimedia projects (often they show additional information about the article but are not the source of information) and others. Therefore, we decided to conduct a more complex extraction based on the source code of each Wikipedia article. This method is described in the next subsection.

4.2. Complex Extraction

Using Wikipedia dumps from March 2020, we extracted all references from over 40 million articles in 55 language editions that have at least 100,000 articles and an article depth index of at least 5 in recent years, as proposed in []. Complex extraction was based on the source code of the articles. Therefore, we used a different dump file than in the basic extraction: for example, for English Wikipedia we used the dump file as of March 2020, “enwiki-20200301-pages-articles.xml.bz2”.
In wiki code, references are usually placed between the special tags <ref>…</ref>. Each reference can be named by adding a “name” parameter to this tag: <ref name=”...”>...</ref>. After such a reference has been defined in an article, it can be placed elsewhere in the article using only <ref name=”...” />. This is how the same reference can be used several times with the default wiki markup. However, there are other possibilities for doing so. Depending on the language version of Wikipedia, special templates with specific names and sets of parameters can also be used. Some of them do not even need to be placed inside a <ref>...</ref> tag.
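The tag patterns above can be located with a regular expression. The following simplified Python sketch covers full <ref>…</ref> tags and self-closing reuse tags; real extraction must additionally handle citation templates, language-specific variants and nesting:

```python
import re

# Matches <ref>...</ref> (optionally with attributes such as name="...")
# or a self-closing reuse tag <ref name="..." />.
REF_RE = re.compile(r"<ref[^>/]*>(.*?)</ref>|<ref[^>]*/>",
                    re.DOTALL | re.IGNORECASE)

def extract_refs(wikitext):
    """Return the body of every <ref> tag; reuse tags yield None bodies."""
    return [m.group(1) for m in REF_RE.finditer(wikitext)]

text = 'Fact.<ref name="a">{{cite web|url=http://example.com}}</ref> More.<ref name="a" />'
print(extract_refs(text))
# ['{{cite web|url=http://example.com}}', None]
```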
In general, we can divide references into two groups: with and without a special template. References without a special template usually contain the URL of the source and an optional description (e.g., title). References with special templates can have different data describing the source. Here, in separate fields, one can add information about author(s), title, URL, format, access date, publisher and others. The set of possible parameters with predefined names depends on the language version and the type of template, which can describe a book, journal, web source, news item, conference and others. Figure 2 shows the most commonly used templates in <ref> tags in English Wikipedia. Among the most commonly used templates in this language version are: ’Cite web’, ’Cite news’, ’Cite book’, ’Cite journal’, National Heritage List for England (’NHLE’), ’Citation’, ’Webarchive’, ’ISBN’, ’In lang’, ’Dead link’, Harvard citation no brackets (’Harvnb’) and ’Cite magazine’. In order to extract information about sources we created our own algorithms that take into account the different names of reference templates and parameters in each language version of Wikipedia. The most commonly used parameters in this language version are: title, url, accessdate, date, publisher, last, first, work, website and access-date.
Figure 2. The most popular templates used in references in English Wikipedia. Source: own calculations based on Wikimedia Dumps as of March 2020. The most popular templates in other language versions are presented on the web page: http://data.lewoniewski.info/sources/templates.
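Template calls such as {{Cite web|url=…|title=…}} can be split into a template name and a parameter dictionary. A simplified sketch (it ignores nested templates and wiki links, which a production parser must handle):

```python
def parse_template(text):
    """Parse a citation template like '{{cite web|url=...|title=...}}'
    into (name, params). Simplified: no nested templates or links."""
    inner = text.strip().lstrip("{").rstrip("}")
    parts = inner.split("|")
    name = parts[0].strip().lower()
    params = {}
    for i, part in enumerate(parts[1:], start=1):
        if "=" in part:
            key, _, value = part.partition("=")
            params[key.strip().lower()] = value.strip()
        else:
            params[str(i)] = part.strip()  # positional parameter
    return name, params

name, params = parse_template("{{Cite web|url=http://example.com|title=Example}}")
print(name, params)
# cite web {'url': 'http://example.com', 'title': 'Example'}
```

Positional parameters (numbered keys) matter for identifier templates such as “ISBN|...|...”, where several values can follow the template name.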
It is important to note that the presence of some references cannot be identified directly from the source (wiki) code of the articles. Sometimes infoboxes or other templates in a Wikipedia article can add references to the rendered version of the article. Figure 3 shows such a situation using the example of a table with references in the Wikipedia article “2019–2020 coronavirus pandemic” that was added using the template “2019–2020 coronavirus pandemic data”. In our approach we include such references in the analysis when such templates appear in the Wikipedia articles.
Figure 3. Table with references in the Wikipedia article 2019–2020 coronavirus pandemic that was added using template 2019–2020 coronavirus pandemic data. Source [].
Some of the most popular templates allow adding identifiers to the source, such as DOI, JSTOR, PMC, PMID, arXiv, ISBN, ISSN, OCLC and others. Identifiers such as DOI, ISBN and ISSN can also be described as separate templates: for example, the value for the “doi” parameter can be written as “doi|...”. Moreover, some templates allow inserting several identifiers in one reference: the templates for ISBN and ISSN identifiers accept two or more values, so one can write “ISBN|...|...” or “ISSN|...|...|...”. Table 2 shows the extraction statistics of references with DOI, ISBN, ISSN, PMID and PMC identifiers. Table 3 shows the extraction statistics of references with arXiv, Bibcode, JSTOR, LCCN and OCLC identifiers.
Table 2. Total and unique number of references with special identifiers: DOI, ISBN, ISSN, PMID, PMC. Source: own calculations based on Wikimedia dumps as of March 2020 using complex extraction of references.
Table 3. Total and unique number of references with special identifiers: arXiv, Bibcode, JSTOR, LCCN, OCLC. Source: own calculations based on Wikimedia dumps as of March 2020 using complex extraction of references.
Special identifiers can determine similarity between references even if they have different parameters in the description (e.g., titles in other languages). The unification of these references can be done based on the identifiers. For example, if a reference has the DOI number “10.3390/computers8030060”, we give it the URL “https://doi.org/10.3390/computers8030060”. More detailed information about the identifiers we used to unify the references is shown in Table 4.
Table 4. Identifiers that were used for URL unification of references.
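The unification step can be sketched as a lookup from identifier parameters to canonical URLs. The resolver URL patterns below are common forms and may differ from the exact mapping in Table 4:

```python
# Illustrative subset of identifier-to-URL rules (the full mapping is in Table 4).
ID_URL_PATTERNS = {
    "doi": "https://doi.org/{}",
    "pmid": "https://pubmed.ncbi.nlm.nih.gov/{}",
    "pmc": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC{}",
    "arxiv": "https://arxiv.org/abs/{}",
}

def unify_reference(params):
    """Map a reference's identifier parameters to a canonical URL, if any."""
    for key, pattern in ID_URL_PATTERNS.items():
        value = params.get(key)
        if value:
            return pattern.format(value.strip())
    return params.get("url")  # fall back to the explicit URL parameter

print(unify_reference({"title": "...", "doi": "10.3390/computers8030060"}))
# https://doi.org/10.3390/computers8030060
```

With such a mapping, two references that describe the same work in different languages collapse to the same URL and can be counted as one unique source.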
One of the advantages of the complex extraction method (compared to the basic one described in the previous subsection) is the ability to distinguish between types of source URLs: the actual link to the page and an archived copy. For linking to web archiving services such as the Wayback Machine, WebCite and others, the special template “Webarchive” can be used. In most cases the template needs only two arguments: the archive URL and the date. This template is used in different languages and sometimes has different names. Additionally, within a single language this template can be called using other names, which are redirects to the original one. For example, in English Wikipedia alternative names of this template can be used: “Weybackdate”, “IAWM”, “Webcitation”, “Wayback”, “Archive url”, “Web archive” and others. Using information from those templates we found the most frequent domains of web archiving services in references.
It is important to note that, depending on the language version of Wikipedia, the template for archived URL addresses can have its own set of parameters and its own way of generating the final URL of the link to the source. For example, in English Wikipedia the template Webarchive has a url parameter which must contain the full URL from the web archiving service. At the same time, the related template Webarchiv in German Wikipedia also has other ways to define a link to an archived source: one can provide the URL of the original source page (as it was before archiving) using the url parameter and (or) additionally use parameters depending on the archive service: “wayback”, “archive-is”, “webciteID” and others. In this case, to extract the full URL of the archived web page, we need to know how the inserted value of each parameter affects the final link shown to the reader of the Wikipedia article in each language version.
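The language-dependent reconstruction of archive URLs can be sketched as follows; the parameter semantics shown here are illustrative simplifications of the English “Webarchive” and German “Webarchiv” templates, not their full rule sets:

```python
def archive_url(lang, params):
    """Hypothetical sketch: rebuild the final archive URL from template
    parameters, whose meaning differs per language version."""
    if lang == "en":
        # English "Webarchive": the full archive URL is given directly.
        return params.get("url")
    if lang == "de":
        # German "Webarchiv" (simplified): a Wayback snapshot timestamp
        # plus the URL of the original page.
        if "wayback" in params and "url" in params:
            return "https://web.archive.org/web/{}/{}".format(
                params["wayback"], params["url"])
        return params.get("url")
    return params.get("url")
```

Each supported language version thus needs its own rule for turning template parameters into the link the reader actually sees.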
In the extraction we also took into account short citations from the “Harvard citation” family of templates, which use parenthetical referencing. These templates are generally used as in-line citations that link to the full citation (with the full metadata of the source). This enables a specific reference to be cited multiple times with an additional specification (such as a page number) and other details (comments). We included the following templates in the analysis: “Harv” (Harvard citation), “Harvnb” (Harvard citation no brackets), “Harvtxt” (Harvard citation text), “Harvcol”, “Harvcolnb”, “Sfn” (shortened footnote) and others. Depending on the language version of Wikipedia, each template can have a corresponding name and additional synonymous names. For example, in English Wikipedia “Harvard citation”, “Harv” and “Harvsp” refer to the same template (with the same rules), while the corresponding template in French has names such as “Référence Harvard”, “Harvard” and also “Harv”.
Taking into account the unification of URLs based on special identifiers, excluding URLs of archived copies of the sources, and including special templates outside <ref> tags, we counted the number of all and unique references in each considered language version. Table 5 presents the total number of articles, the number of articles with at least 1 reference, at least 10 references and at least 100 references, and the total and unique numbers of references in each considered language version of Wikipedia.
Table 5. Total number of articles, number of articles with at least 1 reference, at least 10 references and at least 100 references, and the total and unique numbers of references in each considered language version of Wikipedia. Source: own calculations based on Wikimedia dumps as of March 2020 using complex extraction of references.
Analysis of the numbers of references extracted by complex extraction showed different statistics compared to the basic extraction of external links described in Section 4.1. The largest share of articles with at least one reference is found in Vietnamese Wikipedia: 84.8%. Swedish, Arabic, English and Serbian Wikipedia have 83.5%, 79.2%, 78.2% and 78.1% shares of such articles, respectively. If we consider only articles with at least 100 references, then the largest share of such articles is found in Spanish Wikipedia: 3.5%. English, Swedish and Japanese Wikipedia have 1.1%, 0.9% and 0.8% shares of such articles, respectively. However, the largest total number of references per article is found in English Wikipedia: 9.6 references. Relatively large numbers of references per article are also found in Spanish (9.2) and Japanese (7.1) Wikipedia.
The English Wikipedia has the largest number of references with a DOI identifier (over 2 million) and, at the same time, the largest average number of references with DOI per article (34.3%). However, the largest shares of references with DOI among all references belong to the Galician (8.4%) and Ukrainian (6.6%) Wikipedia.
The English Wikipedia also has the largest number of references with an ISBN identifier (over 3.5 million) and, at the same time, the largest average number of references with ISBN per article (34.3%). However, the largest shares of references with ISBN among all references belong to the Kazakh (20.3%) and Belarusian (13.1%) Wikipedia.
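Detecting identifier-bearing references can be approximated with simple patterns over the citation-template parameters. The regexes below are illustrative sketches (real DOI and ISBN syntax is more permissive than shown):

```python
import re

# Illustrative patterns for identifiers in citation-template parameters.
DOI_RE = re.compile(r"\bdoi\s*=\s*(10\.\d{4,9}/\S+)", re.IGNORECASE)
ISBN_RE = re.compile(r"\bisbn\s*=\s*([\d\-Xx]{10,17})", re.IGNORECASE)

refs = [
    "{{cite journal|doi=10.1000/xyz123|title=Example}}",
    "{{cite book|isbn=978-3-16-148410-0}}",
    "{{cite web|url=http://example.org}}",
]

# Counts of references carrying each identifier type.
with_doi = sum(1 for r in refs if DOI_RE.search(r))
with_isbn = sum(1 for r in refs if ISBN_RE.search(r))
```

Such identifiers are also what allows two different URLs pointing at the same paper or book to be unified into one source, as mentioned earlier in the section.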
Based on the extraction of URLs from the obtained references, we can find which domains (or subdomains) are most often used in Wikipedia articles. Figure 4 shows the most popular domains (and subdomains) among over 200 million references of Wikipedia articles in 55 language versions. Comparing the results with the basic extraction (see Section 4.1), we observed some changes in the top 10 most commonly used sources in references: deadline.com (Deadline Hollywood), tvbythenumbers.zap2it.com (TV by the Numbers), variety.com (Variety, an American weekly entertainment magazine), imdb.com (Internet Movie Database), newspapers.com (a historic newspaper archive), int.soccerway.com (Soccerway, a website on football), web.archive.org (Wayback Machine), oricon.co.jp (Oricon Charts), officialcharts.com (The Official UK Charts Company), gamespot.com (GameSpot, a video game website).
Figure 4. The most popular domains in URL of references of Wikipedia articles in 55 language versions. Source: own calculations based on Wikimedia Dumps as of March 2020 using complex extraction method.
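Grouping references by domain can be sketched with the standard library; the URL list below is a hypothetical sample:

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical sample of reference URLs.
urls = [
    "https://variety.com/2020/film/news/example/",
    "https://web.archive.org/web/2019/http://old.example.org/",
    "http://variety.com/other-article",
    "https://www.imdb.com/title/tt0000001/",
]

def domain_of(url: str) -> str:
    """Lower-cased domain of a URL with a leading 'www.' stripped."""
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

# Domains ordered by how often they appear in references.
top_domains = Counter(domain_of(u) for u in urls).most_common()
```

Note that archive services such as web.archive.org would dominate such counts unless archived copies are unified back to the original URL, as done in the complex extraction.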

5. Assessment of Sources

To assess the references based on the proposed models, apart from the extraction of the sources we also extracted data related to page views, length of the articles and number of authors. We used different dump files that are available on “Wikimedia Downloads” [].
Based on the complex extraction method, we measured the popularity and reliability of the sources in references. Due to space limitations, in this paper we often use the F or PR model to show various rankings of sources. The exception is situations where we compare the 10 proposed models for popularity and reliability assessment of the sources in Wikipedia. Additionally, in the tables we limit the number of languages to some of the most developed ones: Arabic (ar), German (de), English (en), Spanish (es), Persian (fa), French (fr), Italian (it), Japanese (ja), Dutch (nl), Polish (pl), Portuguese (pt), Russian (ru), Swedish (sv), Vietnamese (vi), Chinese (zh). A more extended version of the results is available on the web page: http://data.lewoniewski.info/sources/. For example, figures that show the most popular and reliable sources for each considered language version of Wikipedia using the F-model (http://data.lewoniewski.info/sources/modelf) and the PR-model (http://data.lewoniewski.info/sources/modelpr) are placed there.
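As a rough sketch of how several of the models could combine per-article measures for one source, consider the following. The aggregation formulas and the `articles` data are assumptions inferred from the model descriptions in the text (F: frequency; P: cumulative page views; Pm: median of daily page views; PR/PL: page views per references count/length; AR/AL: authors per references count/length), not the authors' exact definitions:

```python
from statistics import median

# Hypothetical per-article data for articles citing one source: daily page
# views for a month, article length, number of references and of authors.
articles = [
    {"daily_views": [100, 120, 80], "length": 5000, "refs": 10, "authors": 4},
    {"daily_views": [10, 20, 30], "length": 2000, "refs": 5, "authors": 2},
]

def model_scores(arts):
    """Score one source under several of the proposed models."""
    return {
        "F": len(arts),                                               # frequency of use
        "P": sum(sum(a["daily_views"]) for a in arts),                # cumulative page views
        "Pm": sum(median(a["daily_views"]) for a in arts),            # median daily page views
        "PR": sum(sum(a["daily_views"]) / a["refs"] for a in arts),   # page views per reference
        "PL": sum(sum(a["daily_views"]) / a["length"] for a in arts), # page views per length
        "AR": sum(a["authors"] / a["refs"] for a in arts),            # authors per reference
        "AL": sum(a["authors"] / a["length"] for a in arts),          # authors per length
    }

s = model_scores(articles)
```

Ranking all sources by any one of these scores yields the per-model rankings compared in the following sections.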
Table A1 shows the positions in the local rankings of the most popular and reliable sources in some of the most developed language versions of Wikipedia in February 2020 using the PR model. In this table it is possible to compare the rank of a source that has a leading position in at least one language version with its ranks in other languages. For example, “taz.de” (Die Tageszeitung) is in 3rd place in the German Wikipedia in February 2020, while in the same period this source is in 692nd, 785th and 996th place in the French, Persian and Polish Wikipedia, respectively. In the French Wikipedia the most reliable source in February 2020 was “irna.ir” (Islamic Republic News Agency), while in the English Wikipedia it is in 8072nd place. However, this source is not mentioned at all in the Polish and Swedish Wikipedia. Another example: in the Russian Wikipedia the most reliable source in February 2020, “lenta.ru”, was in 1st place, while it is in 166th, 310th, 325th and 352nd place in the Polish, Vietnamese, German and Arabic Wikipedia, respectively. There are also sources that have relatively high positions in all language versions: “variety.com” and “deadline.com” are always in the top 20, “imdb.com” is in the top 20 in almost all languages (except Japanese), and “who.int” is in the top 100 of reliable sources in each considered language.

6. Similarity of Models

According to the results presented in the previous section, each source can be placed on a different position in the ranking of the most reliable sources depending on the model. It is worthwhile to check how similar the results obtained by different models are. For this purpose we used Spearman’s rank correlation, which quantifies, on a scale from −1 to 1, the degree to which two rankings are associated. Initially we took only sources that appeared in the top 100 in at least one of the rankings of the most popular and reliable sources in multilingual Wikipedia in February 2020. Altogether, we obtained 180 sources and their positions in each of the rankings. Table 6 shows Spearman’s correlation coefficients between these rankings.
Table 6. Spearman’s correlation coefficients between rankings of the top 100 most popular and reliable sources in multilingual Wikipedia in February 2020 using different models.
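The rank-correlation computation can be reproduced with a small pure-Python helper. This sketch uses the classic formula for untied ranks; the toy scores are hypothetical:

```python
def ranks(values):
    """1-based ranks, highest value gets rank 1 (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy model scores for five sources (hypothetical numbers).
f_scores = [50, 40, 30, 20, 10]
p_scores = [10, 20, 30, 40, 50]
rho = spearman(f_scores, p_scores)  # perfectly reversed rankings give rho = -1
```

In practice a library routine that handles tied ranks (such as `scipy.stats.spearmanr`) would be used for tables of this size.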
We can observe that the highest correlation is between the rankings based on the P and Pm models (0.99). This can be explained by the similarity of the measures in these models: the first is based on cumulative page views and the latter on the median of daily page views in a given month.
Another pair of similar rankings is the PL and PR models (0.98). Both measures use total page view data; in the first model the value of this measure is divided by the article length, in the second by the number of references. As we mentioned in Section 2 and Section 3, the number of references and the article length are very important in quality assessment of Wikipedia articles and are also correlated: we can expect that longer articles have a larger number of references.
In connection with the previously described similarity between P and Pm, we can also explain the similarity between the PL and PmR models, with a Spearman’s correlation coefficient of 0.97.
The lowest similarity is between the F and P models (0.37). This comes from the different nature of these measures. In Wikipedia anyone can create and edit content; however, not every change in Wikipedia articles is checked by a specialist in the field, for example by verifying the reliability of sources inserted in the references. Despite the fact that some sources are used frequently, there is a chance that they have not been verified yet and have not been replaced by more reliable sources. The next pair of rankings with a low correlation is the Pm and F models. Such a low correlation is obviously connected with the similarity of the page view measures (P and Pm).
It is also important to note the low similarity between the rankings based on the AR and P models (0.41). Such differences can be connected with the measures used in these models: the AR model uses the number of authors over the whole edit history of an article divided by the number of references, whereas P uses page view data for a selected month.
In the second iteration we extended the number of sources to top 10,000 in each ranking of the most popular and reliable sources in multilingual Wikipedia in February 2020. We obtained 19,029 sources. Table 7 shows Spearman’s correlation coefficients between these extended rankings.
Table 7. Spearman’s correlation coefficients between rankings of the top 10,000 most popular and reliable sources in multilingual Wikipedia in February 2020 using different models.
In the case of the extended rankings (top 10,000) there are no significant changes in the Spearman’s correlation coefficient values compared to the top 100 rankings in Table 6. However, it should be noted that the largest difference in coefficient values appears between the PR and A models: 0.26 (0.82 in the top 100 and 0.56 in the top 10,000).
The heatmap in Figure 5 shows Spearman’s correlation coefficients between rankings of the top 100 most reliable sources in each language version of Wikipedia in February 2020 obtained by F-model in comparison with other models.
Figure 5. Spearman’s correlation coefficients between rankings of the top 100 most reliable sources in each language version of Wikipedia in February 2020 obtained by F-model in comparison with other models. Interactive version of the heatmap is available on the web page: http://data.lewoniewski.info/sources/heatmap/.
Comparing the results of Spearman’s correlation coefficients within each considered language version of Wikipedia, we can find that the largest average correlation between the F-model and the other models is for the Japanese (ja) and English (en) Wikipedia: 0.61 and 0.59, respectively. The smallest average values of the correlation coefficients among languages belong to the Catalan (ca) and Latin (la) Wikipedia: 0.16 and 0.19, respectively. Considering coefficient values across all languages for each pair of the F-model and another model, the largest average value belongs to the F/AL pair (0.71) and the smallest to the F/PmR pair (0.18).

7. Classification of Sources

7.1. Metadata from References

Based on citation templates in Wikipedia we are able to find more information about a source: authors, publication date, publisher and more. Using such metadata we decided to find which publishers and journals are the most popular and reliable.
We first analyzed the values of the publisher parameter in the citation templates of the references of articles in the English Wikipedia (as of March 2020). We found over 18 million references with citation templates that have a value in the publisher parameter. Figure 6 shows the most commonly used publishers based on this analysis.
Figure 6. The most commonly used titles in publisher parameter of citations templates in the references of articles in English Wikipedia in March 2020. Source: own calculations based on Wikimedia dumps using complex extraction method.
Within the publisher parameter in references, the following names are found most often: United States Census Bureau, Oxford University Press, BBC, BBC Sport, Cambridge University Press, Routledge, National Park Service, AllMusic, Yale University Press, BBC News, Prometheus Global Media, United States Geological Survey, ESPN, CricketArchive, International Skating Union, Official Charts Company.
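Extracting a parameter such as publisher from citation templates can be sketched as below. This naive parser assumes no nested templates inside the citation (an assumption real wikitext often violates), and the sample `text` is hypothetical:

```python
import re

# Match one citation template and capture its parameter string.
CITE_RE = re.compile(r"\{\{\s*cite \w+\s*\|([^{}]*)\}\}", re.IGNORECASE)

def template_params(wikitext):
    """Yield a dict of parameter name -> value for each citation template."""
    for m in CITE_RE.finditer(wikitext):
        params = {}
        for part in m.group(1).split("|"):
            if "=" in part:
                key, _, val = part.partition("=")
                params[key.strip().lower()] = val.strip()
        yield params

text = ('<ref>{{cite web|url=http://bbc.co.uk/x|publisher=BBC}}</ref>'
        '<ref>{{cite journal|journal=Nature|title=Y}}</ref>')
publishers = [p["publisher"] for p in template_params(text) if "publisher" in p]
journals = [p["journal"] for p in template_params(text) if "journal" in p]
```

Counting the extracted values across all references yields frequency rankings like those in Figure 6 and Figure 7.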
Using different popularity and reliability models we assessed all publishers based on the related parameter in citation templates placed in references of the English Wikipedia. Table 8 shows the most popular and reliable publishers with their positions in the ranking depending on the model.
Table 8. Position in rankings of publishers in English Wikipedia depending on popularity and reliability model in February 2020. Source: own calculation based on Wikimedia dumps using complex extraction and using only values from publisher parameter of citation templates in references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table8.
Comparing the differences between the ranking positions of the publishers using different models, we observed that some of the sources always have leading positions: Oxford University Press (1st or 2nd place depending on the model), BBC (2nd-5th place), Cambridge University Press (2nd-5th place), Routledge (3rd-6th place), BBC News (5th-10th place).
Some publishers have a high position only in a few models. For example, “United States Census Bureau” takes 1st place in the F model (frequency) and the AR model (authors per references count). At the same time, in the P (pageviews) model and the PL model (pageviews per length of the text), this source takes 27th and 11th place, respectively. Another of the most frequently used publishers in Wikipedia, “National Park Service”, takes 7th place; however, it takes only 94th and 58th place in the P (pageviews) and PmL (pageviews median per length of the text) models, respectively. The publisher “Springer” takes 5th place in the PmR model (pageviews median per references count), but only 19th place in the F model (frequency). CNN takes 2nd place in the P (pageviews) and Pm (pageviews median) models, but at the same time 22nd and 16th place in the F (frequency) and AR (authors per references count) models, respectively. The Wikimedia Foundation as a source is in the top 10 in the P (pageviews) model, but at the same time far from a leading position in the F (frequency) and AR (authors per references count) models: 5541st and 3008th place, respectively.
It is important to note that this ranking of publishers only takes into account references with a filled publisher parameter in citation templates in the English Wikipedia; therefore, it cannot show complete information about leading sources in different languages (especially in those languages where citation templates are rarely used).
Next we extracted the values of the journal parameter in the citation templates of the references from articles in the English Wikipedia. We found over 3 million references with citation templates that have a value in the journal parameter. Figure 7 shows the most commonly used journals based on this analysis. The most commonly used journals were: Nature, Astronomy and Astrophysics, Science, The Astrophysical Journal, Lloyd’s List, PLOS ONE, Monthly Notices of the Royal Astronomical Society, The Astronomical Journal, Billboard.
Figure 7. The most commonly used titles in the journal parameter of citations templates in the references of articles in English Wikipedia in March 2020. Source: own calculations based on Wikimedia dumps using complex extraction method.
Using different popularity and reliability models we assessed all journals based on the related parameter in citation templates placed in references of the English Wikipedia. Table 9 shows the most popular and reliable journals with their positions in the ranking depending on the model. It is important to note that the same journal appears under two different names, “Astronomy and Astrophysics” and “Astronomy & Astrophysics”, because it was written in both ways in citation templates.
Table 9. Position in rankings of journals in English Wikipedia depending on popularity and reliability model in February 2020. Source: own calculation based on Wikimedia dumps using complex extraction and using only values from journal parameter in citation templates in references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table9.
Comparing the differences between the ranking positions of the journals using different models, we can also observe that some of the sources always have leading positions: Nature (1st in all models), Science (2nd-3rd place depending on the model), PLOS ONE (3rd-6th place), The Astrophysical Journal (4th-7th place).
Some journals have a high position only in a few models. For example, “Lancet” takes 3rd place in the P (pageviews) and Pm (pageviews median) models, but only 23rd place in the F (frequency) model. As another example, “Proceedings of the National Academy of Sciences of the United States of America” takes 4th place in the PmR model (pageviews median per references count) and at the same time 13th place in the F (frequency) model. “Proceedings of The National Academy of Sciences” takes 8th place in the PmR model, but 18th position in the F model. There are journals that have significantly different positions depending on the model. A good example is “MIT Technology Review”, which takes 5th place in the P model (pageviews), but only 5565th and 3900th place in the F (frequency) and AR (authors count per references count) models, respectively.
Although the obtained results allow us to compare different metadata related to the sources, we need to take into account a significant limitation of this method: we can only assess the sources in references that use citation templates. Additionally, as we already discussed in Section 4.2, the related parameters of the references are not always filled in by Wikipedians. Therefore, we decided to take into account all references with a URL address and conducted a more complex analysis of the source types based on semantic databases.

7.2. Semantic Databases

Based on information about a URL it is possible to identify the title and other information related to the source. Using Wikidata [,] and DBpedia [,] we found over 900 thousand items (including broadcasters, periodicals, web portals, publishers and others) which have separate domain(s) or subdomain(s) assigned as their official site. Table 10 shows the positions in the global ranking of the most popular and reliable sources with identified titles, based on the items found, in the 55 considered language versions of Wikipedia in February 2020 using different models.
Table 10. Position in the global ranking of the most popular and reliable sources with identified title in 55 considered language versions of Wikipedia depending on the model in February 2020. Source: own calculations based on Wikimedia dumps using complex extraction of references. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table10.
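The alignment of URLs to semantic-database items can be sketched as a lookup from domain to item. The `OFFICIAL_SITES` mapping below is a tiny hypothetical extract (in practice it would be built from Wikidata's official-website property, P856, or from DBpedia), and the item identifiers shown are illustrative:

```python
from urllib.parse import urlparse

# Tiny hypothetical extract of items that have a domain assigned as their
# official website; item identifiers are illustrative.
OFFICIAL_SITES = {
    "variety.com": {"item": "Q573382", "label": "Variety"},
    "imdb.com": {"item": "Q37312", "label": "Internet Movie Database"},
}

def identify_source(url):
    """Map a reference URL to a known item via its (sub)domain, if any."""
    netloc = urlparse(url).netloc.lower()
    if netloc.startswith("www."):
        netloc = netloc[4:]
    return OFFICIAL_SITES.get(netloc)

hit = identify_source("https://www.imdb.com/title/tt0000001/")
```

Once a reference URL is resolved to an item, the item's labels and classes provide the title and type information used in the rankings below.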
Leading positions in the various models are occupied by the following sources: Deadline Hollywood, TV by the Numbers, Variety, Internet Movie Database. “Forbes”, “The Washington Post”, “CNN”, “Entertainment Weekly” and “Oricon” are in the top 20 of all rankings in Table 10. We can also observe sources with relatively big differences in rankings between the models. For example, “Newspapers” (a historic newspaper archive) is in 5th place among the most frequently used sources in Wikipedia, while being in 33rd and 23rd place in the Pm (pageviews median) and PmL (pageviews median per length of the text) models, respectively. As another example, “Soccerway” is in 7th place in the ranking of the most commonly used sources (based on the F model), but in 116th and 100th place in the P and Pm models, respectively. Despite the fact that the “American Museum of Natural History” is in the top 20 of the most commonly used sources in Wikipedia (based on the F model), it is excluded from the top 5000 in the P (pageviews), Pm (pageviews median), PmR (pageviews median per references count) and PmL (pageviews median per length of the text) models.
Table 11 shows the most popular and reliable types of sources in selected language versions of Wikipedia in February 2020 based on the PR model. In almost all language versions, websites are the most reliable sources. Magazines and business-related sources are in the top 10 of the most reliable types of sources in all languages. Film databases are among the most reliable sources in the Arabic, French, Italian, Polish and Portuguese Wikipedia; in the other languages such sources are placed lower than 19th place. The Arabic, English, French, Italian and Chinese Wikipedia prefer newspapers as a reliable source more than the other languages, which place such sources lower in the ranking (but above 14th place). News agencies are more reliable for the Persian Wikipedia compared with the other languages. Government agencies as a source have much higher reliability in the Persian and Swedish Wikipedia than in the other languages. Holding companies provide more reliable information for the Japanese and Chinese languages. In the Dutch and Polish Wikipedia, archive websites have a relatively higher position in the reliability ranking. Periodical sources are more reliable in the German, Spanish and Polish Wikipedia. Review aggregators are more reliable in the Arabic and Polish Wikipedia compared to the other considered languages. Television networks are in 7th place in the German Wikipedia and in 14th place in the Portuguese Wikipedia, while the other languages have such sources even lower than 20th place (down to 125th place). Social networking services are placed in the top 20 of the most reliable types of sources in the Japanese, Polish and Chinese Wikipedia. Weekly magazines are in the top 10 of the English, Italian, Portuguese and Russian Wikipedia.
Table 11. The most popular and reliable types of the sources in selected language versions of Wikipedia in February 2020 based on PR model. Source: own calculations based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table11.
Based on the knowledge about the type of each source, we decided to limit the ranking to a specific area. We chose only periodical sources aligned to one of the following types: online newspaper (Q1153191), magazine (Q41298), daily newspaper (Q1110794), newspaper (Q11032), periodical (Q1002697), weekly magazine (Q12340140). The top of the most reliable periodical sources across all considered language versions of Wikipedia in February 2020 is occupied by: Variety, Entertainment Weekly, The Washington Post, USA Today, People, The Indian Express, The Daily Telegraph, Time, Pitchfork, Rolling Stone.
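Limiting a ranking to periodicals then amounts to a set-intersection filter over each item's classes; the `sources` records and scores below are hypothetical, while the type identifiers are those listed in the text:

```python
# Wikidata class identifiers for periodical types, as listed in the text.
PERIODICAL_TYPES = {"Q1153191", "Q41298", "Q1110794", "Q11032",
                    "Q1002697", "Q12340140"}

# Hypothetical source records with their classes and a model score.
sources = [
    {"label": "Variety", "types": {"Q41298"}, "score": 9.1},
    {"label": "Internet Movie Database", "types": {"Q35127"}, "score": 12.4},
    {"label": "The Washington Post", "types": {"Q11032"}, "score": 7.7},
]

# Keep only periodicals and order them by score.
periodicals = sorted(
    (src for src in sources if src["types"] & PERIODICAL_TYPES),
    key=lambda src: src["score"],
    reverse=True,
)
labels = [src["label"] for src in periodicals]
```

The same filter with a different set of class identifiers would produce domain-specific rankings for any other source type in Table 11.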
The most popular periodical sources in Wikipedia articles from 55 language versions using different popularity and reliability models in February 2020 are shown in Table 12. There are sources that have stable reliability in all models: “Variety” always takes 1st place, “Entertainment Weekly” 2nd-3rd place, “The Washington Post” 2nd-4th place, and “USA Today” 4th-5th place, depending on the model. Despite the fact that “Lenta.ru” is the 6th most commonly used periodical source in the different languages of Wikipedia (using the F model), it is placed in 21st and 19th place using the P and Pm models, respectively. “The Daily Telegraph” is in the top 10 most reliable periodical sources in all models. “People” is in 18th place in the frequency ranking, but at the same time takes 4th place in the PmR model.
Table 12. The most popular periodical sources in Wikipedia articles from 55 language versions using different popularity and reliability models in February 2020. Source: own calculations based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table12.
Given the local rankings of periodicals, we can consider the differences in reliability and popularity between different language versions. Table A2 shows the positions in the local rankings of periodical sources in different language versions of Wikipedia in February 2020 using the PR model. In almost all considered languages (except Dutch), “Variety” takes 1st-4th place in the local rankings of the most reliable periodical sources. Some sources that hold leading positions in local rankings are not present at all as sources in some languages. For example, “Aliqtisadi” (an Arabic news magazine) is in 2nd place in the Arabic Wikipedia, but in the English, Persian, Italian, Japanese and Russian Wikipedia its position is lower than 600th place, and in the other languages it is not present as a source at all. A similar tendency applies to the “Ennahar newspaper”, which takes 5th place in the Arabic Wikipedia. In the German Wikipedia, 2nd, 3rd and 4th place belong to “Die Tageszeitung”, “DWDL.de” and “Auto, Motor und Sport”. For the Spanish Wikipedia the leading local periodical sources are: “20 minutos”, “El Confidencial”, “Entertainment Weekly”, “¡Hola!”. In the Persian Wikipedia one of the most reliable periodical sources is “Donya-e-Eqtesad”, which is not present at all in most of the considered languages. The most reliable sources in the French Wikipedia include: “Le Monde”, “Jeune Afrique”, “Le Figaro”, “Huffington Post France”. The Italian version of Wikipedia contains such most reliable local sources as “la Repubblica”, “Il Post” and “Il Fatto Quotidiano”. In the Japanese Wikipedia the leading reliable sources include “Nihon Keizai Shimbun”, “Tokyo Sports” and “Yomiuri Shimbun”. The Dutch Wikipedia contains “De Volkskrant”, “Algemeen Dagblad”, “Het Laatste Nieuws”, “Trouw” and “NRC Next” among its most reliable periodical sources. The Polish Wikipedia has “Wprost” and “TV Guide” in its top 3 periodical sources. In Portuguese, among the most reliable periodical sources are “Veja” and “Exame”.
“Lenta.ru” and “Komsomolskaya Pravda” are the leading periodical sources in the Russian Wikipedia. The Swedish language version has “Sydsvenskan”, “Dagens Industri” and “Helsingborgs Dagblad” as its leading reliable sources. “VnExpress” takes 1st place among the most reliable periodical sources of the Vietnamese Wikipedia. “Apple Daily” is the most reliable periodical source in the Chinese language version.

8. Temporal Analysis

For the complex extraction of the references, apart from the data from February 2020, we also used dumps from November 2019, December 2019 and January 2020. Based on those data we measured the popularity and reliability of the sources in different months.
Table 13 shows the positions in the rankings of popular and reliable sources with identified titles depending on the period, across all considered language versions of Wikipedia using the PR model. The results showed that some of the sources did not change their position in the ranking based on the PR model. This is especially applicable to sources with leading positions. For example, “Deadline Hollywood”, “Variety”, “Entertainment Weekly”, “Rotten Tomatoes” and “Oricon” each occupied the same place in the top 10 in every studied month. “Internet Movie Database” and “TV by the Numbers” exchanged 3rd and 4th places. This is due to the fact that, in the absolute values of the popularity and reliability measurement obtained using the PR model, most of these sources have significant gaps from their closest competitors.
Table 13. Position in rankings of popular and reliable sources depending on period in all considered language versions of Wikipedia using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify title of the sources. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table13.
Next we decided to limit the list of sources to periodical ones (as was done in Section 7.2). Table 14 shows the positions in the rankings of popular and reliable sources depending on the period, across all considered language versions using the PR model. Similarly to the previous table, we can observe no significant changes in position for the leading sources. In the four considered months the top 10 most reliable periodical sources always included: “Variety”, “Entertainment Weekly”, “The Washington Post”, “People”, “USA Today”, “The Indian Express”, “The Daily Telegraph”, “Pitchfork”, “Time”.
Table 14. Position in rankings of popular and reliable sources depending on period in all considered language versions using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify type of the source. Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table14.
The results showed that, in the case of periodical sources, the positions in the ranking are less “stable” between different months compared to the general ranking. For the reasons already explained, the top 2 sources (Variety and Entertainment Weekly) did not change their positions. Additionally, we can distinguish The Daily Telegraph with a stable 7th place during the whole considered period. Nevertheless, in the top 10 of the most popular and reliable periodical sources of Wikipedia we can observe minor changes in positions. This applies in particular to People, Pitchfork, The Washington Post, USA Today, The Indian Express and Time; those sources grew or fell by 1-2 positions in the top 10 ranking during November 2019-February 2020.
As mentioned before, minor changes in the ranking of sources during the considered period are mainly due to large margins in the absolute values of the popularity and reliability measurement. This applies in particular to leading sources. However, there may be relatively new sources that have significant prerequisites to become leaders (or even outsiders) in the near future. The next section describes the method and results of measuring such changes.

9. Growth Leaders

Wikipedia articles may have a long edit history. Information and sources in such articles can be changed many times. Moreover, the criteria for reliability assessment of the sources can change over time in each language version of Wikipedia. Based on the assessment of the popularity and reliability of each source in Wikipedia in a certain period of time (month), we can compare the differences between the values of the measurement. This can help to find out how popularity and reliability changed (increased or decreased) in a particular month. For example, a certain Internet resource has only recently appeared and people have actively begun to use it as a source of information in Wikipedia articles. Another example: a well known and often used website in Wikipedia references dramatically lost confidence (reputation) as a reliable source, and editors actively start to replace this source with another or place an additional reference next to the existing one. First place in such a ranking means that for the selected source we observed the largest growth of the popularity and reliability score compared to the previous month.
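Computing a growth ranking then amounts to differencing a source's score between consecutive months; the monthly scores below are hypothetical:

```python
# Hypothetical model scores for sources in two consecutive months.
previous_month = {"sourceA": 120.0, "sourceB": 80.0, "sourceC": 60.0}
current_month = {"sourceA": 125.0, "sourceB": 110.0, "sourceC": 55.0}

def growth_ranking(prev, curr):
    """Sources ordered by score growth relative to the previous month.

    Sources missing from a month are treated as having a zero score, so
    newly appearing sources contribute their full current score as growth.
    """
    deltas = {s: curr.get(s, 0.0) - prev.get(s, 0.0)
              for s in set(prev) | set(curr)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

leaders = growth_ranking(previous_month, current_month)
```

The head of `leaders` gives the growth leaders for the month; the tail exposes sources losing popularity or reliability.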
Table 15 shows which of the periodical sources had the largest growth of reliability in selected languages and periods of time based on the F model. For this table we chose only sources which were placed in at least the top 5 of the growth leaders ranking in one of the languages for a selected month. The results show that there are no stable growth leaders among the sources when comparing different periods of time.
Table 15. Position of the periodical sources in growth ranking in selected language versions of Wikipedia and period of time using F model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia). Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table15.
The F model shows how many references in Wikipedia articles contain specific sources. Therefore, we can analyze which sources were most often added to references in Wikipedia articles in the considered month. For example, in December 2019 “Die Tageszeitung” and “Handelsblatt” were the leading growing sources in the German Wikipedia, “Jeune Afrique” and “Les Inrockuptibles” in the French Wikipedia, and “Komsomolskaya Pravda” and “Lenta.ru” in the Russian Wikipedia. In the next month (January 2020) “Süddeutsche Zeitung” and “Die Tageszeitung” were the leading growing sources in the German Wikipedia, “Variety” and “La Montagne” in the French Wikipedia, and “Variety” and “Komsomolskaya Pravda” in the Russian Wikipedia. In the last considered month (February 2020) “Die Tageszeitung” and “Variety” were the leading growing sources in the German Wikipedia, “Jeune Afrique” and “La Montagne” in the French Wikipedia, and “Sport Express” and “Variety” in the Russian Wikipedia.
Table 16 shows which of the sources had the largest growth of reliability in different languages and periods of time based on the PR model. For this table we also chose only sources which were placed in at least the top 5 of the growth leaders ranking in one of the languages for a selected month. The results also showed that there are no stable growth leaders among the sources when comparing different periods of time.
Table 16. Position of the sources in growth ranking in selected language versions of Wikipedia and period of time using PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia). Extended version of the table is available on the web page: http://data.lewoniewski.info/sources/table16.
The PR model shows how many references in Wikipedia articles contain specific sources, taking into account the popularity of the articles. The results showed that in December 2019 “Variety” and “Deutsche Jagd-Zeitung” were the leading growing reliable sources in the German Wikipedia, “Variety” and “Entertainment Weekly” in the French Wikipedia, and “Lenta.ru” and “Entertainment Weekly” in the Russian Wikipedia. In the next month (January 2020) “Die Tageszeitung” and “DWDL.de” were the leading growing sources in the German Wikipedia, “Les Inrockuptibles” and “Le Monde” in the French Wikipedia, and “Variety” and “Lenta.ru” in the Russian Wikipedia. In the last considered month (February 2020) “la Repubblica” and “Algemeen Dagblad” were the leading growing sources in the German Wikipedia, “Atlanta” (magazine) and “Le Figaro étudiant” in the French Wikipedia, and the “New York Post” and “Novosti Kosmonavtiki” in the Russian Wikipedia.

10. Discussion of the Results

This study describes different models for assessing the popularity and reliability of sources in different language versions of Wikipedia. In order to use these models it is necessary to extract information about the sources from references, as well as measures related to the quality and popularity of the Wikipedia articles. We observed that, depending on the model, the positions of websites in the rankings of the most reliable sources can differ. In language versions that are mostly used in a single country (for example Polish, Ukrainian, or Belarusian), the highest positions in such rankings are often occupied by local (national) sources. The community of editors in each language version of Wikipedia can therefore have its own preferences when deciding whether to allow a source in references as confirmation of a certain fact. Thus, the same source can be considered reliable in one language version of Wikipedia, while the community of editors of another language may not accept it in references and may remove or replace it in an article.
The simplest of the proposed models was based on the frequency of occurrences, which is commonly used in related studies. The other nine novel models used various combinations of measures related to the quality and popularity of Wikipedia articles. We analyzed how the results differ depending on the model. For example, when comparing the frequency-based (F) rankings with the other (novel) rankings in each language version of Wikipedia, the AL model shows the highest average similarity (rank correlation coefficient of 0.71) and the PmR model the lowest (0.18).
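The ranking comparison above can be reproduced with a standard Spearman rank correlation. A minimal sketch, assuming each ranking is given as an ordered list of source names (best first) over the same set of sources, with no ties; the example rankings are illustrative:

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two rankings of the same
    sources, given as ordered lists (best first, no ties assumed)."""
    assert set(ranking_a) == set(ranking_b)
    n = len(ranking_a)
    pos_b = {source: i for i, source in enumerate(ranking_b)}
    d_squared = sum((i - pos_b[s]) ** 2 for i, s in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Identical rankings give 1.0; a full reversal gives -1.0
f_rank = ["variety.com", "imdb.com", "bbc.co.uk", "nytimes.com"]
al_rank = ["imdb.com", "variety.com", "bbc.co.uk", "nytimes.com"]
print(round(spearman_rho(f_rank, al_rank), 2))  # 0.8
```

Averaging such coefficients over all language versions yields the per-model similarity figures quoted above.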
The analysis of sources was conducted in various ways. One approach was to extract information from citation templates. Based on the relevant parameter in references of the English Wikipedia we found the most popular publishers (such as the United States Census Bureau, Oxford University Press, the BBC, and Cambridge University Press). The journals most commonly used in citation templates were: Nature, Astronomy and Astrophysics, Science, The Astrophysical Journal, Lloyd’s List, PLOS ONE, Monthly Notices of the Royal Astronomical Society, The Astronomical Journal, and Billboard. However, this approach was limited and did not cover references without citation templates. Therefore, we decided to use semantic databases to identify the sources and their types.
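Extracting a citation-template parameter can be sketched with a regular expression. This is a deliberately simplified illustration that counts only the `journal=` parameter of `{{cite journal}}`; real wikitext parsing must handle nested templates, parameter aliases, and other template families far more carefully:

```python
import re
from collections import Counter

# Simplified sketch: count values of the |journal= parameter in
# {{cite journal}} templates found in a piece of wikitext.
JOURNAL_RE = re.compile(
    r'\{\{\s*cite journal[^}]*?\|\s*journal\s*=\s*([^|}]+)',
    re.IGNORECASE,
)

def count_journals(wikitext):
    """Return a Counter of journal names used in citation templates."""
    return Counter(m.strip() for m in JOURNAL_RE.findall(wikitext))

sample = ("<ref>{{cite journal |title=X |journal=Nature |year=2001}}</ref>"
          "<ref>{{cite journal |journal=Nature |title=Y}}</ref>"
          "<ref>{{cite journal |journal=Science}}</ref>")
print(count_journals(sample))  # Counter({'Nature': 2, 'Science': 1})
```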
After obtaining data about the types of the sources we found that magazines and business-related sources are in the top 10 most reliable types of sources in all considered languages. However, the preferred type of source in references depends on the language version of Wikipedia. For example, film databases are among the most reliable sources in the Arabic, French, Italian, Polish, and Portuguese Wikipedia. In other languages such sources are placed below 19th place.
Including data from Wikidata and DBpedia allowed us to find the best sources in a specific area. Using information about the source types and selecting only periodicals, we found that some sources have stable reliability across all models: “Variety” always takes 1st place, “Entertainment Weekly” 2nd–3rd place, “The Washington Post” 2nd–4th place, and “USA Today” 4th–5th place, depending on the model. Despite the fact that “Lenta.ru” is the 6th most commonly used periodical source across different languages of Wikipedia (using the F model), it is placed in 21st and 19th place using the P and Pm models, respectively. “The Daily Telegraph” is in the top 10 most reliable periodical sources in all models. “People” is in 18th place in the frequency ranking but at the same time takes 4th place in the PmR model.
Using complex extraction of the references, in addition to data from February 2020 we also used dumps from November 2019, December 2019, and January 2020. Based on those data we measured the popularity and reliability of the sources in different months. After limiting the sources to periodicals we found that in the four considered months the top 10 most reliable periodical sources in multilingual Wikipedia always included: “Variety”, “Entertainment Weekly”, “The Washington Post”, “People”, “USA Today”, “The Indian Express”, “The Daily Telegraph”, “Pitchfork”, and “Time”. The minor changes in the ranking of sources during the considered period are mainly due to large margins in the absolute values of the popularity and reliability measurements.
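Finding the sources that stay in the top N across all considered months amounts to intersecting the monthly top-N sets. A small sketch with invented three-month rankings:

```python
def stable_top_sources(monthly_rankings, top_n=10):
    """Sources that stayed inside the top N in every considered month."""
    tops = [set(ranking[:top_n]) for ranking in monthly_rankings]
    return set.intersection(*tops)

# Illustrative rankings (best first), not real data
months = [
    ["Variety", "People", "Time", "USA Today"],
    ["Variety", "Time", "People", "Forbes"],
    ["Time", "Variety", "People", "USA Today"],
]
print(sorted(stable_top_sources(months, top_n=3)))  # ['People', 'Time', 'Variety']
```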
The different approaches to assessing the reliability of sources presented in this research contribute to a better understanding of which references are more suitable for specific statements describing subjects in a given language. A unified assessment of the sources can help in finding data of the best quality for cross-language data fusion. Tools such as DBpedia FlexiFusion or the GlobalFactSync Data Browser [41,42] collect information from Wikipedia articles in different languages and present statements in a unified form. However, due to the independence of the editing process in each language version, the same subjects can have similar statements with various values. For example, the population of a city in one language version can be several years old, while another language version of the article about the same city can update this value several times a year on a regular basis, along with information about the source. Therefore, we plan to create methods for assessing the sources of such conflicting statements in Wikipedia, Wikidata, and DBpedia in order to choose the best one. This can help to improve quality in cross-language data fusion approaches.
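One simple way the planned method could use the reliability scores is to keep, among conflicting cross-language values, the one backed by the most reliable source. The candidate format, source names, and scores below are all hypothetical, a sketch of the idea rather than the actual fusion procedure:

```python
def pick_best_statement(candidates, reliability):
    """Among conflicting cross-language statements, keep the value
    backed by the source with the highest reliability score.
    Missing sources default to 0.0; ties are not handled here."""
    return max(candidates, key=lambda c: reliability.get(c["source"], 0.0))

# Conflicting population values for the same city in two language editions
candidates = [
    {"lang": "en", "value": 1_790_658, "source": "census.gov"},
    {"lang": "de", "value": 1_620_000, "source": "example-blog.net"},
]
reliability = {"census.gov": 0.92, "example-blog.net": 0.11}
print(pick_best_statement(candidates, reliability)["value"])  # 1790658
```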
The proposed models can also help to assess the reliability of sources in Wikipedia on a regular basis. This can support understanding the preferences of the editors and readers of Wikipedia in a particular month. Additionally, it can be helpful to automatically detect sources with low reliability before a user inserts them into a Wikipedia article. Moreover, the results obtained using the proposed models may be used to suggest to Wikipedians sources with higher reliability scores in a selected language version or on a selected topic.

10.1. Effectiveness of Models

In this section we present the assessment of the models’ effectiveness. The Python algorithms prepared for the purposes of this study were tested on a desktop computer with an Intel Core i7-5820K CPU and an SSD drive. The algorithms used only one thread of the processor. Because each model uses its own set of measures, we divided the assessment into several stages, including the extraction of:
  • External links using the basic extraction method on compressed gzip dumps with a total volume of 12 GB: 0.28 milliseconds per article on average.
  • Sources from references using the complex extraction method on bzip2 dumps with a total volume of 64 GB: 2 milliseconds per article on average.
  • Text length of articles (as a number of characters) using compressed bzip2 dumps with a total volume of 64 GB: 0.68 milliseconds per article on average.
  • Total page views for the considered month using compressed bzip2 dumps with a total volume of 12 GB: 0.25 milliseconds per article on average.
  • Median of daily page views for the considered month using compressed bzip2 dumps with a total volume of 12 GB: 0.26 milliseconds per article on average.
  • Number of authors of articles using compressed bzip2 dumps with a total volume of 170 GB: 1.12 milliseconds per article on average.
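The first stage, basic extraction of external links, can be sketched as follows. This is a simplified illustration using a regular expression over a stream of text lines; the study's actual algorithms process full Wikimedia XML dumps (which could, for example, be opened with `bz2.open(path, 'rt')` to obtain such a stream):

```python
import io
import re
from urllib.parse import urlparse

# Rough URL matcher for wikitext; stops at whitespace and wiki markup
URL_RE = re.compile(r'https?://[^\s\]|<>"]+')

def extract_domains(lines):
    """Basic extraction: yield external-link domains from a stream of
    wikitext lines (e.g. a file object from bz2.open(..., 'rt'))."""
    for line in lines:
        for url in URL_RE.findall(line):
            yield urlparse(url).netloc.lower()

sample = io.StringIO(
    "Some text [https://www.bbc.co.uk/news article] and "
    "<ref>https://variety.com/2020/film/x</ref>\n"
)
print(list(extract_domains(sample)))  # ['www.bbc.co.uk', 'variety.com']
```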
Given the above, and the fact that the measures for each model can be calculated during conversion, the time the algorithm needs to calculate the popularity and reliability of the sources is as follows:
  • F model: 2 milliseconds per article.
  • P, PR model: 2.25 milliseconds per article.
  • Pm, PmR model: 2.28 milliseconds per article.
  • PL model: 2.93 milliseconds per article.
  • PmL model: 2.94 milliseconds per article.
  • A, AR model: 3.12 milliseconds per article.
  • AL model: 3.8 milliseconds per article.
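Most of the per-model times above can be reproduced by summing the per-stage costs a model depends on, since the measures are computed during a single conversion pass. The stage sets below are inferred from the measure names, not stated explicitly in the study, so treat them as an assumption:

```python
# Per-article cost of each extraction stage in milliseconds (Section 10.1)
STAGE_MS = {
    "sources_from_refs": 2.0,
    "text_length": 0.68,
    "page_views_total": 0.25,
    "authors": 1.12,
}

# Inferred (assumed) stage dependencies of a few of the models
MODEL_STAGES = {
    "F": ["sources_from_refs"],
    "P": ["sources_from_refs", "page_views_total"],
    "PL": ["sources_from_refs", "page_views_total", "text_length"],
    "A": ["sources_from_refs", "authors"],
    "AL": ["sources_from_refs", "authors", "text_length"],
}

def model_cost_ms(model):
    """Total per-article cost of a model as the sum of its stages."""
    return round(sum(STAGE_MS[s] for s in MODEL_STAGES[model]), 2)

for m in MODEL_STAGES:
    print(m, model_cost_ms(m))
```

These sums match the reported figures for the listed models (e.g. PL: 2 + 0.25 + 0.68 = 2.93 ms); the Pm-based models additionally use the median page-view stage.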

10.2. Limitations

Reliability, as one of the quality dimensions, is a subjective concept. Each person can have their own criteria for assessing the reliability of given sources, and each Wikipedia language community can therefore have its own definition of a reliable source. Only the English Wikipedia, as the most developed edition of this free encyclopedia, provides an extended list of reliable/unreliable sources [43]. However, it is not always followed: for example, despite the fact that IMDb (Internet Movie Database) is marked as ‘generally unreliable’, it is used very often (see Figure 4 or Table A1). As we observed, in some cases such sources can be used in references with some limitations: they can support some specific statements (but not all). Therefore, additional analysis of the placement of such sources in articles can help to find the limited areas in which they can be used.
In this study we proposed and used 10 models to assess the popularity and reliability of sources in Wikipedia. Each model uses some of the important measures related to content popularity and quality. However, there are other measures with the potential to improve the presented approach, so we plan to extend the set of measures used in the models. We also plan to analyze the possibility of comparing the results with other approaches or lists of sources, for example the most popular websites according to specialized tools, or sources considered reliable according to selected standards in some countries.
Each model has its own strengths and weaknesses. For example, during the experiments we observed that some articles had overstated page-view values in some languages in selected months; this can be deduced from other related measures of the article. Sources in such articles could receive extra points. However, these were individual cases that did not significantly affect the results of this work. In future work we plan to provide additional algorithms to automatically find and reduce such cases.
To extract the sources from references we used Wikipedia dumps, which are usually published as of the first day of each month. We therefore have information only for a specified timestamp of the articles and do not analyze on which day a source was inserted into (or deleted from) a Wikipedia article. If a source was inserted a few minutes (or seconds) before the dump creation process started, we count it as if it had been present throughout the last considered month. The effect on the model can be even more pronounced if such a source was deleted a few minutes (or seconds) after dump creation began. In other words, if a reference with a specified source was inserted and deleted around the timestamp of dump file creation, it can slightly or strongly (depending on the values of the article measures) distort the results of some of the models. A more detailed analysis of each edit of an article could therefore help to determine how long a particular reference was present in the article.
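The more detailed analysis suggested above could reconstruct presence intervals from an article's revision history. A sketch under an assumed input format, where each revision is reduced to a timestamp and the set of source domains referenced at that moment:

```python
from datetime import datetime

def presence_intervals(revisions, source):
    """Given (timestamp, set_of_sources) revision snapshots in
    chronological order, return the intervals during which `source`
    was present in the article. An open interval (end=None) means
    the source is still present in the latest revision."""
    intervals, start = [], None
    for ts, sources in revisions:
        if source in sources and start is None:
            start = ts
        elif source not in sources and start is not None:
            intervals.append((start, ts))
            start = None
    if start is not None:
        intervals.append((start, None))  # still present
    return intervals

# A reference inserted just before and removed just after a dump
# timestamp (1 February 2020) would be counted for the whole month:
revs = [
    (datetime(2020, 1, 31, 23, 58), {"bbc.co.uk"}),
    (datetime(2020, 1, 31, 23, 59), {"bbc.co.uk", "variety.com"}),
    (datetime(2020, 2, 1, 0, 2), {"bbc.co.uk"}),  # variety.com removed
]
print(presence_intervals(revs, "variety.com"))
```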

11. Conclusions and Future Work

In this paper we used basic and complex extraction methods to analyze over 200 million references in over 40 million articles from multilingual Wikipedia. We extracted information about the sources and unified them using special identifiers such as DOI, JSTOR, PMC, PMID, arXiv, ISBN, ISSN, OCLC, and others. Additionally, we used information about archive URLs and the templates included in the articles.
We proposed 10 models to assess the popularity and reliability of websites, news magazines, and other sources in Wikipedia. We also used DBpedia and Wikidata to automatically identify the alignment of the sources to a specific field. Additionally, we analyzed the differences in the popularity and reliability assessment of the sources between different periods, and we conducted an analysis of the growth leaders in each considered month. The results showed that, depending on the model and time, a source can change in different directions and with different strength (rise or fall). Finally, we compared the similarity of the rankings produced by the different models.
Some extended results on the reliability assessment of sources in Wikipedia are available in the BestRef project [44].
In addition to what has already been described in Section 10.2, in future work we plan to extend the popularity and reliability models. One direction is to take into account the position of an inserted reference within the article and within the list of references. We also plan to take into account features related to Wikipedia authors, such as reputation or the number of article watchers.
In this work we showed how it is possible to measure the growth of the popularity and reliability of sources based on differences in Wikipedia content over several recent months. In future research we plan to extend the time series to obtain more information about growth leaders in different years in each language version of Wikipedia.
Information about the reliability of sources can help to improve models for the quality assessment of Wikipedia articles. This can be especially useful for evaluating the sources of conflicting statements between language versions of Wikipedia in articles related to the same subject. Additionally, one promising direction for future work is to create methods for suggesting to Wikipedia authors reliable sources for selected topics and statements in individual languages of Wikipedia.

Author Contributions

Conceptualization, W.L. and K.W.; methodology, W.L.; software, W.L.; validation, W.L. and K.W.; formal analysis, K.W. and W.A.; investigation, W.L.; resources, W.A.; data curation, W.L.; writing–original draft preparation, W.L.; writing–review and editing, K.W. and W.A.; visualization, W.L.; supervision, K.W. and W.A.; project administration, K.W. and W.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Position in Local Rankings

Table A1. Position in the local rankings of the most popular and reliable sources in different language versions of Wikipedia in February 2020 using the PR model. Source: own calculations based on Wikimedia dumps using complex extraction of references. An extended version of the table is available on the web page: http://data.lewoniewski.info/sources/a1.
Source / Language Version of Wikipedia
ar, de, en, es, fa, fr, it, ja, nl, pl, pt, ru, sv, vi, zh
ad.nl416916663311,6636153108659712737319924003716113,152214212,739
adorocinema.com418917,03037311402-13,20417,889899016,592141215,00320,774575725,859
allocine.fr205139092921389012565176723231586148896351748184491
almaany.com323,568524927,303391459218,09821,354--10,374720992413,55232,987
appledaily.com.tw726024,734391731,35414,79443,41142,064840--410331,323-4262
cand.com.vn26,76880,00347,951--------75,342-318,821
deadline.com72112128115122051
dn.se23120731021742255765201131301223156121651882111091818
dwdl.de13865135919,6528051801271626,0425155457927,22132,027544811,97632,793
eiga.com271977454521609391920003130322,46415289262863546317433
elcinema.com123,353462838,2431744158525,52440,04512,26614,81735,23212,767734115,56326,656
expressen.se1392557300137982633894876097505545973883230111724
formulatv.com11211866795570532320259,42422,6955733248117125,33224,83732,378
hln.be20523577181717,37915,411147124,54855,13342069524117,06324,76340854307
ibge.gov.br-18,76113,2842115-19,876--7030-4427522,550290238,937
imdb.com24444713441248641513
infoescola.com14,81849,87217,542997-30,47611,193--7107544,20124,94555396575
irna.ir180666,843807220,057138,80366,34242,35017,815-16,45621,773-11,50317,543
kp.ru3177180987466253459241977933563548063413,0054591522361395
lenta.ru352325462930119248012547852363166134211578310676
lesinrocks.com19412308100416001399385960692301949738173074903238042401
mobot.org6862125,005433755211,2034969521010,80567342109537,40113,18693012,005
news.livedoor.com252931,8031628296711,697963213,4475-24,05710,329696528,94438898
news.mynavi.jp15225110139412,368426815,86516,9394-40,700388011,560718041045
nikkei.com319310966945571790385414022197740311524387012,83283664
oricon.co.jp2263606016768612134712606911312041115223
regeringen.se956612,561478921,114506568,510-64,85517,46833,056471125,8675301745,773
repubblica.it41320517326024031363118866234884540712211064466
research.amnh.org49,40049,86616,30413,141-28,28724,255-1410,29324,0653317-224,727
rottentomatoes.com16105918111950446971093014
scb.se336124877735188542800143916,38862123117591629312341739
skijumping.pl41,59458669,49316,664-25,73112,91962,18613,862351,61223,7635186-42,126
taz.de3959316485397785692382115,993191899613,190196822685773684
thefutoncritic.com13913019378741635233520405845825182
treccani.it33322327890234459128022292337523617868712809
trouw.nl93142869260242,7037579189933,55818,1855849113,87522,55724,77416,87027,600
tvbythenumbers.zap2it.com4522315336519119143751654912
tw.appledaily.com37,43723,16310,24553,429-58,799-1793-37,81061,70823,742-20015
universalis.fr5327335259046465812235180251275348712727101211,75418,729
variety.com1012335414137331944
vnexpress.net13,31018,184650458,2717212997239,942963919,41730,01828,70713,48612,17819857
volkskrant.nl2766918949687333451781377510,507212,1975107268716,051464415,292
web.archive.org43635212182412111718141057
who.int1113671352631322938286328610
Table A2. Position in local rankings of periodical sources in different language versions of Wikipedia in February 2020 using the PR model. Source: own work based on Wikimedia dumps using complex extraction of references with semantic databases (Wikidata, DBpedia) to identify the type of the source. An extended version of the table is available on the web page: http://data.lewoniewski.info/sources/a2.
Source / Position in Local Rankings in Language Versions of Wikipedia
ar, de, en, es, fa, fr, it, ja, nl, pl, pt, ru, sv, vi, zh
20 minutos17618618929587812656411946252333232262
Aftonbladet165715049701484546132111710139817081111127010-1369
Al-Ittihad10153029722731537--1397---1672--1290
Algemeen Dagblad387431917953731824381432184300538714202753
Aliqtisadi2-2022-669-23381138---1096---
Apple Daily562123364414068071768156173-15253081275-561
Auto, Motor und Sport11524535373-727376221275136-487585428-
China Press2227-14202356-1241-431--15672025-2548
DWDL.de16233611073471145270764362315111912963867671430
Dagens Industri133611339491682-2921572589-111462325312-1376
De Gelderlander92339710261774455628137010301045911041014-824637
De Morgen6822125085771572105934126254440830473355480
De Stentor13804181428--12232055194791333-18987956411696
De Volkskrant2931452835752212723304031669355299808373847
Die Tageszeitung37424144968013033253920611473223119766298
Donya-e-Eqtesad1272-266528052-2193----2833--1378
El Confidencial2432262193281579848525324383235321264190
El País21722440062051462634044808684260401309562
Ennahar newspaper5-2248-7271042---------
Entertainment Weekly862467713138461145
Exame7791474683610554117946679794911153191764185759
Expert1921085936117772913851501542749490205310-589926
Express Gazeta882941104513013021502119390779062117348-609824
Famitsu20041810503558101975570310-1599605693109362438
Finanztest22971404183621317049015651741006-1329909-316
Flight International3210237328445919701149871013725
Fokus5011538138012099599442054961-131676113156--
Folha de S. Paulo1191082652304621958306163481274871697-490834
Fortune293216452525543666412965103823
Gazeta do Povo14297451066385-104410111257--921156541123-
Helsingborgs Dagblad50580485796839171710495888537662793903154694
Het Laatste Nieuws214399430999836229109676231883619001162331341
Het Parool11493375505864274929333867166363017401116459538
Huffington Post France56953559940533463086012202403924515752531759
ISTOÉ8519971130668-1217919833112573289594956801106
Il Fatto Quotidiano3131262302115081474682765226353346636475663
Il Post5402075693326932183181536263299372436435440
Jeune Afrique392003422102124224463229425215364313276413
Komsomolskaya Pravda226187177418155120273131352523972350133140
la Repubblica6315455682291651104591731375962
La Tercera26948741776094991696953391311172511745379810
Le Figaro511285563321493527971784511343874704191009398
Le Monde1592443063001593246499248567325424639292351
Lenta.ru606714216292105166692242413911574398
Les Inrockuptibles2112872932331191129264222570283322547322228
NRC Next843344539111324868788410495707674837230445173
Nauka i Zhizn-25366104213711431506-104016352897-12051810
Nguoi Viet Daily News1126-1851--1064-------6858
Nihon Keizai Shimbun3221692065038142317712082791573686989014
Nikkei Business240913148982079-274712718--14102597-750306
Nishinippon Shimbun-129220921248-326617867---1576-1160115
O Estado de São Paulo89715861020590-61111441200-135251385728242829
PC Gamer515120301031555190172612531412
PC Games17858635936-10239081685563306849280441425534
Panorama565726506534885341101256-336734607-351425
People2556121381326146164193618
Pitchfork117287204015253736262028252655
Populär Historia129967118442420438251423021113-99818612187-1096
Rolling Stone762110131620152711142125283044
Rolling Stone Brasil1656231070969510638536379497561157103241018747905
Sai Gon Giai Phong71425092084-443184022171680-17978301535-3712
Sport Express3672953444903993133441794441004585626382514
Superinteressante17993132804991-7541720608-14686--446-
Svenska Dagbladet409199714951596-27897951158--8912379-414
Sydsvenskan495385566818705947004305475143055981782895
TV Guide6144175463465010359328391843854
TV Sorrisi e Canzoni1533866186971245369681805915126323357301312
TechCrunch71711914142792916122324511
Teknikens Värld-4081366-18114691357-11131607-11765-545
The Atlantic21251225724463231372442332035
The Daily Telegraph14128188161417189151617917
The Indian Express28845135992153901566187901474057
The New York Times15271416182238233438233821133
The Washington Post31331949182428252229181220
Time41191031120182210141513710
Tokyo Sports731198127040410205995272-8327691006-18119
Trouw7043345431687444279134658745267521053116310331276
USA Today69485101914151211208107
Variety111112248113422
Veja3565584421994793788168669695502619621394755
VnExpress92010189772021429745472371982127011717686711633
Vokrug sveta1906118313782121-8671055901865220-9-372687
Weekly Playboy1159-1581549-203613125-1344-1499-28931
Wired920131411189637131821291115
World Journal--714908--1307190-----10964
Wprost7416329458559081281127866598029305441004439795
Yomiuri Shimbun2731010372911563592565350113688281055-36743
¡Hola!18120418552891289122978911057207124273331

References

  1. Wikipedia Meta-Wiki. List of Wikipedias. Available online: https://meta.wikimedia.org/wiki/List_of_Wikipedias (accessed on 30 March 2020).
  2. English Wikipedia. Reliable Sources. Available online: https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources (accessed on 30 March 2020).
  3. Internet Live Stats. Total Number of Websites. Available online: https://www.internetlivestats.com/total-number-of-websites/ (accessed on 30 March 2020).
  4. Eysenbach, G.; Powell, J.; Kuss, O.; Sa, E.R. Empirical studies assessing the quality of health information for consumers on the world wide web: A systematic review. JAMA 2002, 287, 2691–2700. [Google Scholar] [CrossRef] [PubMed]
  5. Price, R.; Shanks, G. A semiotic information quality framework: Development and comparative analysis. In Enacting Research Methods in Information Systems; Springer: Berlin, Germany, 2016; pp. 219–250. [Google Scholar]
  6. Xu, J.; Benbasat, I.; Cenfetelli, R.T. Integrating service quality with system and information quality: An empirical test in the e-service context. MIS Q. 2013, 37, 777–794. [Google Scholar] [CrossRef]
  7. Nielsen, F.Å. Scientific citations in Wikipedia. arXiv 2007, arXiv:0705.2106. [Google Scholar] [CrossRef]
  8. Lewoniewski, W.; Węcel, K.; Abramowicz, W. Analysis of references across Wikipedia languages. In Proceedings of the International Conference on Information and Software Technologies, Druskininkai, Lithuania, 12–14 October 2017; pp. 561–573. [Google Scholar]
  9. Characterizing Wikipedia Citation Usage. Analyzing Reading Sessions. Available online: https://meta.wikimedia.org/wiki/Research:Characterizing_Wikipedia_Citation_Usage/Analyzing_Reading_Sessions (accessed on 29 February 2020).
  10. Jemielniak, D.; Masukume, G.; Wilamowski, M. The most influential medical journals according to Wikipedia: Quantitative analysis. J. Med. Internet Res. 2019, 21, e11429. [Google Scholar] [CrossRef] [PubMed]
  11. Stvilia, B.; Twidale, M.B.; Smith, L.C.; Gasser, L. Assessing information quality of a community-based encyclopedia. Proc. ICIQ 2005, 5, 442–454. [Google Scholar]
  12. Blumenstock, J.E. Size matters: Word count as a measure of quality on Wikipedia. In Proceedings of the 17th International Conference on World Wide Web, Beijing, China, 21–25 April 2008; pp. 1095–1096. [Google Scholar]
  13. Lucassen, T.; Schraagen, J.M. Trust in wikipedia: How users trust information from an unknown source. In Proceedings of the 4th Workshop on Information Credibility, Raleigh, NC, USA, 27 April 2010; pp. 19–26. [Google Scholar]
  14. Yaari, E.; Baruchson-Arbib, S.; Bar-Ilan, J. Information quality assessment of community generated content: A user study of Wikipedia. J. Inf. Sci. 2011, 37, 487–498. [Google Scholar] [CrossRef]
  15. Conti, R.; Marzini, E.; Spognardi, A.; Matteucci, I.; Mori, P.; Petrocchi, M. Maturity assessment of Wikipedia medical articles. In Proceedings of the 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), New York, NY, USA, 27–29 May 2014; pp. 281–286. [Google Scholar]
  16. Piccardi, T.; Redi, M.; Colavizza, G.; West, R. Quantifying Engagement with Citations on Wikipedia. arXiv 2020, arXiv:2001.08614. [Google Scholar]
  17. Nielsen, F.Å.; Mietchen, D.; Willighagen, E. Scholia, scientometrics and wikidata. In Proceedings of the European Semantic Web Conference, Portorož, Slovenia, 28 May–1 June 2017; pp. 237–259. [Google Scholar]
  18. Teplitskiy, M.; Lu, G.; Duede, E. Amplifying the impact of open access: Wikipedia and the diffusion of science. J. Assoc. Inf. Sci. Technol. 2017, 68, 2116–2127. [Google Scholar] [CrossRef]
  19. Fetahu, B.; Markert, K.; Nejdl, W.; Anand, A. Finding news citations for wikipedia. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, Indianapolis, IN, USA, 24–28 October 2016; pp. 337–346. [Google Scholar]
  20. Ferschke, O.; Gurevych, I.; Rittberger, M. FlawFinder: A Modular System for Predicting Quality Flaws in Wikipedia. In Proceedings of the CLEF (Online Working Notes/Labs/Workshop), Rome, Italy, 17–20 September 2012; pp. 1–10. [Google Scholar]
  21. Flekova, L.; Ferschke, O.; Gurevych, I. What makes a good biography?: Multidimensional quality analysis based on wikipedia article feedback data. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Korea, 7–11 April 2014; pp. 855–866. [Google Scholar]
  22. Shen, A.; Qi, J.; Baldwin, T. A Hybrid Model for Quality Assessment of Wikipedia Articles. In Proceedings of the Australasian Language Technology Association Workshop, Brisbane, Australia, 6–8 December 2017; pp. 43–52. [Google Scholar]
  23. Di Sciascio, C.; Strohmaier, D.; Errecalde, M.; Veas, E. WikiLyzer: Interactive information quality assessment in Wikipedia. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, Limassol, Cyprus, 13–16 March 2017; pp. 377–388. [Google Scholar]
  24. Dang, Q.V.; Ignat, C.L. Measuring Quality of Collaboratively Edited Documents: The Case of Wikipedia. In Proceedings of the 2016 IEEE 2nd International Conference on Collaboration and Internet Computing (CIC), Pittsburgh, PA, USA, 31 October–3 November 2016; pp. 266–275. [Google Scholar]
  25. Lewoniewski, W.; Węcel, K.; Abramowicz, W. Relative Quality and Popularity Evaluation of Multilingual Wikipedia Articles. Informatics 2017, 4, 43. [Google Scholar] [CrossRef]
  26. Lewoniewski, W.; Węcel, K.; Abramowicz, W. Multilingual Ranking of Wikipedia Articles with Quality and Popularity Assessment in Different Topics. Computers 2019, 8, 60. [Google Scholar] [CrossRef]
  27. Warncke-wang, M.; Cosley, D.; Riedl, J. Tell Me More: An Actionable Quality Model for Wikipedia. In Proceedings of the WikiSym 2013, Hong Kong, China, 5–7 August 2013; pp. 1–10. [Google Scholar]
  28. Lih, A. Wikipedia as Participatory Journalism: Reliable Sources? Metrics for evaluating collaborative media as a news resource. In Proceedings of the 5th International Symposium on Online Journalism, Austin, TX, USA, 16–17 April 2004; p. 31. [Google Scholar]
  29. Liu, J.; Ram, S. Using big data and network analysis to understand Wikipedia article quality. Data Knowl. Eng. 2018, 115, 80–93. [Google Scholar] [CrossRef]
  30. Wilkinson, D.M.; Huberman, B.A. Cooperation and quality in wikipedia. In Proceedings of the 2007 International Symposium on Wikis WikiSym 07, Montreal, QC, Canada, 21–23 October 2007; pp. 157–164. [Google Scholar] [CrossRef]
  31. Kane, G.C. A multimethod study of information quality in wiki collaboration. ACM Trans. Manag. Inf. Syst. (TMIS) 2011, 2, 4. [Google Scholar] [CrossRef]
  32. WikiTop. Wikipedians Top. Available online: http://wikitop.org/ (accessed on 30 March 2020).
  33. Lewoniewski, W. The Method of Comparing and Enriching Information in Multlingual Wikis Based on the Analysis of Their Quality. Ph.D. Thesis, Poznań University of Economics and Business, Poznań, Poland, 2018. [Google Scholar]
  34. Lerner, J.; Lomi, A. Knowledge categorization affects popularity and quality of Wikipedia articles. PLoS ONE 2018, 13, e0190674. [Google Scholar] [CrossRef] [PubMed]
  35. Wikimedia Downloads. English Wikipedia Latest Database Backup Dumps. Available online: https://dumps.wikimedia.org/enwiki/latest/ (accessed on 30 March 2020).
  36. English Wikipedia. 2019–2020 Coronavirus Pandemic. Available online: https://en.wikipedia.org/wiki/2019%E2%80%9320_coronavirus_pandemic (accessed on 30 March 2020).
  37. Vrandečić, D.; Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 2014, 57, 78–85. [Google Scholar] [CrossRef]
  38. Wikidata. Available online: https://www.wikidata.org/wiki/Wikidata:Main_Page (accessed on 23 April 2020).
  39. Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; Ives, Z. DBpedia: A Nucleus for a Web of Open Data. In The Semantic Web; Aberer, K., Choi, K.S., Noy, N., Allemang, D., Lee, K.I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 722–735. [Google Scholar]
  40. DBpedia. Available online: https://wiki.dbpedia.org/ (accessed on 23 April 2020).
  41. Frey, J.; Hofer, M.; Obraczka, D.; Lehmann, J.; Hellmann, S. DBpedia FlexiFusion the Best of Wikipedia> Wikidata> Your Data. In Proceedings of the International Semantic Web Conference, Auckland, New Zealand, 26–30 October 2019; pp. 96–112. [Google Scholar]
  42. GFS Data Browser. Available online: https://global.dbpedia.org (accessed on 23 April 2020).
  43. English Wikipedia. Perennial Sources. Available online: https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources/Perennial_sources (accessed on 30 March 2020).
  44. BestRef. Popular and Reliable Sources of Wikipedia. Available online: https://bestref.net (accessed on 30 March 2020).
