1. Introduction
Currently, the mass media exert a powerful influence on modern society due to their close association with the so-called “mediated culture”. This modern phenomenon exposes people’s opinions and expectations through the vast quantity of documents created in social networks and other discussion forums. Monitoring and comprehending the discussed topics is an essential element of any trustworthy modern multimedia system. Therefore, the study of social network chatter is a useful forecasting tool in many areas, such as election results [1,2], criminal activity [3,4], and social events [5,6,7]. This kind of analysis summarizes the feelings and aims revealed in social networks; it is generally an outcome of collective creativity faithfully expressed in informal language, rather than of the texts’ formal linguistic content.
On the other hand, the old-style media (television, radio, and newspapers) are one-way information systems conveying most of their material in a sufficiently official language. Systems of this kind (whether written, broadcast, or conversational) predictably track the day’s events and often express their own points of view. Newspapers, being the de facto source of all traditional media, are bundles of connected articles whose content is gathered by a small group of people frequently sharing a common bias, shaped by political, economic, and social changes in society, and exploited by dominant elites as propaganda resources.
Propagandistic purposes can be achieved by adapting the language of electronic and print media to the characteristics of the target audience, taking into account the level of education, political preferences, customs and traditions, gender, and linguistic (dialectal) features. For these reasons, it is natural to suppose that changes in media language content may expose changes in the social state. As was mentioned in [8]: “The lack of a genuine will to liberate state media in the so called Arab Spring countries from the grip of political power was, and is still a major handicap. In the immediate aftermath of the uprisings, state media journalists managed to challenge the reverential, uniform style and content they’d had to adopt for decades, but only very briefly. They quickly reverted to old practices in the face of raging political battles and attempts by current regimes to again use state media as a means to justify and legitimize their policies.”
English-language media are an essential object of study. However, at least 422 million people in 57 countries use Arabic as their primary language, and 250 million use it as their second one. Indeed, Arabic is the fifth most widely spoken language in the world. Grammatically and morphologically, Arabic is more complicated than European languages. The analysis of traditional Arabic media is attractive from this standpoint, as is the aspiration to study the conventional Arabic press as a reflection of governments’ opinions and expectations.
In this article, a new method for online recognition of meaningful changes in the social state is suggested. The detected transformations of the linguistic content of Arabic newspapers play the role of an indicator. The feasibility of such modeling was demonstrated in [9]. That methodology represents each issue as a frequency vector of appropriately chosen character N-grams, so that the writing-style evolution of a newspaper is depicted by the mean correlation between the current issue and several of its forerunners. Change points of the time series constructed in this way indicate possible changes in the linguistic content. However, the obtained data sequence is quite noisy and requires further smoothing, which is done via a wavelet transform and a clustering procedure across all given issuing periods. Thus, this process involves information about “future published issues”.
The approach proposed here acts in an “online” fashion. After a preprocessing stage, the words in the issues’ texts are replaced by vectors obtained with a word embedding methodology. The collection of all similarities (in our application, the well-known “$cos$-similarities”) of appropriately chosen adjacent vectors typifies the consistent linguistic template of the issue, and a change in the distributions of these issue-based samples indicates a difference in the underlying newspaper template.
To implement this concept, a two-step procedure is proposed. The first step evaluates the current issue’s sample distribution against the union of those corresponding to several of the issue’s predecessors. Aiming to stabilize this inherently imbalanced procedure, we use a repeated undersampling approach accompanied by a two-sample test. In the second step, the normalized entropy of the returned p-values is calculated to form a time series representing the newspaper. Notice that in the case of stable style behavior, the entropy fluctuates near its maximum value, and each significant descent is probably due to a difference in the language content between the issue and its predecessors.
The rest of the paper is organized in the following way.
Section 2 is devoted to the necessary background.
Section 3 presents the novel evolutionary model of the publishing process.
Section 4 is devoted to the experimental study results.
Section 5 consists of discussion and conclusions.
2. Background Knowledge
Many traditional techniques in the text mining area are associated with vector representations (e.g., bag-of-words) of a text as vectors of term occurrences. It is well known that this methodology loses semantic information about word sequencing, disregarding the order of words and their mutual appearances. Another common approach, Markov N-grams, takes the word arrangement in short sequences into account, but leads to sparse representations. Deep learning embedding systems provide an innovative treatment of the sparsity problem by representing words as real-valued vectors such that close vectors correspond to words with comparable meaning. The conceptual leap is to connect each attribute not to a single dimension, but to a whole dense vector, which makes it possible to translate a given text into a matrix with semantically learned columns. Hence, word embedding refers to a suite of language modeling methods representing the words of a given vocabulary in numeric vector spaces, usually of high dimensionality. This valuable technique for natural language processing preserves the essential semantic and syntactic information of terms, improving performance in many natural language processing tasks. Popular off-the-shelf word embedding models in use today are:
- Word2Vec (by Google) [10]
- FastText (by Facebook) [12]
The purpose of embedding is to grant close spatial positions to words with a comparable context. Formally, the “$cos$-similarity”, calculated as the cosine of the angle between such vectors, has to be close to one. The selection of the dimensionality, commonly denoted as d, of the constructed vector representation is a well-known open problem connected to choosing an appropriate suboptimal solution. Apparently, $d=300$ is the most frequently used value.
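For illustration, the $cos$-similarity between two embedding vectors can be computed as follows. This is a minimal sketch with NumPy; the $d=300$ vectors below are random stand-ins, not vectors from a trained embedding model:

```python
import numpy as np

def cos_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Illustrative 300-dimensional vectors (random stand-ins for real embeddings).
rng = np.random.default_rng(0)
u = rng.normal(size=300)
v = u + rng.normal(scale=0.1, size=300)   # a slightly perturbed "similar" word
w = rng.normal(size=300)                  # an unrelated word

print(cos_similarity(u, v))  # close to 1 for words with comparable context
print(cos_similarity(u, w))  # near 0 for unrelated vectors
```

In high dimensions, two independent random vectors are nearly orthogonal, so their $cos$-similarity concentrates around zero, while a small perturbation keeps the similarity close to one.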
Hypothesis testing (see, e.g., [14]) is another tool involved in the model. This methodology is frequently employed in various branches of machine learning and data mining due to its numerous successful applications (see, e.g., [15]). Two-sample hypothesis testing is an approach designed to examine whether two samples of independent random values have the same probability distribution function. In this case, the null hypothesis $\left({H}_{0}\right)$ expresses the “no difference” assumption, and the alternative hypothesis $\left({H}_{1}\right)$ claims that there is “a difference in the underlying distributions”. Formally speaking, let $X={X}_{1},{X}_{2},\dots ,{X}_{m}$ and $Y={Y}_{1},{Y}_{2},\dots ,{Y}_{n}$ be two independent samples with unknown distribution functions F and G. A two-sample problem tests the null hypothesis:

${H}_{0}:F=G$

against the alternative:

${H}_{1}:F\ne G.$
A test is typically carried out using a statistic calculated from the drawn samples together with its p-value, which is the probability of observing data at least as extreme as the actual instances if the null hypothesis is correct. Therefore, the p-value can be understood as a random variable generated from the drawn samples. The standard application of the probability integral transform to the test statistic under the null hypothesis distribution allows concluding that the calculated p-value is uniformly distributed in $[0,1]$ if the null hypothesis is correct.
Arguably, the most useful and general two-sample test is the Kolmogorov–Smirnov test (the $KS$-test) [16,17]. This nonparametric procedure works with continuous, one-dimensional probability distributions and uses the statistic:

${D}_{m,n}=\underset{x}{sup}\left|\tilde{F}\left(x\right)-\tilde{G}\left(x\right)\right|,$

evaluating the distance between the empirical distribution functions $\tilde{F}\left(x\right)$ and $\tilde{G}\left(x\right)$ of the samples. For sufficiently large samples, the distribution of this statistic does not depend on the underlying distributions of the data.
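As a concrete illustration, SciPy’s `ks_2samp` implements this two-sample statistic and returns the associated p-value. The sketch below uses synthetic data; the sample sizes and distributions are arbitrary choices of ours:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=500)        # sample from F
y_same = rng.normal(loc=0.0, scale=1.0, size=500)   # same distribution as F
y_diff = rng.normal(loc=1.0, scale=1.0, size=500)   # shifted distribution

stat_same, p_same = ks_2samp(x, y_same)
stat_diff, p_diff = ks_2samp(x, y_diff)

print(p_same)  # typically large: no evidence against the null hypothesis
print(p_diff)  # very small: the null hypothesis is rejected
```

Under the null hypothesis, repeating this experiment with fresh samples yields p-values scattered uniformly over $[0,1]$, which is exactly the property the entropic construction of Section 3 relies on.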
3. Evolutionary Model of the Publishing Process
This section presents a dynamic model of the daily publishing process based on the paradigm of the human writing process. A common view (see, for example, [18]) of this process identifies four central parts: planning, drafting, editing, and writing the final draft. Consequently, this perception naturally suggests an inherent linguistic and semantic dependence of the current text segment on formerly written ones. This presumption is also, to some extent, correct for sequential newspaper issues. Nevertheless, a book’s content appears much smoother, since it passes through an iterative editing exercise usually less feasible for daily media. The regular publishing process is more like an industrial system functioning under the influence of the opinions of a group of authors and their writing styles.
The proposed model handles a time series constructed to reflect the semantic behavior of dailies. In our view, anomalies surfacing in such a series are mainly caused by significant changes in the social state. Let us consider a series of m sequential issues:

$\mathbb{D}=\left\{{D}_{1},{D}_{2},\dots ,{D}_{m}\right\}$

of a newspaper and take a vocabulary of terms $\mathbb{V}$ with an embedding matrix $\mathbf{Emb}\in {\mathbb{R}}^{\left|\mathbb{V}\right|\times d}$ associating the words from the vocabulary $\mathbb{V}$ with d-dimensional vectors, where the columns correspond to the word embeddings. The semantic evolution of $\mathbb{D}$ is modeled in our approach as follows.
Embedding step: At the outset, each word $w\in \mathbb{V}$ in all issues is transformed into a d-dimensional vector according to the appropriate column of the matrix $\mathbf{Emb}$. Subsequently, each issue ${D}_{i},\ i=1,\dots ,m$ is converted into a two-dimensional array ${M}_{i},\ i=1,\dots ,m$ by sequentially concatenating the corresponding vector representations of its words.
Semantic issues’ pattern: As was mentioned, the intention of a word embedding procedure is to map words with similar contexts to spatially close vectors. A similarity measure between real-valued vectors (such as the cosine similarity or the Euclidean distance) provides an accepted tool to quantify the words’ semantic relationship. As an issue is composed of words, the similarity between words can typify its semantic structure. The Euclidean distance takes the vector magnitude into account, while the cosine similarity depends only on the angle between the vectors. Therefore, the latter measure is more robust to changes in the frequencies of semantically similar words, whereas the magnitude is sensitive to occurrences and neighborhood diversity. For this reason, the vectors of “semantically similar” terms produced, for example, by the Word2Vec procedure may be close in cosine similarity, but still have a sizeable Euclidean distance between them.
From this point of view, the $cos$-similarity between adjacent words can represent the desired semantic configuration. Accordingly, each issue ${D}_{i},\ i=1,\dots ,m$ is represented by its pattern distribution ${G}_{i,l,h},\ i=1,\dots ,m$, being the distribution of the $cos$-similarities calculated within a window of size l sliding in increments of h over the columns of the corresponding matrix ${M}_{i}$.
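One plausible reading of this construction can be sketched as follows: the issue matrix holds one embedding column per word, and within each window of l consecutive columns (advanced by h) the $cos$-similarities of adjacent columns are collected. The dimensions, window size, and increment below are arbitrary illustrative values:

```python
import numpy as np

def pattern_distribution(M, l, h):
    """Collect cos-similarities of adjacent word vectors within
    windows of size l sliding with increment h over the columns of M."""
    d, n = M.shape
    sims = []
    for start in range(0, n - l + 1, h):
        window = M[:, start:start + l]
        for j in range(l - 1):
            u, v = window[:, j], window[:, j + 1]
            sims.append(float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v))))
    return np.array(sims)

# Toy issue: 50 "words" embedded in d = 8 dimensions (random stand-ins).
rng = np.random.default_rng(1)
M = rng.normal(size=(8, 50))
G = pattern_distribution(M, l=10, h=5)   # the issue's pattern sample
```

The returned array plays the role of the sample from the pattern distribution $G_{i,l,h}$ of one issue.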
Semantic change model: The collection of the pattern distributions ${G}_{i,l,h}$ of the $cos$-similarities found for neighboring terms represents the daily under consideration. For a given issue ${D}_{i}$, introduce:

${\mathbb{G}}_{T,i,l,h}=\bigcup _{j=i-T}^{i-1}{G}_{j,l,h},$

where T denotes the number of “precursors” of ${D}_{i}$ involved in the assessment. According to the model under consideration, a semantic change does not occur if the two collections ${\mathbb{G}}_{T,i,l,h}$ and ${G}_{i,l,h}$ are identically distributed, i.e., the connection between contiguous words is kept. Thus, the model relies on classical hypothesis testing to determine whether two underlying distributions are equal. An essential tool to verify this is a two-sample test intended to check whether two samples are drawn from the same population. However, we compare two sets of substantially different sizes. Obviously, the size of ${\mathbb{G}}_{T,i,l,h}$ is expected to be approximately T times larger than that of ${G}_{i,l,h}$. To overcome this problem, the multiple testing methodology can be applied. Specifically, let us select a natural number N and a two-sample test H returning, among its output parameters, the resulting p-value. The multiple-test procedure consists of the steps presented in Algorithm 1.
Algorithm 1 Constructing representative sets
Input: ${G}_{i,l,h},{\mathbb{G}}_{T,i,l,h}$: two sets to be tested $\left(\left|{\mathbb{G}}_{T,i,l,h}\right|\ge \left|{G}_{i,l,h}\right|\right)$; H: two-sample test; ${N}_{size}$: sample size; N: number of samples drawn.

Procedure:
$counter=0$
Repeat N times:
$\left(a\right)$ $counter=counter+1$
$\left(b\right)$ Draw a random sample ${S}_{temp}$ without replacement from ${\mathbb{G}}_{T,i,l,h}$ with size ${N}_{size}$.
$\left(c\right)$ Apply H to ${G}_{i,l,h}$ and ${S}_{temp}$, and store the obtained p-value as $p\left[counter\right]$.
Return the array p
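Under the stated assumptions, Algorithm 1 can be sketched in Python with the KS-test playing the role of H. The function and variable names, as well as all sizes below, are illustrative choices of ours:

```python
import numpy as np
from scipy.stats import ks_2samp

def representative_pvalues(G_i, G_pool, n_size, n_draws, seed=None):
    """Algorithm 1 sketch: repeatedly undersample the pooled precursor set
    G_pool and test each subsample against the current-issue sample G_i."""
    rng = np.random.default_rng(seed)
    p = np.empty(n_draws)
    for k in range(n_draws):
        # Draw a random sample without replacement from the larger set.
        s_temp = rng.choice(G_pool, size=n_size, replace=False)
        # Apply the two-sample test H (here: Kolmogorov-Smirnov).
        p[k] = ks_2samp(G_i, s_temp).pvalue
    return p

# Toy data: a current-issue sample vs. a pool of "precursor" similarities.
rng = np.random.default_rng(7)
G_i = rng.normal(size=400)
G_pool = rng.normal(size=2000)    # same distribution: p-values ~ uniform
p = representative_pvalues(G_i, G_pool, n_size=400, n_draws=50, seed=7)
```

The repeated undersampling equalizes the sizes of the compared samples, which stabilizes the test on the inherently imbalanced pair of sets.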
Thus, the stylistic relationship between the issue ${D}_{i}\ (i>T)$ and its T “precursors” $\left\{{D}_{j},\ j=i-T,\dots ,i-1\right\}$ is represented as a set:

${\mathbb{Z}}_{i,T,N,V}=\left\{{p}_{1},{p}_{2},\dots ,{p}_{N}\right\}$

of the calculated p-values.
Entropic time series: In our approach, the null hypothesis about the equality of the primary distributions of ${\mathbb{G}}_{T,i,l,h}$ and ${G}_{i,l,h}$ is tested against the alternative hypothesis stating that the distributions differ. As was mentioned in Section 2, the distribution of the p-values found here is uniform in $[0,1]$ if the null hypothesis is true. As a result, the stationary behavior of the semantic sets $\left\{{\mathbb{Z}}_{i,T,N,V},\ i=1,\dots ,m\right\}$ is characterized by their sufficiently high entropy, and a reduction of the entropy value indicates a change in the language content and, probably, a change in the social state. Consecutive assessment of such entropy values leads to the following time series (one-dimensional signal):

$s\left(i\right)=\frac{Ent\left({\mathbb{Z}}_{i,T,N,V}\right)}{En{t}_{max}},$

exposing the semantic evolution of the newspaper. Normalization by the maximal entropy value $En{t}_{max}$ is provided, aiming to standardize the entropy behavior for different sample sizes. Therefore, a stable signal corresponds to steady linguistic content of the newspaper, and its acute falls indicate changes in it.
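One way to compute such a normalized entropy is sketched below, under the assumption that the entropy of the p-value set is estimated from a histogram over equal-width bins of $[0,1]$ and normalized by the maximal entropy $log_{2}$ of the bin count; the bin count is an arbitrary choice of ours:

```python
import numpy as np

def normalized_entropy(pvals, bins=10):
    """Histogram-based entropy of a p-value sample, normalized to [0, 1]
    by the maximal entropy log2(bins) of the uniform distribution."""
    hist, _ = np.histogram(pvals, bins=bins, range=(0.0, 1.0))
    probs = hist / hist.sum()
    probs = probs[probs > 0]                 # 0 * log(0) is taken as 0
    ent = -np.sum(probs * np.log2(probs))
    return float(ent / np.log2(bins))

rng = np.random.default_rng(3)
s_stable = normalized_entropy(rng.uniform(size=10_000))  # near 1: H0 holds
s_change = normalized_entropy(np.full(10_000, 0.001))    # 0: p-values collapse
```

Uniformly scattered p-values (stable linguistic content) yield a value near one, while p-values collapsing toward zero (a detected difference) drive the normalized entropy sharply down.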
Anomaly detection: Anomalous falls of the signal may specify the desired change points, and therefore, an anomaly detection method is an essential tool for this purpose. In this paper, the standard modified Thompson Tau test [19] is applied. This well-known method is intended to recognize outliers in a set. The approach supplies a statistically justified rejection zone to decide whether a data point is an outlier, based on the data’s standard deviation and average. The method works as follows. Let X be a vector of size n. Denote by $\overline{X}$ the average of X and by $\sigma \left(X\right)$ the standard deviation of X. The rejection threshold is determined using the formula:

$rej=\frac{{t}_{\alpha /2,n-2}\left(n-1\right)}{\sqrt{n}\sqrt{n-2+{t}_{\alpha /2,n-2}^{2}}},$

where ${t}_{\alpha /2,n-2}$ is the critical value of the Student distribution with significance level $\alpha $ and $n-2$ degrees of freedom. For each data point $x\in X$, the value:

$\delta \left(x\right)=\frac{\left|x-\overline{X}\right|}{\sigma \left(X\right)}$

is calculated, and then, if $\delta \left(x\right)>rej$, the data point is recognized as an outlier; otherwise (if $\delta \left(x\right)\le rej$), it is not considered an outlier.
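The rejection rule above can be sketched directly; the signal values below are illustrative, and SciPy supplies the Student critical value:

```python
import numpy as np
from scipy.stats import t as student_t

def is_outlier_thompson_tau(x, X, alpha=0.05):
    """Modified Thompson Tau test: flag x as an outlier of the sample X."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Critical value of the Student distribution with n - 2 degrees of freedom.
    t_crit = student_t.ppf(1.0 - alpha / 2.0, df=n - 2)
    # Rejection threshold rej = t * (n - 1) / (sqrt(n) * sqrt(n - 2 + t^2)).
    rej = t_crit * (n - 1) / (np.sqrt(n) * np.sqrt(n - 2 + t_crit ** 2))
    # Standardized deviation of the tested point.
    delta = abs(x - X.mean()) / X.std(ddof=1)
    return bool(delta > rej)

signal = [0.97, 0.98, 0.96, 0.99, 0.97, 0.60]   # entropy values; last one drops
print(is_outlier_thompson_tau(signal[-1], signal))  # True: anomalous fall
print(is_outlier_thompson_tau(signal[0], signal))   # False: within rejection zone
```

In the model, x is the current entropy value $s\left(i\right)$ and X is the lagged window of the entropic time series, so a flagged outlier marks a candidate change point.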
The described procedure is applied in our model in the following manner. Positioned at the index i corresponding to the current issue ${D}_{i}$, we construct a sequence ${s}_{i-L+j},\ j=1,\dots ,L$ containing the $L-1$ “precursors” of $s\left(i\right)$ and this value itself. In the next step, the standard modified Thompson Tau test checks whether $s\left(i\right)$ is an outlier in the constructed series.
It should be noted that the proposed method, in fact, works in a real-time mode since, at each current moment, it uses only information about previously published issues of the newspaper. The following Algorithm 2 presents this “online” version.
Algorithm 2 Online detection of semantic changes
Input: $\mathbb{V}$: a vocabulary of terms; $\mathbf{Emb}\in {\mathbb{R}}^{\left|\mathbb{V}\right|\times d}$: terms’ embedding matrix; l: size of the sliding window; h: sliding increment; T: number of “precursors” compared with the current issue; H: two-sample test; ${N}_{size}$: sample size; N: number of resamplings; α: significance level of the modified Thompson Tau test; L: lag parameter in the anomaly testing procedure.

Procedure:
1. Get the current issue ${D}_{i},\ i>{T}_{0}=max(T,L)$.
2. Construct the embedding matrices ${M}_{j}$ of ${D}_{j},\ j=i-{T}_{0},\dots ,i$, using the underlying embedding matrix $\mathbf{Emb}$.
3. Construct the $cos$-similarity collections ${G}_{j,l,h}$ between the columns of the matrices ${M}_{j},\ j=i-{T}_{0},\dots ,i$, using a sliding window with parameters $(l,h)$.
4. Apply Algorithm 1 with the parameters $H,N$, and ${N}_{size}$ to the sets ${G}_{j,l,h},{\mathbb{G}}_{T,j,l,h}$, and get the representative sets ${\mathbb{Z}}_{j,T,N,V},\ j=i-{T}_{0},\dots ,i$.
5. Construct the time series of normalized entropy values: ${\mathbb{S}}_{i}=\left\{{S}_{j},\ j=i-{T}_{0},\dots ,i\right\}$.
6. Check, using the modified Thompson Tau test with significance level α, whether ${S}_{i}$ is an outlier in ${\mathbb{S}}_{i}$.
7. If the next issue of the newspaper is available, then go to Step 1; otherwise, stop.
