3.3. Linguistic Cues
We began by analyzing the CEO’s letters to check whether they exhibit the linguistic features that the literature reports as characteristic of deceptive texts: (1) text length, (2) lexical diversity, measured by the type-token ratio (TTR), (3) more third person usage and fewer self-references, and (4) use of emotional words, particularly adjectives.
First, deceptive texts were found to be longer than truthful ones [9, 10]. Table 1 already reports the average document length in our corpus. Contrary to that finding, fraudster’s texts are on average shorter (951 tokens per document) than non-fraudster’s texts (1427 tokens per document). However, note that 7 of the 35 fraudster documents are longer than the non-fraudster average, and that 18 of the 60 non-fraudster documents are shorter than the non-fraudster length average.
Second, deceptive texts were found to have lower lexical diversity or richness than truthful ones [9, 10]. The Type-Token Ratio (TTR), the number of distinct word forms (types) divided by the total number of tokens, is the usual measure of lexical diversity: the closer the ratio is to 1, the greater the diversity. For our documents, TTR is 0.67 for fraudster’s texts and 0.74 for non-fraudster’s texts. These values suggest that fraudster’s texts in Spanish also tend to have lower lexical diversity, as the literature reports for English. However, a Student t-test shows that the difference is not statistically significant.
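The TTR computation can be sketched as follows. This is only an illustration: the tokenizer (lowercased runs of letters) and the sample sentence are assumptions, and the source does not state how tokens were delimited or whether TTR was computed per document or over each group as a whole.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters as tokens.

    Assumption: digits and punctuation are not counted as tokens.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def type_token_ratio(text):
    """Return distinct word forms (types) divided by total tokens."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical sample sentence, not taken from the corpus.
sample = "quiero agradecer a los accionistas y a los empleados"
print(type_token_ratio(sample))
```

In the sample, ‘a’ and ‘los’ each occur twice, so the sentence has 7 types over 9 tokens.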
Third, several studies have shown that deceptive texts contain fewer self-references, for instance, first person pronouns in English (see [6, 12]). In Spanish, where an explicit subject is not obligatory, the choice between third and first person reference is better observed in verbal morphology than in pronouns. Examples are the forms ‘quiero’ (I want) and ‘tengo’ (I have to), which are normally used to express gratitude in expressions such as ‘quiero agradecer’ (I want to thank) at the beginning or the end of the letters.
We used the PoS-annotated version of the corpus to count the verb forms in the first and third person, singular and plural. The figures are shown in Table 2 as absolute and relative frequencies.
Relative frequencies show that fraudster Spanish letters do not contain fewer occurrences of first person verb forms, contrary to what the deception literature claims. However, note that the figures in Table 2 are quite similar for both classes, and the differences were not found to be statistically significant either.
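Counting verb forms by person and number from a PoS-tagged corpus can be sketched as below. The tag format is an assumption: the code expects EAGLES-style tags such as ‘VMIP1S0’, in which the fifth character encodes person and the sixth encodes number. The source does not name the tagset, and the tagged tokens are invented for illustration. Relative frequencies are computed here over the counted finite verb forms, which may differ from the normalization used in Table 2.

```python
from collections import Counter


def count_person_number(tagged_tokens):
    """Count finite verb forms by (person, number).

    tagged_tokens: iterable of (word, tag) pairs.
    Assumption: EAGLES-style tags, e.g. 'VMIP1S0', where tag[0] == 'V'
    marks a verb, tag[4] is the person (1/2/3) and tag[5] the number (S/P).
    Non-finite forms carry '0' in those slots and are skipped.
    """
    counts = Counter()
    for _word, tag in tagged_tokens:
        if len(tag) >= 6 and tag[0] == "V" and tag[4] in "123" and tag[5] in "SP":
            counts[(tag[4], tag[5])] += 1
    return counts


def relative_frequencies(counts):
    """Divide each count by the total number of counted verb forms."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: n / total for key, n in counts.items()}


# Hypothetical tagged tokens, not taken from the corpus.
tagged = [
    ("quiero", "VMIP1S0"),     # 1st person singular
    ("agradecer", "VMN0000"),  # infinitive, ignored
    ("la", "DA0FS0"),          # determiner, ignored
    ("empresa", "NCFS000"),    # noun, ignored
    ("creció", "VMIS3S0"),     # 3rd person singular
    ("seguimos", "VMIP1P0"),   # 1st person plural
]
counts = count_person_number(tagged)
print(counts)
print(relative_frequencies(counts))
```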
Finally, many authors have reported that deceptive texts contain more emotional words [17] and, in particular, more negative words than non-deceptive texts [2, 9, 10, 11, 13, 14, 16]. The literature shows that most of the differences in the number of positive and negative words come from the number of adjectives. Accordingly, to assess the differences in the number of positive and negative words, if any, we counted the occurrences of adjectives previously classified as positive or negative in the Spanish dictionaries provided by the SO-CAL resources [5]. We relied on the SO-CAL resources because they include one of the very few large polarity dictionaries available for Spanish. The SO-CAL Spanish adjective dictionaries that we used for the experiments contain inflected forms (tokens) in four lists that correspond to different sources, including machine-translated English sentiment dictionaries. For our study, we worked with 2200 adjectives (1105 negative and 1095 positive) taken from the SO-CAL source files that were not automatically translated. Our aim was to confirm whether the findings in the literature, namely that deceptive texts contain more emotional adjectives in general, and fewer positive and more negative ones in particular, also hold for fraudster’s texts.
The SO-CAL resources demonstrated high levels of accuracy in classifying the sentiment of texts from a range of domains: news articles, social media comments and blog posts. However, Loughran and McDonald [28] found that almost three-fourths of the negative words in an English general domain word list were not negative in a financial context. For instance, they found that such nouns as ‘tax’ and ‘cost’, which are neutral terms in the financial context, were classified as negative in a general domain word list. Moreover, [28] demonstrated that, because of the limited lexical variation (TTR) in these types of texts, very few words account for a large percentage of the total number of polar terms. Thus, to prevent the errors pointed out in [28], we decided to work only with adjectives that were found to be less subject to polarity changes depending on the domain [29] and to validate the polarity of the SO-CAL adjective lists using the method proposed by Hatzivassiloglou et al. [30]. This method is based on the idea that coordination with conjunctions, such as ‘and’ and ‘but’, imposes a constraint on the semantic orientation of the words being coordinated. According to this linguistic constraint, adjectives linked by an adjoining conjunction, such as ‘and’ in English or ‘y’ (‘e’ as a graphical variant) in Spanish, have the same semantic orientation. In contrast, adjectives coordinated by a disjoining conjunction, such as ‘but’ in English or ‘pero’ in Spanish, have opposite orientations. For example, the combination ‘justa y solidaria’ (fair and supportive) is found in our corpus and sounds natural, whereas ‘justa pero solidaria’ (fair but supportive) would be odd, because two positive adjectives cannot be coordinated with ‘but’.
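The coordination constraint can be turned into a simple consistency check, sketched below. This illustrates the idea only and is not the authors’ implementation or the full method in [30]. The lexicon entries, the example sentences, and the function name are hypothetical, and the sketch looks only at directly adjacent adjective-conjunction-adjective triples, without using PoS tags or sentence boundaries.

```python
import re

# Conjunctions that are expected to link adjectives of the same polarity
# ('y' and its variant 'e') or of opposite polarity ('pero').
SAME = {"y", "e"}
OPPOSITE = {"pero"}

# Hypothetical mini-lexicon: +1 = positive, -1 = negative.
LEXICON = {"justa": 1, "solidaria": 1, "rentable": 1, "difícil": -1}


def conflicting_pairs(text, lexicon):
    """Return (left, conjunction, right) triples that violate the constraint.

    A triple is flagged when both words are in the lexicon and their
    polarities are equal under 'pero' or different under 'y'/'e'.
    Flagged triples are candidates for manual polarity revision.
    """
    tokens = re.findall(r"\w+", text.lower())
    conflicts = []
    for left, conj, right in zip(tokens, tokens[1:], tokens[2:]):
        if conj not in SAME and conj not in OPPOSITE:
            continue
        if left in lexicon and right in lexicon:
            same_polarity = lexicon[left] == lexicon[right]
            if same_polarity != (conj in SAME):
                conflicts.append((left, conj, right))
    return conflicts


# Hypothetical sentences, not taken from the corpus.
text = (
    "Una sociedad justa y solidaria. "
    "Una decisión difícil pero rentable. "
    "Una gestión justa pero solidaria."
)
print(conflicting_pairs(text, LEXICON))
```

Only the last sentence is flagged, since it coordinates two positive adjectives with ‘pero’.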
In the list of SO-CAL adjectives, we found only four cases in which a change of polarity was necessary. For instance, although ‘político’ (political) was assigned a negative orientation score in the SO-CAL dictionary, in our texts it appeared in coordination exclusively with positive adjectives: ‘civil’ (civil), ‘económico’ (economic), and ‘social’ (social). The other three adjectives whose polarity was reversed after our revision were ‘cambiante’ (changeable), ‘comercial’ (marketing, business), and ‘simple’ (simple).
As shown in Table 3, fraudster’s documents in our corpus do contain more adjectives, both in total and when positive and negative ones are counted separately, which confirms the findings of Goel and Uzuner [17] for financial texts in Spanish as well.
Indeed, the figures in Table 3 show that, in our CEO letter corpus, the fraudster group contains more negative adjectives but also more positive ones. This contradicts the general finding for English that deceptive texts contain fewer positive words and more negative words [2, 9, 10, 11, 13, 14, 16]. Note that the relative frequency (RF) in Table 3 normalizes the number of occurrences to account for the different sizes of the compared corpora, as shown in Table 1.
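The lexicon-based count and its normalization can be sketched as follows. The word sets and the sample text are hypothetical stand-ins for the SO-CAL lists and the corpus, and dividing by the total number of tokens is an assumption about how RF is normalized, since the source does not give the denominator.

```python
def polarity_counts(tokens, positive, negative):
    """Return (positive_count, negative_count) for a list of tokens."""
    pos = sum(1 for tok in tokens if tok in positive)
    neg = sum(1 for tok in tokens if tok in negative)
    return pos, neg


def relative_frequency(count, total_tokens):
    """Normalize a raw count by corpus size (assumed: total tokens)."""
    return count / total_tokens if total_tokens else 0.0


# Hypothetical stand-ins for the positive and negative adjective lists.
POSITIVE = {"excelente", "sólido"}
NEGATIVE = {"difícil"}

# Hypothetical sample text, already lowercased and whitespace-tokenized.
tokens = "un año difícil pero con un resultado excelente y sólido".split()

pos, neg = polarity_counts(tokens, POSITIVE, NEGATIVE)
print(pos, neg)
print(relative_frequency(pos, len(tokens)), relative_frequency(neg, len(tokens)))
```

Normalizing in this way makes counts from the smaller fraudster corpus and the larger non-fraudster corpus comparable.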