The Modern Greek Language on the Social Web: A Survey of Data Sets and Mining Applications
Abstract
1. Introduction
2. Linguistic and Behavioral Patterns
2.1. Linguistic Patterns Analysis
2.1.1. NLP Tools
2.1.2. Linguistic Classification and Corpora
2.1.3. Argument Extraction
2.1.4. Authorship Attribution and Gender Identification
2.2. Offensive Behavior and Language Detection
2.2.1. Bullying in VLCs
2.2.2. Offensive Language on Twitter
3. Opinion-Mining
3.1. Politics and Voting Analysis
3.2. Marketing and Business Analysis
4. Discussion
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
Abbreviations

| Abbreviation | Definition |
|---|---|
| API | Application Programming Interface |
| BILOU | Beginning-Inside-Outside-Unit |
| BTO | Binary Term Occurrences |
| CRF | Conditional Random Fields |
| EAU | Event Analysis Unit |
| FF | Feed-Forward Neural Network |
| FN | False Negative |
| FP | False Positive |
| FST | Finite State Transducers |
| HITL | Human In The Loop |
| HTML | Hypertext Markup Language |
| kNN | k-Nearest Neighbor |
| LR | Logistic Regression |
| LSTM | Long Short-Term Memory |
| MCKL | Multiple Convolution Kernel Learning |
| ML | Machine Learning |
| MOOC | Massive Open Online Course |
| NERC | Named-Entity Recognition and Classification |
| NLP | Natural Language Processing |
| OMW | Open Multilingual Wordnet |
| PLC | Physical Learning Community |
| POS | Part-of-Speech |
| PPV | Positive Predictive Value |
| RF | Random Forest |
| SaaS | Software as a Service |
| SVM | Support Vector Machine |
| TF | Term Frequency |
| TF-IDF | Term Frequency-Inverse Document Frequency |
| TN | True Negative |
| TNR | True Negative Rate |
| TP | True Positive |
| TO | Term Occurrences |
| URL | Uniform Resource Locator |
| UTF | Unicode Transformation Format |
| VLC | Virtual Learning Community |
| WEKA | Waikato Environment for Knowledge Analysis |
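
Several of the abbreviations above (TP, FP, TN, FN, PPV, TNR) name quantities derived from a binary confusion matrix. As a minimal sketch of how they relate, the following Python function computes them from raw counts. The function name `confusion_metrics` and the example counts are hypothetical and are not taken from any of the surveyed studies.

```python
def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive common evaluation metrics from binary confusion-matrix counts."""

    def ratio(numerator: int, denominator: int) -> float:
        # Return 0.0 instead of raising when the denominator is empty.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),     # Positive Predictive Value (precision)
        "recall": ratio(tp, tp + fn),  # true positive rate
        "tnr": ratio(tn, tn + fp),     # True Negative Rate (specificity)
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Returning 0.0 for an empty denominator is a design choice made here for convenience; an evaluation script might prefer to raise an error or report the metric as undefined.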
References
1. Alexandridis, G.; Michalakis, K.; Aliprantis, J.; Polydoras, P.; Tsantilas, P.; Caridakis, G. A Deep Learning Approach to Aspect-Based Sentiment Prediction. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations, Neos Marmaras, Greece, 5–7 June 2020; Springer: Cham, Switzerland, 2020; pp. 397–408.
2. Nikiforos, M.N.; Kermanidis, K.L. A Supervised Part-Of-Speech Tagger for the Greek Language of the Social Web. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; European Language Resources Association: Marseille, France, 2020; pp. 3861–3867.
3. Markopoulos, G.; Mikros, G.; Iliadi, A.; Liontos, M. Sentiment analysis of hotel reviews in Greek: A comparison of unigram features. In Cultural Tourism in a Digital Era; Springer: Cham, Switzerland, 2015; pp. 373–383.
4. Nikiforos, S.; Tzanavaris, S.; Kermanidis, K.L. Virtual learning communities (VLCs) rethinking: Influence on behavior modification—Bullying detection through machine learning and natural language processing. J. Comput. Educ. 2020, 7, 531–551.
5. Petasis, G.; Spiliotopoulos, D.; Tsirakis, N.; Tsantilas, P. Sentiment analysis for reputation management: Mining the greek web. In Proceedings of the Hellenic Conference on Artificial Intelligence, Ioannina, Greece, 15–17 May 2014; Springer: Cham, Switzerland, 2014; pp. 327–340.
6. Pitenis, Z.; Zampieri, M.; Ranasinghe, T. Offensive language identification in greek. arXiv 2020, arXiv:2003.07459.
7. Sababa, H.; Stassopoulou, A. A classifier to distinguish between cypriot greek and standard modern greek. In Proceedings of the 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS), Valencia, Spain, 15–18 October 2018; pp. 251–255.
8. Tsakalidis, A.; Aletras, N.; Cristea, A.I.; Liakata, M. Nowcasting the stance of social media users in a sudden vote: The case of the Greek Referendum. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, Turin, Italy, 22–26 October 2018; pp. 367–376.
9. Vallet, D.; Fernandez, M.; Castells, P.; Mylonas, P.; Avrithis, Y. A contextual personalization approach based on ontological knowledge. In Proceedings of the 17th European Conference on Artificial Intelligence (ECAI 2006), Contexts and Ontologies: Theory, Practice and Applications, Riva del Garda, Italy, 28 August–1 September 2006.
10. Mikros, G.K. Authorship attribution and gender identification in Greek blogs. Methods Appl. Quant. Linguist. 2012, 21, 21–32.
11. Baxevanakis, S.; Gavras, S.; Mouratidis, D.; Kermanidis, K.L. A machine learning approach for gender identification of Greek tweet authors. In Proceedings of the 13th ACM International Conference on PErvasive Technologies Related to Assistive Environments, Corfu, Greece, 30 June–3 July 2020; pp. 1–4.
12. Kalamatianos, G.; Mallis, D.; Symeonidis, S.; Arampatzis, A. Sentiment analysis of Greek tweets and hashtags using a sentiment lexicon. In Proceedings of the 19th Panhellenic Conference on Informatics, Athens, Greece, 1–3 October 2015; pp. 63–68.
13. Goudas, T.; Louizos, C.; Petasis, G.; Karkaletsis, V. Argument extraction from news, blogs, and social media. In Hellenic Conference on Artificial Intelligence; Springer: Cham, Switzerland; Ioannina, Greece, 15–17 May 2014; pp. 287–299.
14. Goudas, T.; Louizos, C.; Petasis, G.; Karkaletsis, V. Argument extraction from news, blogs, and the social web. Int. J. Artif. Intell. Tools 2015, 24, 1540024.
15. Sardianos, C.; Katakis, I.M.; Petasis, G.; Karkaletsis, V. Argument extraction from news. In Proceedings of the 2nd Workshop on Argumentation Mining, Lisbon, Portugal, 17–21 September 2015; pp. 56–66.
16. Nikiforos, S.; Tzanavaris, S.; Kermanidis, K.L. Bullying Behavior and Project-based Activities in Virtual Learning Communities (VLCs). In Proceedings of the 2020 5th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM), Corfu, Greece, 25–27 September 2020; pp. 1–5.
17. Tzanavaris, S.; Nikiforos, S.; Mouratidis, D.; Kermanidis, K.L. Virtual Learning Communities (VLCs) rethinking: From negotiation and conflict to prompting and inspiring. Educ. Inf. Technol. 2020, 26, 257–278.
18. Pontiki, M.; Gavriilidou, M.; Gkoumas, D.; Piperidis, S. Verbal Aggression as an Indicator of Xenophobic Attitudes in Greek Twitter during and after the Financial Crisis. In Proceedings of the Workshop about Language Resources for the SSH Cloud, Marseille, France, 11–16 May 2020; pp. 19–26.
19. Lo, S.L.; Cambria, E.; Chiong, R.; Cornforth, D. Multilingual sentiment analysis: From formal to informal and scarce resource languages. Artif. Intell. Rev. 2017, 48, 499–527.
20. Cambria, E.; Das, D.; Bandyopadhyay, S.; Feraco, A. Affective computing and sentiment analysis. In A Practical Guide to Sentiment Analysis; Springer: Cham, Switzerland, 2017; pp. 1–10.
21. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2020; pp. 1–3, 11, 12.
22. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 2nd ed.; Prentice Hall: Upper Saddle River, NJ, USA, 2003; pp. 798–801, 852, 853.
23. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285.
24. Montague, P.R. Reinforcement learning: An introduction, by Sutton, RS and Barto, AG. Trends Cogn. Sci. 1999, 3, 360.
25. Van Otterlo, M.; Wiering, M. Reinforcement learning and markov decision processes. In Reinforcement Learning; Springer: Berlin/Heidelberg, Germany, 2012; pp. 3–42.
26. Petasis, G.; Karkaletsis, V.; Paliouras, G.; Androutsopoulos, I.; Spyropoulos, C.D. Ellogon: A new text engineering platform. arXiv 2002, arXiv:cs/0205017.
27. Goutte, C.; Gaussier, E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of the European Conference on Information Retrieval, Santiago de Compostela, Spain, 21–23 March 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 345–359.
28. Thanopoulos, A.; Kermanidis, K.; Fakotakis, N. Challenges in extracting terminology from Modern Greek texts. In Proceedings of the 3rd International Workshop on Text-Based Information Retrieval (TIR-06), Riva del Garda, Italy, 28 August–1 September 2006; p. 53.
29. Clackson, J. Indo-European Linguistics: An Introduction; Cambridge University Press: Cambridge, UK, 2007.
30. Barðdal, J.; Smitherman, T.; Bjarnadóttir, V.; Danesi, S.; Jenset, G.B.; McGillivray, B. Reconstructing constructional semantics: The dative subject construction in old norse-icelandic, latin, ancient greek, old russian and old lithuanian. Stud. Lang. Int. J. Spons. Found. Found. Lang. 2012, 36, 511–547.
31. Sido, J.; Pražák, O.; Přibáň, P.; Pašek, J.; Seják, M.; Konopík, M. Czert–Czech BERT-like Model for Language Representation. arXiv 2021, arXiv:2103.13031.
32. Husain, F.; Uzuner, O. A Survey of Offensive Language Detection for the Arabic Language. ACM Trans. Asian Low-Resour. Lang. Inf. Process. (TALLIP) 2021, 20, 1–44.
33. Lopez, C.E.; Vasu, M.; Gallemore, C. Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset. arXiv 2020, arXiv:2003.10359.
34. Vilares, D.; Peng, H.; Satapathy, R.; Cambria, E. BabelSenticNet: A commonsense reasoning framework for multilingual sentiment analysis. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 1292–1298.
35. Athanasiou, V.; Maragoudakis, M. A novel, gradient boosting framework for sentiment analysis in languages where NLP resources are not plentiful: A case study for modern Greek. Algorithms 2017, 10, 34.
36. Chatzikyriakidis, S. Clitics in Four Dialects of Modern Greek: A Dynamic Account. Ph.D. Thesis, University of London, London, UK, 2010.
37. Sosoni, V.; Kermanidis, K.L.; Stasimioti, M.; Naskos, T.; Takoulidou, E.; Van Zaanen, M.; Castilho, S.; Georgakopoulou, P.; Kordoni, V.; Egg, M. Translation crowdsourcing: Creating a multilingual corpus of online educational content. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018.
38. Cambria, E.; Schuller, B.; Xia, Y.; Havasi, C. New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 2013, 28, 15–21.
39. Kermanidis, K.L.; Maragoudakis, M. Political sentiment analysis of tweets before and after the Greek elections of May 2012. Int. J. Soc. Netw. Min. 2013, 1, 298–317.
40. Charalampakis, B.; Spathis, D.; Kouslis, E.; Kermanidis, K. A comparison between semi-supervised and supervised text mining techniques on detecting irony in greek political tweets. Eng. Appl. Artif. Intell. 2016, 51, 50–57.
41. Charalampakis, B.; Spathis, D.; Kouslis, E.; Kermanidis, K. Detecting irony on greek political tweets: A text mining approach. In Proceedings of the 16th International Conference on Engineering Applications of Neural Networks (INNS), Rhodes, Greece, 25–28 September 2015; pp. 1–5.
42. Papanikolaou, K.; Papageorgiou, H.; Papasarantopoulos, N.; Stathopoulou, T.; Papastefanatos, G. “Just the Facts” with PALOMAR: Detecting Protest Events in Media Outlets and Twitter. In Proceedings of the International AAAI Conference on Web and Social Media, Cologne, Germany, 17–20 May 2016; Volume 10.
43. Papanikolaou, K.; Papageorgiou, H. Protest Event Analysis: A Longitudinal Analysis for Greece. In Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020, Marseille, France, 11–16 May 2020; pp. 57–62.
44. Antonakaki, D.; Spiliotopoulos, D.; Samaras, C.V.; Pratikakis, P.; Ioannidis, S.; Fragopoulou, P. Social media analysis during political turbulence. PLoS ONE 2017, 12, e0186836.
45. Tziovas, D. Greece in Crisis: The Cultural Politics of Austerity; Bloomsbury Publishing: London, UK, 2017.
46. Bond, F.; Fellbaum, C.; Hsieh, S.K.; Huang, C.R.; Pease, A.; Vossen, P. A multilingual lexico-semantic database and ontology. In Towards the Multilingual Semantic Web; Springer: Berlin/Heidelberg, Germany, 2014; pp. 243–258.
47. Alessia, D.; Ferri, F.; Grifoni, P.; Guzzo, T. Approaches, tools and applications for sentiment analysis implementation. Int. J. Comput. Appl. 2015, 125.
48. Charalabidis, Y.; Loukis, E.N.; Androutsopoulou, A.; Karkaletsis, V.; Triantafillou, A. Passive crowdsourcing in government using social media. Transform. Gov. People Process Policy 2014, 8, 283–308.
49. Ramaswamy, V.; Gatignon, H.; Reibstein, D.J. Competitive marketing behavior in industrial markets. J. Mark. 1994, 58, 45–55.
50. Aldayel, H.K.; Azmi, A.M. Arabic tweets sentiment analysis–a hybrid scheme. J. Inf. Sci. 2016, 42, 782–797.
51. Psomakelis, E.; Tserpes, K.; Anagnostopoulos, D.; Varvarigou, T. Comparing methods for twitter sentiment analysis. arXiv 2015, arXiv:1505.02973.
52. Tripathi, P.; Vishwakarma, S.K.; Lala, A. Sentiment analysis of english tweets using rapid miner. In Proceedings of the 2015 International Conference on Computational Intelligence and Communication Networks (CICN), Jabalpur, India, 12–14 December 2015; pp. 668–672.
53. Shoemark, P.; Kirby, J.; Goldwater, S. Inducing a lexicon of sociolinguistic variables from code-mixed text. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Brussels, Belgium, 1 November 2018; pp. 1–6.
54. Trye, D.; Calude, A.S.; Bravo-Marquez, F.; Keegan, T.T.A.G. Māori loanwords: A corpus of New Zealand English tweets. In Proceedings of the Vocab@ Leuven 2019, Florence, Italy, 1–3 July 2019.
55. Erdmann, A.; Habash, N. Complementary strategies for low resourced morphological modeling. In Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology, Brussels, Belgium, 31 October 2018; pp. 54–65.
56. Foster, J.; Cetinoglu, O.; Wagner, J.; Le Roux, J.; Hogan, S.; Nivre, J.; Hogan, D.; Van Genabith, J. # hardtoparse: POS Tagging and Parsing the Twitterverse. In Proceedings of the AAAI-11 Workshop on Analyzing Microtext, San Francisco, CA, USA, 7–11 August 2011.
57. Bach, N.X.; Linh, N.D.; Phuong, T.M. An empirical study on POS tagging for Vietnamese social media text. Comput. Speech Lang. 2018, 50, 1–15.
58. Öztürk, N.; Ayvaz, S. Sentiment analysis on Twitter: A text mining approach to the Syrian refugee crisis. Telemat. Inform. 2018, 35, 136–147.
59. Carneiro, H.C.; França, F.M.; Lima, P.M. Multilingual part-of-speech tagging with weightless neural networks. Neural Netw. 2015, 66, 11–21.
60. Gimpel, K.; Schneider, N.; O’Connor, B.; Das, D.; Mills, D.; Eisenstein, J.; Heilman, M.; Yogatama, D.; Flanigan, J.; Smith, N.A. Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, Portland, OR, USA, 19–24 June 2011; pp. 42–47.
61. Gao, W.; Fang, Y.; Wang, Y.; Zhang, F. HRCE: Detecting Food Security Events in Social Media. J. Phys. Conf. Ser. 2020, 1437, 012090.
62. Popescu, A.M.; Pennacchiotti, M. Detecting controversial events from twitter. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, Toronto, ON, Canada, 26–30 October 2010; pp. 1873–1876.
63. Popescu, A.M.; Pennacchiotti, M.; Paranjpe, D. Extracting events and event descriptions from twitter. In Proceedings of the 20th International Conference Companion on World Wide Web, Hyderabad, India, 28 March–1 April 2011; pp. 105–106.

| No. | Note |
|---|---|
| 1. | |
| 2. | |
| 3. | |
| 4. | |
| 5. | https://hilab.di.ionio.gr/index.php/en/datasets/ (accessed on 15 November 2020). |
| 6. | Refers to accounts that are not directly related to specific individuals, usually public figures. |
| 7. | https://rapidminer.com/ (accessed on 18 February 2021). |
| 8. | http://hashtag.nonrelevant.net/downloads.html (accessed on 18 February 2021). |
| 9. | https://github.com/hb20007/greek-dialect-classifier (accessed on 15 November 2020). |
| 10. | https://sklearn.org/ (accessed on 15 November 2020). |
| 11. | Unicode Transformation Format. |
| 12. | https://github.com/mixstef/tramooc (accessed on 19 November 2020). |
| 13. | https://appen.com/ (accessed on 19 November 2020). |
| 14. | http://www.cs.waikato.ac.nz/~ml/weka (accessed on 23 December 2020). |
| 15. | http://nomad-project.eu/en (accessed on 23 December 2020). |
| 16. | Only the corpus containing news can be redistributed for research purposes. |
| 17. | A technique for NLP that employs a two-layer neural net that processes text by creating a vector of real numbers to represent a word. |
| 18. | Corpus and code can be provided upon request. |
| 19. | A multi-set of words based on a simplified representation for NLP and Information Retrieval. |
| 20. | |
| 21. | https://sites.google.com/site/offensevalsharedtask/home (accessed on 19 January 2021). |
| 22. | https://scrapy.org/ (accessed on 15 November 2020). |
| 23. | http://insideairbnb.com/, https://www.pewresearch.org/download-datasets/, https://guides.library.cornell.edu/polling_survey_online, www.tripadvisor.com.gr (accessed on 15 November 2020). |
| 24. | Available for research purposes upon request. |
| 25. | Sets of cognitive synonyms. |
| 26. | http://compling.hss.ntu.edu.sg/omw/ (accessed on 16 April 2021). |
| 27. | http://sentistrength.wlv.ac.uk/ (accessed on 16 April 2021). |
| 28. | http://socialsensor.iti.gr/ (accessed on 16 April 2021). |
| 29. | POS, lemmatization, chunking and parsing. |
| 30. | https://palopro.io/en/ (accessed on 15 January 2021). |
| 31. | www.tripadvisor.com.gr (accessed on 15 January 2021). |
| 32. | A text is represented as the total of the contained words. |
| 33. | TF-IDF bag-of-words and TO. |
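
Notes 19, 32 and 33 above describe the bag-of-words representation and its weighting schemes (BTO, TO, TF and TF-IDF in the abbreviation list). The following Python sketch shows one way these four weightings can be computed for tokenized documents. It rests on stated assumptions: TF is normalized by document length and IDF is the unsmoothed ln(N/df). Toolkits differ on both choices, so this is not necessarily the variant used in any surveyed study. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def term_occurrences(tokens):
    """TO: raw count of each term in one document."""
    return dict(Counter(tokens))


def binary_term_occurrences(tokens):
    """BTO: 1 for every term that appears at least once."""
    return {term: 1 for term in set(tokens)}


def term_frequency(tokens):
    """TF: term count divided by document length (an assumed normalization)."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def tf_idf(documents):
    """TF-IDF with the unsmoothed idf = ln(N / df) (an assumed variant)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        tf = term_frequency(doc)
        weighted.append(
            {term: value * math.log(n_docs / doc_freq[term]) for term, value in tf.items()}
        )
    return weighted


# Toy documents, already tokenized and stripped of tone marks (hypothetical data).
docs = [
    ["καλο", "ξενοδοχειο", "καλο"],
    ["κακο", "ξενοδοχειο"],
]
weights = tf_idf(docs)
```

With this IDF variant, a term that occurs in every document (here "ξενοδοχειο") receives a weight of zero, which is why TF-IDF tends to emphasize terms that discriminate between documents.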

| Paper | Social Media | Data | Methods | Tool |
|---|---|---|---|---|
| [2] | Twitter | 2405 tweets; 31,697 tokens (April 2019) | tokenization, normalization, encoding, annotation | POS tagger |
| [12] | Twitter | 4,373,197 tweets; 30,778 users; 54,354 hashtags (April 2008–November 2014) | automated & manual rating, removal: stop words & tone marks, stemming, uppercase | Sentiment analysis lexicon |
| [7] | forums, blogs | 1039 sentences; 7026 words (Cypriot Greek); 7100 words (Modern Greek) (March–April 2018) | anonymization, manual annotation, removal: tabs, newlines, duplicate punctuation, insertion: spaces, n-grams, encoding, tokenization | Bidialectal classifier |
| [37] | MOOC | multilingual corpus; course forum text; quiz assessment text; subtitles of online video lectures | conversion into plain text, removal: special characters, non-content lines, multiple whitespaces, tokenization, sentence segmentation, special elements markup | - |
| [13] | Twitter, news, blogs, sites | 204 documents; 16,000 sentences: 760 argumentative | manual annotation, tokenization, sentence splitting, POS tagging, feature selection, gazetteer lists, lexica, TF-IDF | - |
| [14] | Twitter, news, blogs, sites | 204 documents; 16,000 sentences: 760 argumentative | manual annotation, tokenization, sentence splitting, POS tagging, feature selection, gazetteer lists, lexica, TF-IDF, comparison with NOMAD data set | - |
| [15] | news, blogs | 1st: 77 million documents; 2nd: 300 news articles, 1191 argumentative segments | POS tagging, cue words, distributed representations of words, feature extraction, sentiment analysis, lowercase | - |
| [10] | Blogs | 1000 blog posts; 406,460 words (September 2010–August 2011) | stylometric variables, character & word uni-grams, bi-grams, tri-grams, feature extraction | Authorship attribution & author’s gender identification |
| [11] | Twitter | 45,848 tweets | removal: stop words, encoding: Bag-of-Words, TF-IDF | Author’s gender identification |
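
The Methods column above repeatedly lists the same preprocessing steps for Greek social text: lowercasing, removal of tone marks and stop words, tokenization and n-gram extraction. The Python sketch below illustrates these steps with the standard library only. It is a simplified illustration and not the pipeline of any cited paper; the function names, the two-word stop list and the example sentence are hypothetical.

```python
import re
import unicodedata


def strip_tone_marks(text):
    """Remove combining marks (Greek tonos, dialytika) via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def preprocess(text, stop_words):
    """Lowercase, drop URLs and tone marks, tokenize, and remove stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    text = strip_tone_marks(text)
    tokens = re.findall(r"\w+", text)  # \w matches Greek letters in Python 3
    # Stop words are compared after tone-mark removal, so the list must be unaccented.
    return [tok for tok in tokens if tok not in stop_words]


def ngrams(tokens, n):
    """Return the contiguous word n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical stop-word list and example sentence.
STOP_WORDS = {"σε", "και"}
tokens = preprocess("Καλημέρα σε όλους και καλή εβδομάδα https://example.com", STOP_WORDS)
bigrams = ngrams(tokens, 2)
```

Stripping every combining mark also removes the dialytika, which can merge distinct word forms; whether that trade-off is acceptable depends on the task.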

| Paper | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|
| [2] | Naive Bayes, ID3 | accuracy up to 99.87% | 1st data set for Greek social text; 1st tag set; 1st supervised POS tagger | larger data sets; data from different social media; syntactic & semantic analysis tools; linguistic diversity by region; tracking controversial events & mapping connections with users |
| [12] | Pearson & Kendall correlation | sentiment correlation | public benchmark data set; set of intensity rated tweets; automated method for detecting intensity (tweets & hashtags); temporal changes in intensity (hashtags) | lexicon for social text; more linguistic data; larger data set & number of raters |
| [7] | Naive Bayes, SVM, LR | 95% mean accuracy | 1st classifying Greek dialects in social text; bidialectal corpus & classifier; most informative features | applications in social media moderation and academic research; larger corpus including POS; detecting dialects prior to online translation; extension with Greeklish, Pontic & Cretan Greek; distinction between Katharevousa & Ancient Greek |
| [37] | - | - | multilingual parallel corpus to train, tune, test machine translation engines; translation crowd-sourcing experiment; examination of difficulties: text genre, language pairs, large data volume, quality assurance, crowd-sourcing workflow | - |
| [13] | LR, RF, SVM, CRF | accuracy up to 77.4% | 2-step argument extraction; novel corpus; most determinant features | more features & algorithms; testing of Markov models |
| [14] | LR, RF, SVM, CRF | accuracy up to 77.4% | 2-step argument extraction; novel corpus; most determinant features | more features & algorithms; testing of Markov models; comparing performance with approaches for English; experiments with unsampled data |
| [15] | word2vec, CRF | up to 39.7% precision; 27.59% recall; 32.53% F1 score | semi-supervised multi-domain method for argument extraction; novel corpus | extending the gazetteer lists; bootstrapping on CRF; more algorithms; patterns based on verbs and POS; grammatical inference algorithm |
| [10] | SVM | accuracy 85.4% & 82.6% | tool for authorship attribution & author’s gender identification with many candidates; novel social text corpus; 10 most determinant features | - |
| [11] | SVM | accuracy up to 70% | novel, manually annotated corpus; NLP framework for gender identification of the author | more features combining gender & age; neural networks |
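
The table above reports precision, recall and F1 score together for [15]. As a quick sanity check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the other two figures. The sketch below does this in Python; the function name `f1_score` is hypothetical, and the inputs are the rounded percentages from the table, so the result should only be expected to agree with the reported 32.53% to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Precision and recall as listed for [15] in the table above.
f1 = f1_score(0.397, 0.2759)
print(f"{f1:.4f}")
```

Because the harmonic mean is dominated by the smaller of the two values, the low recall pulls the F1 score well below the reported precision.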

| Paper | Social Media | Data | Methods | Tool |
|---|---|---|---|---|
| [4] | VLCs, Wikispaces | 500 dialogue segments (VLC-1); 83 dialogue segments (VLC-2) | anonymization, segmentation in periods, manual annotation, lowercase, tokenization, n-grams, removal: stop words, stemming, pruning of low/high-frequency terms, length filtering | Detection of bullying behavior |
| [16] | VLCs, Wikispaces | 126 dialogue segments; 1167 dialogue segments | anonymization, segmentation in periods | Detection of bullying behavior |
| [17] | VLCs, Google Docs | activity log files, dialogue text, questionnaires, interviews | semantic segmentation, annotation | Discourse & artifacts analysis |
| [6] | Twitter | 4779 tweets (May–June 2019) | keyword search, removal: emoticons, URLs, accentuation, normalization, lowercase, manual annotation, TF-IDF, n-grams, POS tags, word embeddings, LSTM | - |
| [18] | Twitter | 4,490,572 tweets (2013–2016) | keyword search, knowledge representation, computational analysis, data visualization, tokenization, sentence splitting, POS tagging, lemmatization | - |

| Paper | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|
| [4] | Naive Bayes, Naive Bayes Kernel, ID3, Decision Tree, Feed-forward NN, Rule induction, Gradient boosted trees | accuracy up to 94.2% | 1st study of the influence of VLCs on behavior modification regarding bullying; NLP & ML framework for automatic detection of aggressive behavior & bullying; authentic humanistic data collected under real conditions | - |
| [16] | Text analysis & annotation, t-test | - | authentic humanistic data collected under real conditions | - |
| [17] | Struggle Analysis Framework | - | collaboration assessment; action analysis; interaction analysis; evaluation of presentations & dialogues | - |
| [6] | SVM, Stochastic Gradient Descent, Naive Bayes, 6 deep learning models | F1 score 89% | 1st Greek annotated data set for offensive language identification | - |
| [18] | - | - | framework for verbal aggression analysis; verbal attacks against target groups; xenophobic attitudes during the Greek financial crisis | extending to other types of attacks; including other languages for cross-country & cross-cultural comparisons |

| Paper | Social Media | Data | Methods | Tool | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|---|---|---|---|
| [39] | Twitter | 57,424 tweets (April to May 2012) | sentiment analysis, TF distribution | - | - | - | confirmation of the alignment between actual and social web-based political sentiment | implementation of more sophisticated text analysis techniques |
| [41] | Twitter | 61,427 tweets (May 2012) divided into Parties & Leaders; 44,438 tweets (after cleanup) | text classification, semantic analysis | OMW | NLTK | precision 82.4% | real-world application of irony detection | use of stemmer/lemmatizer, tool unavailability, small manually trained data set |
| [40] | Twitter | 61,427 tweets (May 2012) divided into Parties & Leaders; 44,438 tweets (after cleanup) | collective classification | OMW | J48, Naive Bayes, Functional Trees, K-Star, RF, SVM, Neural Networks | Supervised: Functional Trees 82.4%; Semi-supervised: RF 83.1% | - | application with Word Vector or Deep Learning |
| [44] | Twitter | 48,000 tweets in two data sets (July & September 2015) | data collection and entity identification, volume analysis, entity co-occurrence, sentiment analysis and topic modeling | SentiStrength | - | highlight the societal and political trends | political domain analysis | bot recognition |
| [8] | Twitter | 14.62M tweets, 283 Greek “stopwords” | convolutional kernels | User Voting intention modeling | SVM, LR, FF, RF | MCKL = 0.02% | real time systematic study on nowcasting the voting intention | annotating a random sample of Twitter users for increased performance |
| [42] | Twitter & Digital news media | 540,989 articles (1996–2014) | PEA & NERC | NLP, NERC, EAU and FST | - | quantitative and qualitative | - | enrichment of sociopolitical event categories |
| [43] | Twitter & Digital news media | 540,989 articles & 166,100,543 tweets (1996–2014) | PEA & NERC | NLP, NERC, EAU and FST | - | quantitative and qualitative | - | enrichment of sociopolitical event categories |

| Paper | Social Media | Data | Methods | Tool | Algorithms | Results | Contribution | Open Issues |
|---|---|---|---|---|---|---|---|---|
| [5] | PaloPro | Blogs, Twitter and Facebook posts | sentiment analysis, reputation management, brand monitoring | OpinionBuster | NLP, CRFs | performance > 93% | sentiment and polarity detection of a word in its context | further optimization |
| [3] | | | | | SVM classifier | - | effectiveness of TF-IDF for automatic sentiment classifier for hotel reviews | further use of contextual valence shifters |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Nikiforos, M.N.; Voutos, Y.; Drougani, A.; Mylonas, P.; Kermanidis, K.L. The Modern Greek Language on the Social Web: A Survey of Data Sets and Mining Applications. Data 2021, 6, 52. https://doi.org/10.3390/data6050052