Gender Prediction of Generated Tweets Using Generative AI
Abstract
1. Introduction
- We collected a dataset containing gender-specific GenAI-generated tweets produced with ChatGPT (OpenAI 2024), together with human-authored tweets labeled by gender.
- We present a method for collecting a hashtag-tagged dataset that samples trending hashtags across different time periods. This temporal sampling is intended to yield a balanced and representative sample of tweets.
- We employed a two-stage feature selection method to identify the most discriminative features for gender prediction. This involved analyzing term frequencies and applying the Chi-square test to select features with high discriminative scores that significantly contribute to distinguishing gender-specific language in tweets.
- We validated the method experimentally with five Machine Learning (ML) classifiers: Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT), Random Forest (RF), and Multi-Layer Perceptron (MLP). The results show that the gender of text content in GenAI-generated tweets can be predicted with up to 90% accuracy (MLP trained on all selected features).
2. Related Works
3. Materials and Methods
3.1. Motivation
3.2. Dataset
3.3. Approach
Algorithm 1. Selecting the top terms based on their discriminative scores

    // Input: List of tweets for class 1 (e.g., male) and class 2 (e.g., female)
    // Output: Selected top features for gender prediction

    // Calculate Term Frequencies
    function calculate_term_frequencies(tweets, class):
        initialize term_freq as an empty dictionary
        initialize total_length as 0
        for each tweet in tweets:
            words = preprocess(tweet)
            total_length += length of words
            for each word in words:
                if word not in term_freq:
                    term_freq[word] = 0
                term_freq[word] += 1
        return term_freq, total_length

    // Calculate Probability, Chi-Square, and Discriminative Score
    function calculate_discriminative_score(term_freq_c1, term_freq_c2, total_length_c1, total_length_c2):
        initialize scores as an empty dictionary
        for each term in union of keys in term_freq_c1 and term_freq_c2:
            tf_c1 = term_freq_c1.get(term, 0)
            tf_c2 = term_freq_c2.get(term, 0)
            p_c1 = tf_c1 / total_length_c1
            p_c2 = tf_c2 / total_length_c2
            delta_p = absolute value of (p_c1 - p_c2)
            e_c1 = (tf_c1 + tf_c2) * total_length_c1 / (total_length_c1 + total_length_c2)
            e_c2 = (tf_c1 + tf_c2) * total_length_c2 / (total_length_c1 + total_length_c2)
            chi_square = ((tf_c1 - e_c1)^2 / e_c1) + ((tf_c2 - e_c2)^2 / e_c2)
            scores[term] = delta_p + chi_square
        return scores

    // Feature Selection
    function select_top_features(scores, top_n):
        sorted_terms = sort scores by value in descending order
        return first top_n items from sorted_terms

    // Main Function
    function main(tweets_c1, tweets_c2, top_n):
        term_freq_c1, total_length_c1 = calculate_term_frequencies(tweets_c1, 'class1')
        term_freq_c2, total_length_c2 = calculate_term_frequencies(tweets_c2, 'class2')
        scores = calculate_discriminative_score(term_freq_c1, term_freq_c2, total_length_c1, total_length_c2)
        top_features = select_top_features(scores, top_n)
        return top_features
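The pseudocode of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the `preprocess` tokenizer (lowercasing plus a simple regular expression) is an assumption, since the paper does not specify the preprocessing step, and the unused `class` argument is dropped. Ties in score are broken alphabetically so that the output is deterministic, which is also an assumption.

```python
import re
from collections import Counter


def preprocess(tweet):
    # Assumed preprocessing (not specified in the paper): lowercase the text
    # and keep word, hashtag, and mention tokens.
    return re.findall(r"[#@]?\w+", tweet.lower())


def calculate_term_frequencies(tweets):
    # Count every token over all tweets of one class.
    term_freq = Counter()
    for tweet in tweets:
        term_freq.update(preprocess(tweet))
    return term_freq, sum(term_freq.values())


def calculate_discriminative_scores(tf_c1, tf_c2, len_c1, len_c2):
    if len_c1 == 0 or len_c2 == 0:
        raise ValueError("both classes need at least one token")
    total = len_c1 + len_c2
    scores = {}
    for term in tf_c1.keys() | tf_c2.keys():
        f1, f2 = tf_c1.get(term, 0), tf_c2.get(term, 0)
        # Absolute difference of the relative frequencies in the two classes.
        delta_p = abs(f1 / len_c1 - f2 / len_c2)
        # Expected counts if the term were distributed independently of class.
        e1 = (f1 + f2) * len_c1 / total
        e2 = (f1 + f2) * len_c2 / total
        chi_square = (f1 - e1) ** 2 / e1 + (f2 - e2) ** 2 / e2
        scores[term] = delta_p + chi_square
    return scores


def select_top_features(scores, top_n):
    # Sort by descending score; break ties alphabetically (assumption).
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Example tweets taken from the paper's table of GenAI-generated tweets.
male_tweets = [
    "Just achieved a new milestone in our project! #success",
    "Optimize your workflow with these tools. #productivity",
]
female_tweets = [
    "So excited to share this milestone with everyone! #success",
    "These tools can really help us streamline our tasks! #productivity",
]

tf_m, len_m = calculate_term_frequencies(male_tweets)
tf_f, len_f = calculate_term_frequencies(female_tweets)
scores = calculate_discriminative_scores(tf_m, tf_f, len_m, len_f)
for term, score in select_top_features(scores, 5):
    print(f"{term}\t{score:.4f}")
```

Terms used equally often in both classes receive a score near zero, while terms that appear in only one class rank highest.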
- Top 500 Features: We first selected the top 500 most discriminative features based on their scores. These features are expected to have the highest impact on distinguishing between male and female language in GenAI-generated tweets and human-authored tweets.
- Top 1000 Features: In the second category, we extended our selection to the top 1000 most discriminative features. By including a larger set of features, we aim to capture more of the characteristic variation in gender-specific language. This broader selection helps ensure that fine-grained but potentially important linguistic patterns are not overlooked.
- All Selected Features: Finally, we compiled a comprehensive set of all the features that were identified as discriminative, regardless of their rank. This complete set includes every term that demonstrates a statistically significant difference in usage between male and female categories. Using this extensive set allows us to fully explore the complexity of gender-specific language in GenAI-generated tweets and human-authored tweets and provides a robust basis for our predictive models.
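The three feature categories above can be derived from a single ranked term list, and each tweet can then be mapped to a count vector over the chosen set. The sketch below assumes the ranked list already contains only the terms the selection step kept; the paper does not state the significance criterion, so no threshold is applied here. The helper names `build_feature_sets` and `vectorize` are hypothetical.

```python
from collections import Counter


def build_feature_sets(ranked_terms, sizes=(500, 1000)):
    # ranked_terms: terms sorted by descending discriminative score,
    # assumed to be pre-filtered to the discriminative terms only.
    feature_sets = {f"top_{n}": ranked_terms[:n] for n in sizes}
    feature_sets["all"] = list(ranked_terms)
    return feature_sets


def vectorize(tokens, features):
    # Bag-of-words count vector restricted to the selected features.
    counts = Counter(tokens)
    return [counts[f] for f in features]


# Toy ranking with small sizes so the slicing is easy to follow.
ranked = ["excited", "optimize", "together", "achieve"]
sets = build_feature_sets(ranked, sizes=(2, 3))
print(sets["top_2"])
print(vectorize(["so", "excited", "to", "optimize", "optimize"], sets["top_3"]))
```

The resulting vectors are the kind of input that the classifiers in the evaluation could be trained on, once for each feature set.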
4. Experimental Results
4.1. Evaluation
- Feature 500: When trained on the 500-feature set, the MLP classifier performs best across all metrics, with an accuracy of 83%. SVM follows at 81%, then DT at 80% and RF at 78%. NB performs worst on this feature set, with an accuracy of 76%.
- Feature 1000: When trained on the 1000-feature set, the MLP again outperforms the other classifiers, with an accuracy of 86%. SVM also performs strongly at 84%, followed by DT at 81% and RF at 80%. NB improves slightly over the 500-feature set, to 77%. Every classifier improves relative to the 500-feature set, indicating that increasing the number of features enhances model performance.
- All Features: When trained on all selected features, the MLP achieves the highest scores across all metrics, with an accuracy of 90%, followed by SVM at 87%, DT at 85%, and RF at 84%. NB, though improved, remains the lowest performer with an accuracy of 80%.
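The accuracies reported above can be tabulated to make the trend across feature sets explicit. The short script below only restates those reported figures (in percent) and computes each classifier's gain and rank; the helper names are illustrative.

```python
# Accuracy (%) as reported above for each classifier and feature set.
ACCURACY = {
    "MLP": {"500": 83, "1000": 86, "all": 90},
    "SVM": {"500": 81, "1000": 84, "all": 87},
    "RF": {"500": 78, "1000": 80, "all": 84},
    "DT": {"500": 80, "1000": 81, "all": 85},
    "NB": {"500": 76, "1000": 77, "all": 80},
}


def gains(acc, start="500", end="all"):
    # Percentage-point improvement between two feature sets.
    return {clf: s[end] - s[start] for clf, s in acc.items()}


def ranking(acc, feature_set):
    # Classifiers ordered from highest to lowest accuracy.
    return sorted(acc, key=lambda clf: -acc[clf][feature_set])


print(gains(ACCURACY))
for fs in ("500", "1000", "all"):
    print(fs, ranking(ACCURACY, fs))
```

On these figures the ordering of classifiers is the same for every feature set (MLP, SVM, DT, RF, NB), and MLP gains the most from the 500-feature set to the full set.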
4.2. Observation
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Alowibdi, J.S. A human-authored or GenAI-generated: Who is creating the content. Eng. Technol. Appl. Sci. Res. 2024; in press.
- Alowibdi, J.S.; Buy, U.A.; Yu, P.S. Language Independent Gender Classification on Twitter. In Proceedings of the 2013 12th International Conference on Machine Learning and Applications, Niagara, ON, Canada, 25–28 August 2013; Volume 1, pp. 365–369.
- Alowibdi, J.S.; Buy, U.A.; Yu, P.S. Empirical Evaluation of Profile Characteristics for Gender Classification on Twitter. In Proceedings of the 2013 12th International Conference on Machine Learning and Applications, Miami, FL, USA, 4–7 December 2013; Volume 1, pp. 365–369.
- Alowibdi, J.S.; Buy, U.A.; Yu, P.S.; Ghani, S.; Mokbel, M. Deception Detection in Twitter. Soc. Netw. Anal. Min. 2015, 5, 32.
- OpenAI. ChatGPT (March 15 Version) [Large Language Model]. 2024. Available online: https://chat.openai.com (accessed on 20 May 2024).
- Lai, J.W. Adapting Self-Regulated Learning in an Age of Generative Artificial Intelligence Chatbots. Future Internet 2024, 16, 218.
- Susnjak, T.; McIntosh, T.R. ChatGPT: The End of Online Exam Integrity? Educ. Sci. 2024, 14, 656.
- Ali, D.; Fatemi, Y.; Boskabadi, E.; Nikfar, M.; Ugwuoke, J.; Ali, H. ChatGPT in Teaching and Learning: A Systematic Review. Educ. Sci. 2024, 14, 643.
- Gu, J. Responsible Generative AI: What to Generate and What Not. arXiv 2024, arXiv:2404.05783.
- García-Peñalvo, F.; Vázquez-Ingelmo, A. What do we mean by GenAI? A systematic mapping of the evolution, trends, and techniques involved in Generative AI. Int. J. Interact. Multimed. Artif. Intell. 2023, 8.
- Kumar, R.; Mindzak, M. Who Wrote This? Detecting Artificial Intelligence–Generated Text from Human-Written Text. Can. Perspect. Acad. Integr. 2024, 7.
- Yan, L.; Martinez-Maldonado, R.; Gasevic, D. Generative Artificial Intelligence in Learning Analytics: Contextualising Opportunities and Challenges through the Learning Analytics Cycle. In Proceedings of the 14th Learning Analytics and Knowledge Conference, Kyoto, Japan, 18–22 March 2024; pp. 101–111.
- Peersman, C.; Daelemans, W.; Van Vaerenbergh, L. Predicting age and gender in online social networks. In Proceedings of the 3rd International Workshop on Search and Mining User-Generated Contents, Glasgow, UK, 28 October 2011; pp. 37–44.
- Merler, M.; Cao, L.; Smith, J.R. You are what you tweet… pic! Gender prediction based on semantic analysis of social media images. In Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), Turin, Italy, 29 June–3 July 2015.
- Çelik, Ö.; Aslan, A.F. Gender prediction from social media comments with artificial intelligence. Sak. Univ. J. Sci. 2019, 23, 1256–1264.
- Reddy, T.R.; Vardhan, B.V.; Reddy, P.V. N-gram approach for gender prediction. In Proceedings of the 2017 IEEE 7th International Advance Computing Conference (IACC), Hyderabad, India, 5–7 January 2017; pp. 860–865.
- Krüger, S.; Hermann, B. Can an online service predict gender? On the state-of-the-art in gender identification from texts. In Proceedings of the 2019 IEEE/ACM 2nd International Workshop on Gender Equality in Software Engineering (GE), Montreal, QC, Canada, 27 May 2019.
- Bamman, D.; Eisenstein, J.; Schnoebelen, T. Gender identity and lexical variation in social media. J. Socioling. 2014, 18, 135–160.
Feature | Male GenAI-Generated | Female GenAI-Generated |
---|---|---|
Lexical and Vocabulary | Use of assertive and technical vocabulary (e.g., “achieve”, “optimize”) | Use of collaborative and empathetic vocabulary (e.g., “support”, “understanding”) |
Structure and Syntax | Direct and straightforward sentences; focus on facts and outcomes | Complex sentence structures; conversational and engaging style |
Use of Pronouns | Frequent use of “I” and “we,” emphasizing individual/group achievements | Inclusive pronouns like “we”, “us”, and frequent “you” for direct engagement |
Emotional Tone | Neutral or objective tone; minimal emotional expression | Wide range of emotions; empathy, warmth, and support |
Hashtag Usage | Related to industry-specific topics, technology, current events; typically placed at the end | Related to social issues, personal experiences, community-building; integrated into the tweet body |
Punctuation and Grammar | Formal punctuation; fewer grammatical errors; less frequent use of exclamation marks | Expressive punctuation; use of exclamation marks, ellipses; personal touch |
Use of Emojis | Less frequent use of emojis; professional contexts | Frequent use of emojis; enhance emotional expression and relatability |
Male GenAI-Generated Tweets | Female GenAI-Generated Tweets |
---|---|
Just achieved a new milestone in our project! #success | So excited to share this milestone with everyone! #success |
Optimize your workflow with these tools. #productivity | These tools can really help us streamline our tasks! #productivity |
Our team will be discussing the new strategy tomorrow. #business | Can’t wait to brainstorm the new strategy with the team tomorrow! #business |
Here are the latest stats on our performance. #data | Check out these interesting stats! Let’s dive in together. #data |
Developing new tech solutions to drive innovation. #technology | Thrilled to be part of developing innovative tech solutions! #technology |
Results show a significant increase in productivity. #results | The results are in and they look great! #results |
Stay focused and achieve your goals. #motivation | You’ve got this! Keep pushing towards your goals. #motivation |
Join us for a webinar on the latest trends in AI. #webinar | Can’t wait for the webinar on the latest AI trends! Hope to see you there. #webinar |
Analyze these figures for a clearer picture. #analysis | Let’s dive into these figures for a better understanding. #analysis |
Implement these strategies to enhance your skills. #development | These strategies can really help you grow! #development |
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alowibdi, J.S. Gender Prediction of Generated Tweets Using Generative AI. Information 2024, 15, 452. https://doi.org/10.3390/info15080452