1. Introduction
Community-based Question-Answering (cQA) platforms such as Yahoo! Answers (shut down in 2021; an archived version can be found at https://archive.fo/agAtD), Reddit (https://www.reddit.com/), Stack Exchange (https://stackexchange.com/), and Quora (https://www.quora.com/) are visited by millions of users worldwide on a daily basis. These users browse and ask complex, subjective, and/or context-dependent questions. A key factor in the success of these platforms is that community fellows provide timely responses tailored to specific real-world situations, answers that are unlikely to be discovered by querying web search engines. Hence, quickly matching community peers that are suitable and willing to help is crucial to sustaining the vibrancy of this sort of network.
For the sake of establishing constructive connections, cQA systems exploit frontier information technologies to harness the collective intelligence of the whole online social platform [
1,
2]. In so doing, demographic profiling of community members has been shown to be instrumental. For instance, automatically identifying geographic locations is vital to reduce delays in obtaining the first acceptable answers, genders to reward users, and generational cohorts to receive suitable answers [
3,
4,
5,
6,
7,
8,
9]. In achieving this, recent advances in machine learning, especially deep neural networks, have played a pivotal role in enhancing user experience across cQA sites.
In fact, the emergence of pre-trained models (PTMs) such as BERT over the last five years has drastically changed the field by making it much easier to develop innovative and cost-efficient solutions for knotty problems. In this framework, a PTM is trained on massive corpora and adapted to a particular task afterwards. PTMs serve as a starting point for developing other machine learning solutions since they offer a set of initial weights and biases that can later be adjusted to a specific downstream task. In the realm of cQA, this approach has produced auspicious results in many tasks, including the automatic identification of demographic variables, namely age and gender [
10,
11].
In brief, this piece of research explores a novel criterion for selecting the terms used to calibrate these weights and biases, based on their semantic contribution instead of their global frequencies only. Simply put, we aim to determine which occurrences of a particular term should be considered when both fine-tuning and classifying new contexts. We hypothesize that a word should sometimes be taken into account and sometimes not, even when it is a highly frequent term across the domain collection. To determine its semantic contribution to a specific context, we look at its depth in the corresponding lexicalized dependency tree. Therefore, by trimming dependency trees at different levels (depths), we can automatically generate several abstractions of a given text in consonance with its semantic granularity.
Hence, the novelty of this paper lies in quantifying the variation in the performance of a handful of state-of-the-art PTMs when systematically increasing the amount of detail within their textual inputs. With this in mind, our contributions to this body of knowledge are summarized as follows:
By systematically varying the granularity of input contexts when fine-tuning PTMs, we aim to find out the point at which it becomes detrimental or inefficient to add more detailed information.
Our test beds include two classification tasks (i.e., age and gender recognition) and three cQA collections of radically different nature and size, more specifically Reddit, Yahoo! Answers, and Stack Exchange. Furthermore, our empirical configurations additionally target four combinations of input signals, namely question titles, question bodies, answers, and self-descriptions.
By zooming into our experimental results, we determine the types of syntactic and semantic information that contribute only marginally to enhancing performance.
We also analyze whether our findings are consistent across both downstream tasks and whether their respective best models are strongly correlated.
In short, our empirical figures support the use of cost-efficient technologies that exploit distilled frontier deep architectures and coarse-grained semantic information (i.e., terms contained in the first three levels of the respective lexicalized dependency tree). The roadmap of this work is as follows. First,
Section 2 fleshes out the works of our forerunners; subsequently,
Section 3 delineates our research questions. Next,
Section 4 and
Section 5 deal at length with our methodology and experimental results, respectively. Eventually,
Section 6 and
Section 7 set forth our key findings and the limitations of our approach, and outline some future research directions.
4. Methodology
In the first place, ten different linguistically motivated textual abstractions were produced for each community peer by processing his/her questions, answers/comments, and short bio. For this purpose, we took into account the level of granularity (amount of detail) carried by each word across all these texts. In the same spirit as [
47,
48] (see also [
49]), we profited from Natural Language Processing (NLP) to generate these abstractions, specifically from lexicalized dependency trees computed by CoreNLP (
https://corenlp.run/). Essentially, the level of granularity of a term in a sentence corresponds to its depth in its respective lexicalized tree, and thus the level of abstraction of a given text can be controlled by trimming dependency trees at different depths (see illustrative samples in
Figure 1). It is worth underlining here that word depth is defined as the number of edges that must be traversed to reach that particular term from the root node. In dependency trees, words in the topmost levels bear the largest amount of semantics, whereas terms at the nethermost levels are more likely to provide the details.
Figure 2 serves as an illustrative example of this transformation according to four different depths for a given user. Lastly, note that stop-words and punctuation were also removed from the processed corpora, but we kept both in
Figure 2 for the sake of readability and clarity. In our empirical settings, we considered ten distinct abstraction levels by pruning dependency trees at depths from zero (i.e., root nodes only) to nine.
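For illustration, the snippet below shows one possible way of producing such depth-based abstractions. It is a simplified sketch, not our exact pipeline: it uses the stanza library as a stand-in for the CoreNLP parser named above, the function names are ours, and stop-word/punctuation removal is omitted for brevity.

```python
# Minimal sketch: compute each word's depth in its dependency tree and keep only
# the words whose depth does not exceed a chosen threshold (the abstraction level).
import stanza

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")

def word_depth(word, sentence):
    """Number of edges from the root to this word (the root itself has depth 0)."""
    depth = 0
    current = word
    while current.head != 0:                        # head == 0 marks the root in stanza
        current = sentence.words[current.head - 1]  # heads are 1-indexed
        depth += 1
    return depth

def trim_text(text, max_depth):
    """Return the abstraction of `text` obtained by pruning its dependency trees at `max_depth`."""
    doc = nlp(text)
    kept = []
    for sentence in doc.sentences:
        for word in sentence.words:
            if word_depth(word, sentence) <= max_depth:
                kept.append(word.text)
    return " ".join(kept)

# Depth 0 keeps only root nodes; depth 9 keeps nearly everything.
print(trim_text("I bought a cheap laptop that constantly overheats in summer.", max_depth=2))
```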
Secondly, as a way of assessing the appropriateness of each of these abstractions for classifying age and gender, we utilized six well-known frontier pre-trained neural networks. The first one is the widely used BERT (Bidirectional Encoder Representations from Transformers) (cf. [
33,
34]). This architecture encompasses a multi-layer bidirectional transformer trained on clean plain text for masked word and next-sentence prediction [
31,
34]. It builds upon the old principle that words are, to a great extent, defined by other terms within the same context [
50]. BERT is composed of twelve transformer blocks and twelve self-attention heads with a hidden size of 768. Our experimental configurations made allowances for five additional state-of-the-art BERT-inspired architectures: ALBERT (A Lite BERT), DistilBERT, DistilRoBERTa, ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), and RoBERTa [
51,
52,
53,
Take ELECTRA, for instance, which trains a discriminator (a transformer) that determines whether each token is an original or a replacement, instead of only masking a fraction of the input tokens [
53]. A generator, another neural network, masks and substitutes tokens to generate corrupted samples. This architecture is more efficient than BERT since it demands significantly less computation while achieving competitive accuracy on several downstream tasks.
Thirdly, these six PTMs were adjusted to two different downstream tasks, i.e., age and gender recognition. Roughly speaking, the main research question of this work is determining the point (abstraction level) at which it is better to rely on semantic contributions than on global term frequencies computed from the target collection when fine-tuning. It is worth recalling here that it is standard practice to preserve only terms whose frequency exceeds a minimum threshold during model adjustment. By and large, and also in this work, this parameter is set to five. This entails that new model fits fall back on inferences embodied in the pre-trained weights every time an unseen or a discarded low-frequency word shows up in new contexts. Put differently, we hypothesize that global frequency is just one important factor. We conjecture that another key aspect is the amount of semantics that each term carries within the context being analyzed. That is to say, some words in some contexts are irrelevant or even detrimental to classification, even though their meanings are correctly deduced because they are highly recurrent across the target task domain. This might happen because the most prominent meanings of some very common terms have nothing to do with their particular usage within specific contexts, and thus these words might introduce noise when categorizing. For this reason, we claim that words must also be selected as features according to their semantic contribution to the particular context instead of relying only on a global frequency threshold. To put it another way, we keep the traditional minimum frequency threshold, but we additionally remove words from each context according to their semantic contribution to that specific text. In contrast to the conventional threshold, our filter is context-dependent.
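A schematic view of this two-stage filtering criterion is sketched below. It assumes each context has already been parsed into (word, depth) pairs via a dependency parser as described above; all names are illustrative rather than taken from our actual code.

```python
# Illustrative only: combine the conventional global frequency threshold (>= 5 occurrences
# across the collection) with the proposed context-dependent depth filter.
from collections import Counter

MIN_GLOBAL_FREQ = 5   # standard corpus-level cut-off used in this work

def build_vocabulary(tagged_corpus):
    """Global frequencies are computed over the raw (untrimmed) collection."""
    return Counter(word for context in tagged_corpus for word, _ in context)

def filter_context(tagged_context, vocab, max_depth):
    """Keep a word only if it is globally frequent AND semantically prominent in this context."""
    return [
        word
        for word, depth in tagged_context
        if vocab[word] >= MIN_GLOBAL_FREQ and depth <= max_depth
    ]
```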
Lastly, with regard to fine-tuning, we capitalized on the implementations provided by Hugging Face (
https://huggingface.co/). By and large, we chose default parameter settings to level the playing field and reduce the experimental workload. At all times, two epochs were set during model adjustment, and hence the fine-tuning time was restricted to five days in the case of the most computationally demanding models. It is worth noting here that going beyond one epoch did not bring any significant extra refinement, but we intentionally gave all encoders enough time to converge. The maximum sequence length was set to 512. As for the batch size, it was set to eight so that GPU memory usage reached its limit, which always allowed convergence. In our experiments, we used sixteen NVIDIA A16 (16 GB) Tesla GPU cards. On a final note, it is worth pointing out that we employed the half-precision (fp16) format when working with all models but ELECTRA.
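The following is a minimal fine-tuning sketch mirroring these settings (two epochs, 512-token inputs, batch size eight, fp16); it is an illustration rather than our exact scripts, and the dataset variables (train_texts, train_labels, eval_texts, eval_labels) as well as the chosen checkpoint are placeholders.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilroberta-base"   # any of the six encoders can be plugged in here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def tokenize(batch):
    # Truncate each abstraction to the 512-token limit reported above.
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_ds = Dataset.from_dict({"text": train_texts, "label": train_labels}).map(tokenize, batched=True)
eval_ds = Dataset.from_dict({"text": eval_texts, "label": eval_labels}).map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=2,              # two epochs, as in all our runs
    per_device_train_batch_size=8,   # batch size chosen to saturate GPU memory
    fp16=True,                       # half precision (disabled for ELECTRA)
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=eval_ds, tokenizer=tokenizer)
trainer.train()
```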
5. Experiments
This study made allowances for three fundamentally distinct collections, which belong to three cQA websites that serve different purposes and demographics, i.e., Reddit, Stack Exchange, and Yahoo! Answers. Accordingly, these corpora are described below (see
Table 1):
Yahoo! Answers (from now on referred to as YA) is a collection composed of 53 million question–answer pages in addition to twelve million member profiles. This corpus was gathered by [
6,
18,
55] in an effort to perform a gender and age analysis on this cQA site. These profiles contain the respective questions, answers, nicknames, and short bios. Following previous studies [
10,
11,
17,
24,
56], a total of 548,375 community peers in this collection were identified with both their age and gender.
Stack Exchange (SE, for short) regularly publishes its data dumps on the Internet for public use (
https://archive.org/download/stackexchange/). More concretely, we profited from the version published on 6 September 2021. Each record includes the corresponding questions, answers (called comments), nicknames, and self-descriptions. As a means of tagging members with their age and gender, we employed a strategy similar to previous works by taking into account all its 173 sub-communities (cf. [
6,
24,
56]). On the whole, we obtained 525 profiles as a working corpus. It is worth noting here that age could be automatically recognized for only about eight thousand members.
Reddit makes its repository accessible to everyone via Project Arctic Shift (
https://github.com/ArthurHeitmann/arctic_shift?tab=readme-ov-file, accessed on 9 July 2025). Since decompressing this collection requires “seemingly infinite disk space”, we capitalized on the dumps offered by Pushshift (
https://files.pushshift.io/reddit/, accessed on 9 July 2025), which encompass more than 1.3 billion questions (called submissions) posted before July 2021 and almost 350 billion answers (referred to as comments) [
57]. Likewise, we identified age and gender across a random subset of this dataset by means of the same automatic processing utilized for YA and SE. Overall, this resulted in 68,717 profiles labeled with both demographic variables. Unlike both previous repositories, these records provide submissions, comments, and aliases, but not short bios.
Table 1.
Datasets descriptions.
| Dataset | No. Samples | Male/Female | Gen Z/Gen Y/Older |
|---|---|---|---|
| YA | 548,375 | 37.33%/62.67% | 49.07%/41.87%/9.05% |
| Reddit | 68,717 | 60.74%/39.26% | 40.84%/45.48%/13.68% |
| SE | 525 | 80.57%/19.43% | 40.57%/32.38%/27.05% |
It is worth underlining here that we chose these three cQA datasets not only because of their inherent differences but also to carry out experiments on collections of strikingly different sizes. For all empirical purposes, the working datasets were randomly divided into training (60%), evaluation (20%), and testing (20%) subsets. Accordingly, held-out evaluations were carried out by keeping these three splits fixed, and the test material was used only to yield unbiased assessments of the final model fits obtained on the training/evaluation folds. With regard to metrics, accuracy was employed in assessing gender (binary) models, whereas the Macro F1-Score was employed in evaluating age, since this is a three-generation task covering Gen Z, Gen Y, and older peers (cf. [
11,
18]).
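For reference, a minimal sketch of these splitting and scoring steps with scikit-learn is shown below; profiles and the y_* variables are hypothetical placeholders, and this is not our exact evaluation script.

```python
# Fixed 60/20/20 splits; accuracy for the binary gender task, Macro F1 for the
# three-class age task (Gen Z / Gen Y / older peers).
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

train, rest = train_test_split(profiles, test_size=0.4, random_state=42)
evaluation, test = train_test_split(rest, test_size=0.5, random_state=42)

gender_score = accuracy_score(y_true_gender, y_pred_gender)      # two classes
age_score = f1_score(y_true_age, y_pred_age, average="macro")    # three generations
```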
Four empirical scenarios were considered in order to study the impact of abstraction levels (aside from analyzing both demographic factors independently). These scenarios are signaled by the following abbreviations (a sketch of how these inputs can be assembled follows the list):
T (question/submission titles only).
TB (questions/submission titles plus their bodies).
TBA (full questions/submissions coupled with answers/comments).
TBAD (full questions/submissions, answers/comments, and short bios).
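The snippet below illustrates one possible way of assembling these four configurations per community member; the field names (title, body, answers, bio) are hypothetical placeholders rather than the actual schema of our corpora.

```python
# Concatenate the available signals according to the chosen scenario.
def build_input(user, scenario):
    parts = [user["title"]]                       # T
    if scenario in ("TB", "TBA", "TBAD"):
        parts.append(user["body"])                # + question/submission body
    if scenario in ("TBA", "TBAD"):
        parts.extend(user["answers"])             # + answers/comments
    if scenario == "TBAD":
        parts.append(user.get("bio", ""))         # + short bio (unavailable on Reddit)
    return " ".join(p for p in parts if p)
```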
Table 2,
Table 3,
Table 4,
Table 5 and
Table 6 highlight the outcomes accomplished by different combinations of encoders and configurations for age and gender identification, respectively. A bird’s eye view of the results points towards the substantial impact of collection size on the classification rate. In short, our figures show that the larger, the better. Broadly speaking, there was no overwhelmingly dominant encoder in the case of SE (see
Table 4), while it is reasonably clear that DistilRoBERTa outclassed its rivals on YA and Reddit. Further, the results seem somewhat random when it comes to SE. For instance, both T and TBAD appear to have a competitive model, but on average, T outperforms TBAD by 11.40%. These dissonant figures signify ineffective learning, which confirms some recent findings about the poor performance of fine-tuned transformers when adjusted to downstream tasks via datasets distilled from “non-standard”, especially small, collections [
58,
59,
60]. We recall here that pre-trained models are built largely upon clean corpora including books, Wikipedia, and news articles (cf. [
10]), whereas Stack Overflow, a programming community inside SE, takes the lion’s share of the SE corpus. This entails that this material is markedly biased towards texts like coding snippets, which can hardly be found across clean “standard” training corpora. Overall, our figures point towards at least ca. 70,000 samples as the desirable amount to achieve good prediction rates for both tasks, especially via DistilRoBERTa. On a side note, the results for gender prediction on SE were not reported here since differences in classification rates across distinct empirical settings were negligible due to an additional third factor: class imbalance. To be more concrete, SE is a community strongly biased towards men [
7], and in our collection, only 19.43% of the instances belong to the female category. Hence, for the sake of reliability, from now on, we perform an in-depth analysis of the results obtained on YA and Reddit only.
Another general conclusion regards self-descriptions. Their contribution was shown to be unpredictable. Although they turned out to be slightly detrimental to the best models, other competitive configurations were marginally enhanced by incorporating these training signals. This unpredictability might be due to their sparseness, since only about 7% of the profiles include bios [
10]. This suggests that these profile descriptions might be discarded without significantly compromising performance. On the flip side, this puts forward the idea that it is still plausible to obtain better classification rates by exploiting additional (probably multi-modal) training sources, such as images and activity patterns. This is particularly insightful for platforms like Reddit, where short bios are unavailable.
5.1. Age Prediction
Table 2 and
Table 3 display the figures obtained for identifying age. In light of these outcomes, we can draw the following conclusions:
Essentially, taking into account terms embodied deeper than the third level resulted in relatively small improvements. This means that most of the semantic cues necessary for guessing age can be found at the highest levels of the dependency trees. Although there are some refinements up to level seven in most cases, performance tends to converge asymptotically.
Interestingly enough, despite the overall supremacy of DistilRoBERTa, RoBERTa finished with the best model every time the Reddit collection was targeted, while in the case of YA, this happened in only two out of the four configurations. Some key facts about the relationship between these two models are the following: (a) the former was distilled from the base version of the latter; (b) DistilRoBERTa follows the same training procedure as DistilBERT [
52,
61]; (c) DistilRoBERTa consists of 6 layers, 768 hidden dimensions, and 12 heads, totaling 82 million parameters (compared to the 125 million parameters of RoBERTa-base); and (d) DistilRoBERTa is, on average, twice as fast as RoBERTa-base.
It is worth recalling here that knowledge distillation is a compression technique in which a compact model (the student) is built to reproduce the behavior of a larger model (the teacher) via a loss function over the soft target probabilities of the bigger model [
62]. Note that empirical results show that distillation retains ca. 97% of the language understanding capabilities of the larger model while keeping about 60% of its parameters. Analogously, trimming lexicalized dependency trees reduces the size of the training vocabulary and contexts; alternatively, it allows fitting more relevant information into the same context window while achieving competitive classification rates.
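To make the idea concrete, the snippet below shows a generic soft-target distillation loss in PyTorch; it is a textbook formulation, not DistilRoBERTa’s actual training code (which additionally combines masked language modeling and cosine embedding losses).

```python
# Generic knowledge-distillation loss: the student is pushed towards the teacher's
# softened output distribution while also fitting the hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradients for the soft term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```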
The previous finding entails that not every word, and not every instance of a particular term, contributes significantly to deducing or refining its meanings, or the distribution over its potential usages, during fine-tuning. Chiefly, words of medium-to-high frequency do not need those occurrences that contribute little to their corresponding contexts (i.e., occurrences deep in the dependency tree) in order to yield a clearer picture of their own sets of meanings.
To be more concrete, the reduction in vocabulary size from depth nine to depth three is around 88–89% for the TBAD (YA) and TBA (Reddit) models. Despite this reduction, most of the performance is preserved (see
Table 2 and
Table 3).
Figure 3 and
Figure 4 depict the largest decrements in frequency within the twenty thousand most recurrent terms of the YA and Reddit corpora, respectively. Interestingly enough, relative pronouns such as “whose”, “whom”, and “which” suffered significant frequency decreases, together with expressions such as “including” and “such as” that denote parts of a whole already mentioned in the context. Furthermore, we also found variants of abbreviations like “etc.”, “i.e.”, and “e.g.”, which all signal that very specific information has been given or is about to be mentioned. All in all, these terms do not need these instances to specify their meanings, as they merely serve a syntactic function at this level.
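The analysis behind these figures can be approximated as follows; the corpus variables are placeholders, and this sketch simply ranks the most recurrent terms by how much of their frequency disappears after trimming at depth three.

```python
# Rank the 20,000 most frequent terms by the drop in frequency between the
# full-depth corpus and its depth-3 abstraction.
from collections import Counter

full_counts = Counter(w for doc in full_depth_corpus for w in doc.split())
trimmed_counts = Counter(w for doc in depth3_corpus for w in doc.split())

top_terms = [w for w, _ in full_counts.most_common(20_000)]
drops = sorted(
    ((w, full_counts[w] - trimmed_counts[w]) for w in top_terms),
    key=lambda item: item[1],
    reverse=True,
)
print(drops[:20])   # e.g., relative pronouns and enumerating expressions rank high
```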
In order to have a more concrete idea of the semantics conveyed by these prepositional/relative clauses,
Figure 5 and
Figure 6 highlight the word clouds generated from the sharpest one thousand drops in frequency within the twenty thousand most recurrent terms for YA and Reddit, respectively. In both cases, locations (e.g., cardinal points) and the Internet obtained the lion’s share. Another common topic involves intimacy, relationships, and emotions (e.g., “abusive”, “adult”, “divorce”, “happiness”, “marriage”, “separate”, “feelings”, “sexual”, and “violent”). As for the Internet, we discovered words such as “traffic”, “web”, “security”, and “network”. Apart from these topics, we can also identify terms related to finance, namely “dollars”, “funds”, “tax”, “federal”, and “interests”. Basically, any of these subjects is very likely to be the main talking point of several submissions. In fact, a plethora of community fellows share these sorts of interests. Hence, it is not far-fetched to think that these “deeper” instances do not greatly contribute during fine-tuning to adjusting their meanings or the distribution of their meanings. Furthermore, our outcomes suggest that these “deeper” occurrences are also less helpful at testing time, as a result of users over-elaborating their contexts or of topic redundancy across all their posts.
In summary, our qualitative results point towards cost-efficient models that make inferences by looking at the most abstract semantic information within sentences (the first three levels). Our results also reveal that distilled encoders play a pivotal role in building this type of solution. However, for applications that need to go the extra mile, that is to say, where any small gain in classification rate is vital, fine-tuning undistilled models on terms embodied in the first seven levels seems to be a better recipe for success. Instead of accounting for fine-grained details, our figures suggest that additional semantically coarse-grained samples are more likely to bring about significant further growth in the classification rate.
In quantitative terms, BERT finished with the best model most of the time when operating solely on question titles, especially for the YA collection. To be more precise, BERT achieved Macro F1-Scores of 0.5965 and 0.4619 for YA and Reddit, respectively. Curiously enough, terms embodied in the deepest levels of the dependency trees did not bring about improvements in the classification rate. One reason for this is that question titles seldom over-elaborate, and therefore their trees are typically not that deep, in contrast to answers and question bodies, which are normally long-winded.
In the case of responses and question contents, the best Macro F1-Scores achieved were 0.7325 (YA) and 0.6156 (Reddit). These outcomes support the conclusion that a wider variety of contexts is more instrumental than fine-grained details, since adding more contexts (i.e., bodies and answers) brings about more significant improvements. Note that, for YA, the classification rate grew by 15.43% (0.0874 Macro F1 points) when accounting for question bodies, and on top of that, by 6.93% (0.0474 Macro F1 points) when also considering answers, amounting to a total increase of 22.80% (0.136 Macro F1 points) with respect to the best title-only model. Similarly, for Reddit, the classification rate rose by 19.74% (0.0912 Macro F1 points) when making allowances for question content, and on top of that, by 11.30% (0.0625 Macro F1 points) when considering answers too, for an overall increase of 33.28% (0.1537 Macro F1 points) with respect to the best title-only model. Consequently, we can draw the conclusion that a wider diversity of coarse-grained contexts is preferable to more detailed information.
As aforementioned, the best models for YA and Reddit accomplished Macro F1-Scores of 0.7325 and 0.6156, respectively.
Figure 7 shows the corresponding confusion matrices and ROC (Receiver Operating Characteristic) curves for both approaches. In line with previous studies [
6,
11,
17,
18], errors were shown to be more prominent between contiguous generations: older peers–Gen Y and Gen Y–Gen Z. In terms of AUC (Area Under the Curve), very similar performance was obtained for every class in the case of Reddit, whereas for YA a markedly better score was accomplished for older peers and a considerably worse one for Gen Y.
To sum this up, our quantitative results indicate that fewer details, and thus more coarse-grained semantics, together with a larger diversity of contexts, enhance age recognition.
6. Limitations and Future Research
Aside from the two previously discussed considerations regarding the objects of this study, namely the sparseness or lack of self-descriptions and the insufficiency of the SE corpus, there are a few extra aspects that we need to emphasize.
First off, the strong bias towards coding, and especially towards male programmers, inherent in the SE collection made it impossible to effectively fine-tune models. One way of tackling this head-on might be devising new automatic approaches (heuristics) to label the age and gender of a larger fraction of its members. However, SE fellows are very unlikely to explicitly state their age when interacting on the website; providing this kind of information is unusual when discussing coding-related topics. It is much more common to state one's years of experience in the field instead. Nonetheless, it is conceivable that this experience might also be exploited to roughly estimate age cohorts in future works.
Secondly, the computed dependency trees are subject to errors since we capitalized on the standard CoreNLP models for dependency parsing. These tools are built upon cleaner corpora, and thus errors are expected when employed on user-generated content. In the same vein, these parsing models are designed to cope with English, and in the case of cQA texts, code-switching can happen in a single post or sentence, especially across language-related topics. This alternation between two or more languages affects the output of the parser, and many times, it also goes unnoticed by language detectors.
Thirdly, our limited computational resources prevented us from testing larger transformers. In addition, these cutting-edge architectures could also be pre-trained on large amounts of user-generated cQA text if the necessary infrastructure were available. In so doing, one would be in a position to infer more refined meanings for frequent community jargon, spellings, aliases, entities, and acronyms, for instance.
Additionally, our results suggest that pre-training title-only models would be beneficial for several reasons: (a) their grammar is sharply different from what we can find across question bodies and answers; (b) they encompass smaller succinct contexts that can be readily contained within traditional pre-training token windows; and (c) one could conjecture that the semantic range of words is comparatively limited since their usage is restricted to the syntax of relatively short questions. Hence, this leads us to believe that pre-training models solely on a massive amount of titles could aid in producing cost-efficient approaches to some downstream tasks.
Lastly, we also envision that exploiting multi-lingual encoders and texts written in different languages can assist in enhancing the classification rate on both tasks, especially across community peers linked to very few questions/answers published in English.