Article

Identifying Themes in Social Media Discussions of Eating Disorders: A Quantitative Analysis of How Meaningful Guidance and Examples Improve LLM Classification

1 Computer Science, College of Engineering & Applied Science, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, USA
2 Health Informatics, Joseph J. Zilber College of Public Health, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, USA
3 Information Technology Management, Lubar College of Business, University of Wisconsin-Milwaukee, Milwaukee, WI 53211, USA
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(3), 40; https://doi.org/10.3390/biomedinformatics5030040
Submission received: 14 June 2025 / Revised: 7 July 2025 / Accepted: 10 July 2025 / Published: 11 July 2025

Abstract

Background: Social media represents a unique opportunity to investigate the perspectives of people with eating disorders at scale. One forum alone, r/EatingDisorders, now has 113,000 members worldwide. In less than a day, where a manual analysis might sample a few dozen items, automatic classification using large language models (LLMs) can analyze thousands of posts. Methods: Here, we compare multiple strategies for invoking an LLM, including ones that include examples (few-shot) and annotation guidelines, to classify eating disorder content across 14 predefined themes using Llama3.1:8b on 6850 social media posts. In addition to standard metrics, we calculate four novel dimensions of classification quality: a Category Divergence Index, confidence scores (overall model certainty), focus scores (a measure of decisiveness for selected subsets of themes), and dominance scores (primary theme identification strength). Results: By every measure, invoking an LLM without extensive guidance and examples (zero-shot) is insufficient. Zero-shot classification assigned far more categories per post (a mean of 7.17 versus 3.17), whereas few-shot yielded higher mean confidence (0.42 versus 0.27) and higher mean dominance (0.81 versus 0.46). Overall, a few-shot approach improved quality measures across nearly 90% of predictions. Conclusions: These findings suggest that LLMs, if invoked with expert instructions and helpful examples, can provide instantaneous high-quality annotation, enabling automated mental health content moderation systems or future clinical research.

1. Introduction

Eating disorders, historically viewed as primarily affecting Western societies, are recognized today as a global issue despite challenges in accurately quantifying their prevalence worldwide. Nationally representative data remains sparse, while difficulties in collecting broader data include stigma and changes in diagnostic criteria over time. The most recent Global Burden of Disease study estimated approximately 13.9 million cases of Anorexia or Bulimia globally in 2019. Additionally, the study highlighted an overlooked prevalence of 41.9 million cases of Other Specified Feeding and Eating Disorders (OSFEDs) and binge eating disorder, indicating a total global prevalence of approximately 0.7%. However, due to underreporting and many cases not seeking formal healthcare services, the true prevalence is believed to be significantly higher than these estimates [1].
In this paper, we consider a retrospective analysis of text written by people participating in online social media forums that specialize in helping people affected by eating disorders. For researchers, analyzing social media text offers a unique opportunity to draw insights from people in their natural environment, interacting with peers in relative anonymity, potentially freeing them to expose more of their complex, true selves. While people might use social media for healing, however, they can also use it to reinforce false or unhealthy ideas. Even in forums dedicated to recovery, one finds text that spans multiple complex, interconnected themes—a single statement might simultaneously address body image concerns, negative emotions, and disordered behaviors. This complexity is not surprising. While body image plays a pivotal role in the development and exacerbation of eating disorders, it encompasses complex psychological aspects beyond mere physical appearance, involving perceptions of oneself through a lens of dysmorphia and a pervasive desire to “fix” perceived flaws. Recent reports indicate alarming trends among young people, with a substantial percentage expressing dissatisfaction with their bodies and engaging in disordered eating behaviors by early adulthood [1].
Social media has emerged as a significant factor influencing body image and eating behaviors among youth. The ease of access to content promoting harmful eating disorder behaviors, coupled with algorithms that personalize and amplify such content, has exacerbated these issues. Trends promoting extreme fitness regimes or unrealistic body ideals further contribute to the normalization of unhealthy behaviors [1]. While strides have been made in understanding and addressing eating disorders globally, significant challenges remain in accurately capturing their full scope and impact. Continued research and awareness efforts are crucial to better inform interventions and support systems for those affected by these complex disorders, particularly among vulnerable populations such as adolescents and young adults [1].
Because eating disorders (EDs) often lead individuals to be guarded about their experiences, many may not readily discuss their disorder, which can hinder researchers and clinicians from fully grasping the factors contributing to ED symptoms. However, a significant number of individuals with EDs utilize social media platforms to engage in candid discussions about their experiences with others who share similar challenges [2]. These discussions can reveal unbiased insights into the thoughts, emotions, and behaviors of individuals affected by EDs. This information is vital for identifying suitable treatment goals and designing interventions that are more likely to be effective [2].
Recent advances in large language models (LLMs) have shown significant promise for natural language processing applications, particularly in health-related text classification tasks. Prior research has systematically evaluated LLM performance across multiple social media health classification tasks, comparing traditional machine learning approaches with modern transformer-based models and LLM-based methodologies [3]. The LLM methodologies may explore the impacts of providing examples (zero-shot versus few-shot) as part of the instructions (known as the “prompt”), retraining the model with many labeled examples (known as “fine-tuning”), or varying the provided instructions (e.g., to prevent errors observed during early testing).
Guo et al. [3] examined three distinct LLM utilization strategies: direct zero-shot classification, LLM-assisted data annotation, and few-shot data augmentation approaches. The findings revealed nuanced performance patterns across these methodologies. While LLM-annotated training data alone proved insufficient for supervised model training, zero-shot LLM classifiers demonstrated superior performance compared to traditional support vector machines and achieved higher recall rates than advanced transformer models like RoBERTa [3]. Notably, data augmentation strategies showed model-dependent effectiveness, with GPT-4 augmentation improving performance, while GPT-3.5 augmentation potentially degraded model quality.
This study assesses the effectiveness of using automated methods for thematic content analysis using generative large language models that have been provided with a human-understandable annotation guideline. To provide a thorough analysis of the differences between alternative prompting approaches, we introduce several new metrics, specific to classification tasks with multiple, potentially overlapping categories, to go beyond simple accuracy. These metrics capture how much the two strategies vary in the sets of themes they identify, the degree of confidence for individual themes within the selected sets, and their choice of the most prominent category for each post. The outcomes include an optimal strategy for describing the coding task as a “prompt”. We also show the quantitative transformation in how well models organize conceptual information when provided with examples, along with specific guidance for classification, especially borderline cases. We compare zero-shot and few-shot classification across multiple dimensions, including category breadth, confidence distribution, and conceptual focus. The new evaluation measures, Category Divergence Index (CDI), Top Category Confidence, Focus Score, and Dominance Ratio, are described below, with specific calculations provided in the Methods.
  • Category Divergence Index (CDI) measures the degree of disagreement between zero-shot and few-shot approaches when classifying the same text content into thematic categories.
  • Top Category Confidence is the estimated probability of the most prominent category for each post, indicating how certain the model is about its primary classification choice.
  • Focus Score is an entropy-based measure that captures how concentrated or scattered the confidence scores are across all assigned categories. Values closer to zero indicate more concentrated confidence in fewer categories.
  • Dominance Ratio compares the strength of the primary category against all secondary categories, calculated as the ratio between the top category’s confidence and the combined confidence of remaining categories.
By developing and applying a multi-metric evaluation framework, we quantify the comparative effectiveness of using large language models for text classification tasks with and without examples that have been manually coded by experts.

2. Materials and Methods

2.1. Reddit Data Collection

This study utilized content from r/EatingDisorders, a subreddit community on Reddit created in 2008, with an estimated 113,000 members. This community describes itself as a recovery-oriented support network for individuals affected by eating disorders, whether personally experiencing these conditions, supporting loved ones, or seeking information. Using the Python Reddit API Wrapper (PRAW), version 7.8.1, we extracted 6950 posts and associated comments from the subreddit (100 to be manually labeled and the rest for use in the main part of the study). PRAW facilitates programmatic access to Reddit’s API, allowing for systematic data collection while adhering to the platform’s usage policies. The collection procedure focused on content sorted by Reddit’s “Hot” algorithm, which prioritizes recently active posts receiving substantial engagement through upvotes and comments.
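As a minimal sketch of this collection step (the function and record names are our own, not from the study's code): a real run would pass submissions from `praw.Reddit(...).subreddit("EatingDisorders").hot(...)`, after calling `submission.comments.replace_more(limit=0)` to expand the comment trees.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Record:
    kind: str    # "post" or "comment"
    author: str  # used only for aggregate counts, then dropped
    text: str

def flatten(submissions: Iterable) -> List[Record]:
    # Each item only needs .author, .title, .selftext, and an iterable
    # .comments whose items have .author and .body, so real PRAW
    # submissions and simple test stubs both work (duck typing).
    records = []
    for sub in submissions:
        records.append(Record("post", str(sub.author),
                              f"{sub.title}\n{sub.selftext}"))
        for c in sub.comments:
            records.append(Record("comment", str(c.author), c.body))
    return records
```

Duck typing here also makes the collection step easy to test offline, without hitting Reddit's API.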
The complete dataset of 6950 Reddit posts and comments was collected over a 25-month period from April 2023 to May 2025. The dataset represents content from 4831 unique users, including 3557 unique commenters and 2039 unique post authors. The temporal distribution shows 3263 unique posts across the study period: 872 posts in 2023, 1485 posts in 2024, and 906 posts in 2025, with an average of 278 comments per month.
Geographic analysis using spaCy natural language processing identified location references within the text: North America (273 mentions), Europe (128 mentions), Asia (15 mentions), and Australia (15 mentions). These represent textual mentions rather than verified user locations, as Reddit’s API does not provide user geographic data.
All data collection procedures were designed to respect user privacy and adhere to Reddit’s terms of service. While the subreddit is publicly accessible, steps were taken to protect user privacy during analysis. Usernames were used solely for calculating aggregate statistics (such as unique user counts) and were subsequently removed from the analytical dataset, with all results reported in aggregate form.
Prior to analysis, the collected posts underwent preprocessing to remove irrelevant content such as automated moderator comments, and deleted posts. Technical artifacts—elements that are not part of the natural discourse but result from the digital platform’s infrastructure—were removed to ensure analytical clarity. These artifacts included the following:
  • Markdown formatting symbols (e.g., asterisks, hashtags, and backticks);
  • HTML entities and escape sequences;
  • Special characters and encoding inconsistencies;
  • Extraneous whitespace and line break patterns.
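The artifact-removal steps above can be sketched with standard-library regular expressions; the study's exact preprocessing rules are not published, so these patterns are illustrative only.

```python
import re

def clean_text(raw: str) -> str:
    # Illustrative patterns for the four artifact classes listed above.
    text = re.sub(r"&(amp|lt|gt|nbsp|#\d+);", " ", raw)      # HTML entities
    text = re.sub(r"[*#`]+", "", text)                        # markdown symbols
    text = text.replace("\u200b", "").replace("\ufeff", "")   # encoding debris
    text = re.sub(r"\s*\n\s*", "\n", text)                    # line-break runs
    text = re.sub(r"[ \t]{2,}", " ", text)                    # extra whitespace
    return text.strip()
```

For example, `clean_text("**hello**  #world")` yields `"hello world"`, preserving the wording while dropping the formatting symbols.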
The resulting dataset comprised textual content suitable for natural language processing techniques, preserving the authentic language used in eating disorder discussions while eliminating these digital infrastructure elements. The next step involved creating an annotation guideline that would be used both to hand code a sample of data (100 items) and to include as part of the prompt to a large language model for automatic annotation of an additional 6850 items.

2.2. Methods

2.2.1. Overview of Methods

Our study tests the use of large language models (LLMs) to annotate the dataset using zero-shot and few-shot learning techniques. The approach requires a concise description of specific annotation instructions (see Figure 1) and the criteria to use for determining whether a theme is in evidence (see Figure 2). We then iteratively developed prompting strategies to induce the LLM to produce the most accurate annotations (and suppress observed inclinations to draw unfounded inferences which may have been previously introduced during refinement of the foundation models). The final step is to perform a thorough analysis of the dataset and classification results based on the results of the annotation and classification metrics introduced above. We provide additional information about each of these subtasks below.

2.2.2. Developing Annotation Guidelines of Eating Disorder Themes

Following Zhou et al. (2020) [2], we took ten core themes that form the foundation for our annotation framework: Weight, Eating Disorder Symptoms and Behaviors, Food/Drink/Nutrition, Body Image, Social Media/Advertising/Portrayals, Mental Disorder, Negative Emotions, Negative Consequences, Recovery, and Treatment.
We also expanded upon Zhou et al.’s computationally derived themes [2] by incorporating four additional themes, Supplements, Negative Social Reactions, Relationships, and Advice/Reflection/Planning, based on clinical perspectives from the eating disorder literature [4,5,6].
Each theme was refined with specific keywords and behavioral indicators that align with diagnostic criteria and clinical observations. For instance, the “Eating Disorder Symptoms and Behaviors” theme encompasses both DSM-5 diagnostic behaviors (binge eating, purging, restriction) and associated physical symptoms (dizziness, fatigue, nausea), reflecting the biopsychosocial model of eating disorders. This approach allows the classifier to address that eating disorder discourse often involves interconnected themes—and provide multiple codes for a text that simultaneously includes mentions of body image concerns, negative emotions, and disordered behaviors.

2.2.3. Zero-Shot and Few-Shot Prompt

This study employed two distinct prompting strategies to investigate the effectiveness of large language models (LLMs) in classifying eating disorder-related content: zero-shot and few-shot prompting. The implementation of these approaches required carefully designed prompting templates that differ in style and informational content.
Zero-Shot Prompting Template: The zero-shot approach utilized a minimalist prompting structure that relied solely on the model’s pre-trained knowledge.
The Zero-Shot template (see Table 1) consists of the following:
  • Role Definition: A brief statement identifying the model as “an expert in classifying posts about eating disorders according to specific themes”.
  • Theme List: The 14 predefined theme categories presented as a numbered list.
  • Target Post: The post to be classified.
  • Output Instructions: Simple formatting requirements specifying score assignment (0.0–1.0).
The zero-shot prompt spanned approximately 300–500 words. This approach tested whether LLMs could perform theme classification based purely on their inherent understanding of eating disorder concepts.
Few-Shot Prompting Template with Guidelines: The few-shot approach implemented a more extensive prompting structure (see Table 2). This structure included a variation of the zero-shot prompt along with our definition for each theme and guidance for discriminating difficult cases and for avoiding inferences beyond what is explicit in the text:
  • Role Definition: Similar expert framing as zero-shot.
  • Annotation Guidelines: Adds full theme definitions with keywords and contextual information (~2000 words).
  • Additional Disambiguation Notes: Detailed guidance on distinguishing between overlapping themes (~2000 words).
  • Theme List: The same 14 categories with emphasis on exact meaning (as in zero-shot).
  • Example Demonstration: Adds up to 100 randomly selected annotated examples showing post-classification pairs.
  • Target Post: The post to be classified as zero-shot.
  • Enhanced Instructions: Original instructions with additional explicit rules against inferring implicit themes and encouraging emphasis on textual evidence.
This template could reach 20,000–25,000 words depending on the number of examples included, representing a substantial increase in context and guidance compared to the zero-shot approach. Table 3 summarizes the main differences between our zero-shot and few-shot prompting strategies, where the latter provides both examples and instructions to avoid common mistakes observed in either human or LLM-based annotations.
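The difference between the two templates is essentially which sections are populated. A minimal sketch of the assembly step (the section wording here paraphrases the paper's description of Tables 1 and 2, not the verbatim templates):

```python
THEMES = [
    "Weight", "Eating Disorder Symptoms and Behaviors", "Food/Drink/Nutrition",
    "Body Image", "Social Media/Advertising/Portrayals", "Mental Disorder",
    "Negative Emotions", "Negative Consequences", "Recovery", "Treatment",
    "Supplements", "Negative Social Reactions", "Relationships",
    "Advice/Reflection/Planning",
]

def build_prompt(post, guidelines="", notes="", examples=()):
    # With empty guidelines/notes/examples this reduces to the zero-shot
    # template; populating them yields the few-shot template.
    theme_list = "\n".join(f"{i}. {t}" for i, t in enumerate(THEMES, 1))
    example_block = "\n".join(
        f"Post: {text}\nThemes: {labels}" for text, labels in examples
    )
    parts = [
        "You are an expert in classifying posts about eating disorders "
        "according to specific themes.",
        guidelines,                      # ~2000-word theme definitions
        notes,                           # ~2000-word disambiguation notes
        f"Themes:\n{theme_list}",
        example_block,                   # up to 100 annotated examples
        f"Post to classify: {post}",
        "Assign each relevant theme a score from 0.0 to 1.0. "
        "Do not infer themes that are not explicit in the text.",
    ]
    return "\n\n".join(p for p in parts if p)
```

Sharing one assembly function for both conditions keeps the two prompts identical except for the added context, which is what the comparison requires.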
Our thematic analysis of 100 manually annotated Reddit posts and comments identified 13 of the 14 predefined themes, achieving 92.9% domain coverage of our framework. These 100 posts were randomly selected from our larger dataset of 6950 posts and served as the foundation for developing our annotation guidelines and training examples for the few-shot learning approach. Only “Supplements” was absent from this annotated subset, reflecting its specialized nature within eating disorder discourse.
Core Eating Disorder Manifestations (31.7%): This domain encompasses the fundamental behavioral and physical aspects of eating disorders through Eating Disorder Symptoms and Behaviors (13.8%), Weight (7.7%), Body Image (2.5%), and Food/Drink or Nutrition (7.7%). These themes collectively capture the most direct and observable features of eating disorders, representing the primary ways individuals experience and express their relationship with food, eating behaviors, body weight, and appearance-related aspects in daily life.
Support and Healthcare Systems (28.3%): This domain encompasses Advice/Reflection or Planning (22.5%) and Treatment (5.8%). The predominance of peer support activities demonstrates Reddit’s primary function as a community-driven support platform, while treatment discussions represent formal healthcare engagement, indicating users’ dual reliance on both professional and peer support systems.
Psychological, Physical and Emotional Dimensions (22%): This category includes Recovery (12.5%), Negative Emotions (4%), Negative Consequences (4.1%), and Mental Disorder (1.4%). Recovery discussions dominated this domain. Negative Consequences captures discussions of long-term physical and health impacts resulting from eating disorders, including medical complications, organ damage, and physical deterioration.
Social and Interpersonal Context (18%): This domain comprises Relationships (9.7%), Social media/Advertising or Portrayals (5.6%), and Negative Social Reactions (2.7%). This grouping demonstrates how eating disorders exist within broader social contexts, encompassing both supportive and harmful interpersonal dynamics alongside external cultural pressures from media representation and societal beauty standards.
Based on this analysis, the annotation guideline was created and labeled examples were selected to fill the template items for the condition “Few-shot prompting with guidelines.”

2.2.4. Analytical Methods

Large language models exhibit well-documented inconsistencies in response generation when subject to varying prompt structures [7]. To address this challenge, we introduce a novel measurement approach that quantifies the relationship between contextual richness (through the inclusion of annotated examples, detailed guidelines, and disambiguation instructions) and the model’s response reliability. This metric framework allows us to systematically compare prompting methodologies and demonstrate that comprehensive context provision leads to enhanced classification consistency while mitigating spurious theme assignments—a form of hallucination where models incorrectly identify multiple themes not explicitly supported by textual evidence. Below we specify the calculations for each metric.
  • Category Divergence Index (CDI)
The Category Divergence Index calculation uses Jaccard similarity to quantify classification inconsistency between approaches:
CDI = 1 − J(Z, F)
where J(Z, F) represents the Jaccard similarity coefficient between categories predicted by zero-shot (Z) and few-shot (F) approaches. The Jaccard similarity is calculated as
J(Z, F) = |Z ∩ F|/|Z ∪ F|
where |Z ∩ F| represents shared categories and |Z ∪ F| represents total unique categories. This transformation yields a divergence measure where 0 indicates perfect agreement and 1 indicates complete disagreement.
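The calculation above translates directly to code; the convention for two empty predictions (returning 0, i.e., agreement) is our own assumption, since the formula is undefined when the union is empty.

```python
def cdi(zero_shot: set, few_shot: set) -> float:
    # Category Divergence Index: 1 minus the Jaccard similarity of the
    # two predicted theme sets (0 = perfect agreement, 1 = disjoint).
    if not zero_shot and not few_shot:
        return 0.0  # assumption: two empty predictions count as agreement
    inter = len(zero_shot & few_shot)
    union = len(zero_shot | few_shot)
    return 1 - inter / union
```

For instance, zero-shot {Weight, Body Image, Negative Emotions} against few-shot {Weight, Body Image} shares 2 of 3 unique themes, giving CDI = 1 − 2/3 ≈ 0.33.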
  • Top Category Concordance Analysis
Following our investigation of overall category divergence, we examined the extent to which the zero-shot and few-shot approaches agree on the most salient category for each post. This analysis addresses a fundamental question: Do different prompting strategies consistently identify the same primary theme in identical content?
The calculation of top category concordance uses a three-step systematic process as follows:
Step 1. Primary Category Identification: For each post, we extracted the category with the highest confidence score from both zero-shot and few-shot classification results.
Step 2. Binary Concordance Determination: We established a binary concordance indicator (match/no match) based on whether the top-ranked categories from both approaches were identical.
Step 3. Concordance Rate Calculation: We computed the overall concordance rate as the percentage of posts where both approaches identified the same top category.
This methodology isolates the most confident classification decision from each approach, enabling assessment of agreement on the primary thematic element regardless of differences in secondary categories.
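The three-step procedure can be sketched as follows (input format is our assumption: one dict of theme-to-confidence scores per post, per approach):

```python
def top_category(scores: dict) -> str:
    # Step 1: the category with the highest confidence score.
    return max(scores, key=scores.get)

def concordance_rate(zero_runs, few_runs) -> float:
    # Steps 2-3: binary match per post, then the overall match rate.
    matches = sum(
        top_category(z) == top_category(f)
        for z, f in zip(zero_runs, few_runs)
    )
    return matches / len(zero_runs)
```

Note that two posts can have identical top categories while disagreeing on every secondary theme, which is exactly the isolation this analysis is after.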
  • Confidence Distribution Metrics
Building upon our findings of significant divergence in category count and primary theme identification, we conducted a comprehensive analysis of confidence patterns and distributional properties to elucidate fundamental differences in classification behavior between the zero-shot and few-shot approaches.
We systematically examined three distinct but complementary dimensions of classification confidence:
Top Category Confidence: The confidence score assigned to the highest-ranked category for each post, representing the model’s certainty about its primary classification decision.
Focus Score: A distributional metric based on negative entropy, quantifying the concentration or dispersion of confidence scores across assigned categories. Higher values (less negative) indicate more focused confidence distribution.
Dominance Ratio: The ratio of the top category’s confidence score to the sum of all other category scores, measuring the relative prominence of the primary category compared to secondary categories.
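A sketch of the latter two metrics follows. The entropy's logarithm base is not stated in the text; base 2 is our assumption, chosen so that the minimum for 14 equiprobable themes (−log₂ 14 ≈ −3.81) covers the (−3.00, 0.00] ranges reported in the Results.

```python
import math

def focus_score(scores):
    # Negative Shannon entropy of the normalized confidence scores;
    # values nearer zero mean confidence concentrated in fewer themes.
    total = sum(scores)
    probs = [s / total for s in scores if s > 0]
    return sum(p * math.log2(p) for p in probs)  # equals -entropy

def dominance_ratio(scores):
    # Top confidence over the combined confidence of all other themes.
    top, *rest = sorted(scores, reverse=True)
    if not rest or sum(rest) == 0:
        return float("inf")  # single-theme posts have no competitors
    return top / sum(rest)
```

As a worked example, the scores [0.8, 0.2, 0.2] give a dominance ratio of 0.8 / 0.4 = 2.0, i.e., the primary theme carries twice the weight of all secondary themes combined.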

2.2.5. Statistical Analysis

We conducted paired-samples t-tests to compare the zero-shot and few-shot approaches across all metrics. For each test, we verified assumptions of normality using Shapiro–Wilk tests and examined distribution characteristics. Effect sizes were calculated using Cohen’s d. All analyses were performed using Python 3.13 with SciPy 1.15.3. We used a statistical significance level of p < 0.001 due to the large sample size and additionally reported practical significance by quantifying percentage differences between approaches to provide context for interpreting statistical results.
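The study computed these statistics with `scipy.stats.ttest_rel`; the standard-library version below yields the same t value and illustrates the Cohen's d definition used for paired data (mean difference over the standard deviation of the differences). Deriving p-values would still require a t distribution (e.g., `scipy.stats.t.sf`).

```python
import math
from statistics import mean, stdev

def paired_t_and_d(zero, few):
    # Paired-samples t statistic, Cohen's d, and degrees of freedom
    # for per-post metric values under the two prompting conditions.
    diffs = [z - f for z, f in zip(zero, few)]
    n = len(diffs)
    d_bar, d_sd = mean(diffs), stdev(diffs)
    t = d_bar / (d_sd / math.sqrt(n))
    cohens_d = d_bar / d_sd
    return t, cohens_d, n - 1
```

By construction, t = d·√n for paired data, which is a useful sanity check when validating against SciPy's output.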

3. Results

The analysis was conducted on 6850 posts, with 4458 posts (65%) exhibiting CDI values greater than 0.5 (see Figure 3). The paired t-test yielded t(4457) = 86.21, p < 0.001, with the mean number of categories for zero-shot (7.17) significantly higher than for few-shot (3.17), a mean difference of 4.00 categories. (In the plots, the color difference is purely aesthetic; the circles above the few-shot box represent outliers.)

3.1. Classification Concordance

The analysis of Classification Concordance (see Table 4) revealed a concordance rate of 30.78% (1372 out of 4458 posts), with a corresponding discordance rate of 69.22% (3086 posts). This indicates that, in more than two-thirds of cases, zero-shot and few-shot approaches identified different primary themes in the same content.

3.2. Confidence and Distributional Properties Analysis

Confidence metrics were calculated separately for zero-shot and few-shot classification results across all 4458 posts exhibiting high category divergence. The analysis revealed systematic differences across all confidence metrics, as shown in Table 5.
The item-level distribution of confidence scores, organized by range and categorized as Low, Medium, or High, is shown in Table 6. High-confidence predictions were more frequent under the few-shot approach, as the amount of classificatory ambiguity was reduced.
The values for Focus scores are shown in Table 7. In 89.6% of cases, the few-shot approach provided higher focus, with an average improvement of 38.6%.
Table 8 shows the results of the distribution analysis for each Focus category. Predictions in the highly uncertain range (−3.00 to −2.10) decreased from 29.4% to just 0.8%—a reduction of 28.6 percentage points. The moderately diffuse range (−2.10 to −1.50) similarly decreased by 31.7 percentage points, from 43.3% to 11.6%. Conversely, few-shot prompting increased focused classifications. The moderately focused range (−1.50 to −0.90) increased by 38.2 percentage points, growing from 19.3% to 57.5% of all predictions. Highly focused classifications (−0.90 to 0.00) increased from 8.0% to 30.2%, representing a 22.2 percentage point improvement.
Dominance scores are provided in Table 9, along with their distribution by item, in Table 10. The mean dominance score increased by 76.1%, from 0.46 to 0.81, indicating that few-shot examples substantially improve the model’s ability to identify primary themes. This improvement was evident in 86.3% of individual predictions, with only 9.6% showing decreased dominance under few-shot prompting. Under zero-shot prompting, 78.5% of predictions had dominance scores below 0.5, meaning the primary theme received less confidence than all secondary themes combined—indicating highly ambiguous classifications. Few-shot prompting reduced this proportion to 30.9%, representing a 47.6 percentage point improvement in classification clarity. Classifications where the primary theme approached the combined weight of secondary themes (0.5–1.0 range) increased from 12.8% to 38.0%, representing a 25.2 percentage point improvement.

3.3. High-Confidence Error Detection

To identify potential classification errors, we implemented a systematic high-confidence error detection framework targeting cases where zero-shot and few-shot approaches exhibited overconfident predictions that were likely incorrect. Using a confidence threshold of 0.5, we defined four categories of high-confidence errors:
False Positive Indicators:
High Confidence + High Divergence: Cases where zero-shot predictions exceeded 0.5 confidence but exhibited substantial inter-model disagreement (Category Divergence Index ≥ 0.5), suggesting overconfident misclassification.
High Confidence + High Focus + Category Disagreement: Cases with zero-shot confidence > 0.5, focus scores approaching zero (focus ≥ −0.3), but disagreement on primary category assignment between zero-shot and few-shot prompts.
False Negative Indicators:
High Confidence + Low Theme Count: Cases where zero-shot expressed confidence > 0.5 while identifying only one theme, potentially missing co-occurring themes characteristic of complex eating disorder discussions.
High Confidence + High Dominance: Cases where a zero-shot theme achieved overwhelming dominance (ratio ≥ 2.0) over all others in combination with high confidence, suggesting systematic under-detection of secondary relevant themes.
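The four rules above can be sketched as a per-post filter. The field names are our own, and the focus cut-off is read as focus ≥ −0.3, since focus scores are non-positive and "approach zero" from below.

```python
def flag_high_confidence_errors(row: dict, conf_thresh: float = 0.5) -> list:
    # `row` holds one post's metrics: zero-shot top confidence, CDI
    # against few-shot, focus, dominance, theme count, and whether the
    # two approaches matched on the primary theme.
    flags = []
    if row["zs_top_conf"] > conf_thresh:
        if row["cdi"] >= 0.5:
            flags.append("FP: high confidence + high divergence")
        if row["zs_focus"] >= -0.3 and not row["top_match"]:
            flags.append("FP: high confidence + high focus + disagreement")
        if row["zs_theme_count"] == 1:
            flags.append("FN: high confidence + low theme count")
        if row["zs_dominance"] >= 2.0:
            flags.append("FN: high confidence + high dominance")
    return flags
```

A single post can trigger several rules at once, so the per-rule counts reported below need not sum to the number of flagged posts.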
Analysis of the filtered dataset revealed 63 high-confidence error candidates (1.4% of total 4458 posts), with the following distribution:
False Positive Errors (36 cases, 57% of errors):
  • High Confidence + High Divergence: 27 cases (43%).
  • High Confidence + High Focus + Disagreement: 9 cases (14%).
False Negative Errors (27 cases, 43% of errors):
  • High Confidence + High Dominance: 17 cases (27%).
  • High Confidence + Low Theme Count: 10 cases (16%).

4. Discussion

Social media platforms offer several methodological advantages: they provide access to naturalistic discourse where individuals express unfiltered thoughts and experiences in their own language, capture real-time community dynamics and peer interactions, and reach populations who may not engage with formal healthcare systems or structured research studies. However, our methodological approach introduces specific limitations that must be acknowledged. The use of Reddit’s “hot” filtering algorithm creates systematic bias toward popular, highly engaged content rather than representative sampling of all posts [8]. This algorithmic selection may overrepresent dramatic, controversial, or highly relatable experiences while undersampling typical or mundane posts.
We implemented several strategies to mitigate potential biases inherent in social media analysis. To address population bias, we acknowledge that Reddit users represent a specific demographic subset, and our findings may not generalize to all individuals with eating disorders. The geographic distribution of location mentions suggests potential overrepresentation of Western perspectives, which we account for in our interpretation of results.
The semi-anonymous nature of Reddit provides certain methodological advantages for sensitive health research. Users operating under pseudonyms may share more honest experiences due to reduced social desirability bias compared to identified research participants. The anonymous structure also reduces motivation for deliberate deception, as users have limited incentive to misrepresent their experiences. Furthermore, the community-driven nature of Reddit discussions provides natural authenticity validation through peer responses and engagement patterns.
Regarding geographic limitations, we acknowledge that location references extracted through natural language processing represented textual mentions rather than verified user locations. Reddit’s API does not provide geographic user data, and we cannot definitively establish the actual geographic distribution of our sample or attribute specific cultural contexts to user experiences.
The multi-thematic coding system was specifically designed to capture the inherent complexity and interconnectedness of eating disorder discourse online. Unlike single-label classification systems, this approach recognizes that eating disorder discussions rarely fit into non-overlapping categories. Instead, they often weave together multiple psychological, behavioral, and social dimensions within a single statement. Our multi-thematic approach addresses the clinical reality that eating disorders are complex mental health conditions involving cognitive, emotional, behavioral and interpersonal components that cannot be adequately captured through mutually exclusive categories.
The system’s design acknowledges several key characteristics of eating disorder online discourse:
  • Layered Communication: Online posts about eating disorders often contain multiple layers of meaning. For example, a statement like “I feel disgusting after gaining 5 lbs. I need to start restricting again” simultaneously addresses weight concerns, body image disturbance, negative emotions, and disordered eating behaviors. The multi-thematic approach allows for the coding of all relevant dimensions: [Weight, Eating Disorder Symptoms, Negative Emotions].
  • Contextual Interdependence: Themes in eating disorder discourse are often contextually dependent. The system accounts for how the same phrase can indicate different themes based on context. For instance, “scared” might indicate Eating Disorder Symptoms when linked to food consumption (“too scared to eat”) but Negative Emotions when expressing general anxiety about recovery.
  • Temporal Complexity: Eating disorder discussions frequently involve past experiences, present struggles, and future concerns within the same post. The multi-thematic system can capture this temporal complexity, such as when individuals discuss past treatment while planning recovery strategies: [Treatment, Recovery, Advice/Reflection/Planning].
The coding system employs several design features to ensure comprehensive capture of eating disorder discourse:
  • Hierarchical Theme Structure: Each primary theme contains specific keywords and behavioral indicators that guide coding decisions. This hierarchical structure allows for both broad thematic categorization and granular content analysis. For example, the “Eating Disorder Symptoms and Behaviors” theme encompasses
    • Behavioral manifestations (binge eating, purging, restricting);
    • Physical symptoms (dizziness, fatigue, nausea);
    • Emotional precursors and consequences.
  • Boundary Definition Guidelines: The additional notes provide explicit guidance for distinguishing between overlapping themes, addressing common ambiguities in coding. These guidelines establish decision rules for boundary cases, such as
    • Distinguishing emotional distress directly linked to food consumption (Eating Disorder Symptoms and Behaviors) from general negative emotions.
    • Differentiating between personal recovery experiences and advice-giving behaviors.
    • Separating relationship dynamics from broader negative social reactions.
  • Inclusive Coding Principles: The system adopts an inclusive rather than exclusive approach to thematic assignment. Coders are instructed to apply all relevant themes rather than forcing content into the single “best” category. This principle ensures that the full complexity of eating disorder discourse is captured, reflecting the multifaceted nature of these conditions.
The mean confidence score increased from 0.266 to 0.424, representing a 59.4% improvement. This improvement was consistent across central tendency measures, with the median confidence increasing by 74.5% from 0.235 to 0.410. Notably, the standard deviation remained relatively stable (0.147 vs. 0.153), indicating that the confidence improvement was systematic rather than driven by increased variability.
Direct comparison of confidence scores for identical posts revealed that 87.26% of predictions showed higher confidence with few-shot prompting, while only 10.66% showed decreased confidence. The remaining 2.09% of predictions maintained identical confidence scores across both methods. This near-universal improvement in confidence suggests that few-shot examples provide systematic benefits rather than random improvements.
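The paired comparison reported above reduces to a per-post tally. A sketch, assuming per-post confidence scores are stored in dictionaries keyed by post ID (the function name and toy data are ours):

```python
def compare_paired_confidence(zero_shot: dict, few_shot: dict):
    """Fractions of posts whose confidence rose, stayed equal, or fell
    when moving from zero-shot to few-shot prompting."""
    n = len(zero_shot)
    higher = sum(few_shot[p] > zero_shot[p] for p in zero_shot)
    equal = sum(few_shot[p] == zero_shot[p] for p in zero_shot)
    return higher / n, equal / n, (n - higher - equal) / n

# Toy illustration (not the study's data):
zs = {"post1": 0.20, "post2": 0.30, "post3": 0.25}
fs = {"post1": 0.45, "post2": 0.30, "post3": 0.40}
```

Applied to the full set of 6850 posts, this tally yields the 87.26% / 2.09% / 10.66% split reported above.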
The substantial increase in model confidence (Table 6) suggests that few-shot examples serve a critical role beyond simple performance improvement: they appear to reduce classification ambiguity. The dramatic reduction in very low confidence scores (0.1–0.3 range) indicates that the model experiences significantly less uncertainty when provided with concrete examples of theme classification. While higher confidence scores are generally desirable, it is important to distinguish between well-calibrated confidence and overconfidence. The concentration of few-shot predictions in the medium-confidence range (0.3–0.6), rather than exclusively high-confidence predictions, suggests that the model is making more nuanced certainty assessments rather than simply becoming overconfident. The relatively modest increase in very high confidence predictions (0.6–1.0) supports this interpretation.
Few-shot prompting demonstrated markedly improved focus compared to zero-shot prompting (Table 7). The mean focus score increased from −1.715 to −1.053, representing a 38.6% improvement toward more focused classifications. This improvement was accompanied by reduced variability, with the standard deviation decreasing by 32.7%, from 0.563 to 0.379, indicating more consistent decisiveness across predictions.
The improvement in focus was nearly universal, with 89.61% of individual posts showing higher focus scores under few-shot prompting compared to zero-shot. Only 9.33% of posts showed decreased focus, while 1.05% maintained identical scores across methods.
The distribution analysis revealed a fundamental shift in classification patterns (Table 8). Few-shot prompting substantially reduced diffuse classifications, with predictions in the highly uncertain ranges (−3.00 to −2.10) decreasing from 29.4% to just 0.8%, a reduction of 28.6 percentage points. The moderately diffuse range (−2.10 to −1.50) similarly decreased by 31.7 percentage points, from 43.3% to 11.6%. Conversely, few-shot prompting dramatically increased focused classifications. The moderately focused range (−1.50 to −0.90) increased by 38.2 percentage points, growing from 19.3% to 57.5% of all predictions. Highly focused classifications (−0.90 to 0.00) increased from 8.0% to 30.2%, representing a 22.2 percentage point improvement.
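The focus metric can be sketched concretely, assuming (as suggested by its information-theoretic framing later in the discussion) that it is the negative Shannon entropy, with natural logarithm, of the normalized theme-score distribution. This assumption is consistent with the reported range of roughly −3.0 (diffuse over many themes) to 0.0 (a single theme):

```python
import math

def focus_score(theme_scores: dict) -> float:
    """Negative Shannon entropy of the normalized theme scores:
    0.0 when one theme carries all confidence, increasingly negative
    as confidence spreads across many themes."""
    total = sum(theme_scores.values())
    probs = [s / total for s in theme_scores.values() if s > 0]
    # -entropy = sum p*log(p); log(p) <= 0, so the score is <= 0.
    return sum(p * math.log(p) for p in probs)
```

Under this definition, a classification spread uniformly over four themes scores −ln(4) ≈ −1.39, which falls in the "moderately focused" band above, while a uniform spread over all 14 themes scores −ln(14) ≈ −2.64, deep in the "very diffuse" band.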
To assess how effectively each prompting method identifies primary themes relative to secondary themes, we calculated dominance scores as the ratio of the highest-confidence theme to the sum of all other theme confidences. This metric quantifies whether the model identifies a clear primary theme (dominance > 1.0) or distributes confidence more evenly across multiple themes (dominance < 1.0).
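The dominance score as defined above reduces to a one-line ratio; a sketch (the helper name is ours):

```python
def dominance_score(theme_scores: dict) -> float:
    """Ratio of the highest-confidence theme to the sum of all other
    theme confidences; > 1.0 means the primary theme outweighs all
    secondary themes combined."""
    top = max(theme_scores.values())
    rest = sum(theme_scores.values()) - top
    # A single-theme classification has no competitors.
    return float("inf") if rest == 0 else top / rest
```

For example, scores of {A: 0.6, B: 0.4} give a dominance of 1.5 (a clear primary theme), while {A: 0.5, B: 0.3, C: 0.2} give exactly 1.0, the boundary between a clear and an ambiguous primary theme.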
Few-shot prompting demonstrated markedly superior primary theme identification compared to zero-shot prompting (Table 9). The mean dominance score increased by 76.1%, from 0.46 to 0.81, indicating that few-shot examples substantially improve the model’s ability to identify primary themes. This improvement was evident in 86.3% of individual predictions, with only 9.6% showing decreased dominance under few-shot prompting.
The most striking finding was the substantial reduction in weak primary theme classifications (Table 10). Under zero-shot prompting, 78.5% of predictions had dominance scores below 0.5, meaning the primary theme received less confidence than all secondary themes combined, indicating highly ambiguous classifications. Few-shot prompting reduced this proportion to 30.9%, representing a 47.6 percentage point improvement in classification clarity.
Conversely, few-shot prompting substantially increased moderate primary theme identification. Classifications where the primary theme approached the combined weight of secondary themes (0.5–1.0 range) increased from 12.8% to 38.0%, representing a 25.2 percentage point improvement. This shift indicates that few-shot examples help the model develop clearer theme preferences without necessarily creating overconfident classifications.
Few-shot prompting also improved the model’s ability to make clearly decisive classifications. Predictions where the primary theme exceeded all secondary themes combined (dominance > 1.0) increased from 8.7% to 31.1%, a 22.4 percentage point improvement (Table 10). Within this category, both clear primary classifications (1.0–1.5 range) and strong dominance classifications (>1.5) showed substantial improvements, increasing by 15.0 and 7.4 percentage points, respectively.
The proportion of classifications with strong dominance (>1.5) tripled from 3.7% to 11.1%, indicating that few-shot prompting enables more confident primary theme identification when clear evidence exists in the text.
The error analysis revealed several key patterns. False positive errors were more prevalent than false negatives (57% vs. 43%), indicating a tendency toward over-assignment rather than systematic omission of themes. The most frequent error pattern was high-confidence predictions with substantial inter-model divergence (43% of all errors), suggesting systematic disagreements between the zero-shot and few-shot approaches in specific contexts.
The dominance-based false negative pattern (27% of errors) highlights a potential tunnel vision effect, in which the zero-shot approach confidently focuses on a single theme while missing relevant co-occurring themes. These findings underscore the importance of confidence calibration and multi-model validation in automated content analysis for sensitive clinical domains.
Our findings contribute to the growing evidence that few-shot prompting provides advantages for specialized domain classification tasks. The eating disorder theme classification task represents a challenging scenario with substantial inter-theme overlap, nuanced language patterns, and high-stakes classification decisions. The consistent improvements across confidence, focus, and dominance metrics suggest that few-shot examples provide domain-specific guidance that cannot be easily acquired through general pre-training alone.
The universality of improvements (87.3% of predictions showed higher confidence, 89.6% showed improved focus, 86.3% showed enhanced dominance) indicates systematic enhancement rather than selective benefits for some content types. This pattern suggests that few-shot prompting may be particularly valuable for other mental health and medical classification tasks where domain expertise is critical for accurate interpretation.

4.1. Limitations and Future Research Directions

  • Validation Requirements: While our confidence, focus, and dominance metrics provide compelling evidence for improved classification quality, they represent model self-assessments rather than objective accuracy measures. Future research should examine the relationship between these metrics and ground truth accuracy using expert-annotated datasets. The correlation between confidence scores and actual classification accuracy is particularly important for calibrating automated decision thresholds.
  • Generalizability and Robustness: The current study examines a single model (Llama 3.1:8b) and domain (eating disorder content). Replication across different language models and mental health domains is essential to establish the generalizability of these findings. Additionally, robustness testing with adversarial examples and edge cases should evaluate whether the observed improvements persist under challenging conditions.
    The selection and quality of few-shot examples likely influence performance substantially, but systematic investigation of example curation strategies remains an important research direction. Understanding how example diversity, complexity, and domain specificity affect classification quality could inform best practices for few-shot prompt design.
  • Ethical and Safety Considerations: The improved classificatory capabilities must be balanced against potential risks of increased automation in mental health contexts. While higher confidence and dominance scores suggest more reliable classification, they do not eliminate the possibility of systematic biases or misclassifications. Automated systems should maintain appropriate human oversight, particularly for high-risk content categories.
    The concentration of improvements in moderate confidence ranges suggests appropriate calibration, but monitoring for over-confidence bias in production deployments remains essential. Regular auditing and bias testing should be implemented to ensure that enhanced classificatory confidence does not mask problematic systematic errors.
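The calibration monitoring recommended above can be operationalized with a standard reliability check. The following sketch computes expected calibration error (ECE), assuming expert labels were available to mark each prediction as correct or incorrect (the function and data are illustrative, not part of the study's pipeline):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then compare each bin's mean
    confidence with its empirical accuracy; the gap, weighted by bin
    size, is the ECE (0.0 = perfectly calibrated)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        b = min(int(c * n_bins), n_bins - 1)  # place c == 1.0 in the top bin
        bins[b].append(i)
    ece = 0.0
    for idx in bins:
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(acc - conf)
    return ece
```

Auditing a deployed classifier would then amount to tracking ECE over time: a rising value would signal that growing confidence is outpacing actual accuracy, the over-confidence risk noted above.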

4.2. Broader Implications for AI in Mental Health

Our findings contribute to the broader computational mental health research agenda by demonstrating that sophisticated prompt engineering can substantially enhance the reliability of automated content analysis. The multi-dimensional improvement across confidence, decisiveness, and hierarchical reasoning suggests that few-shot prompting may be valuable for other mental health NLP tasks, including sentiment analysis, risk assessment, and therapeutic response generation.
The information-theoretic foundations of our metrics (entropy-based focus scores, ratio-based dominance scores) provide principled approaches for evaluating classification quality that extend beyond traditional accuracy measures. These metrics could be valuable for other high-stakes classification tasks where understanding the structure of model uncertainty is critical for appropriate automation decisions.
The enhanced reliability demonstrated through improved confidence and dominance scores suggests promising opportunities for human–AI collaboration in mental health contexts. Rather than replacing human judgment, improved automated classification could augment clinical decision-making by providing structured analysis of complex textual content. The ability to identify primary themes automatically could help clinicians focus their attention on the most critical aspects of patient presentations. However, successful integration requires careful attention to the division of labor between automated systems and human experts. Our findings suggest that few-shot prompting can enhance the reliability of automated preliminary assessment, but the ultimate responsibility for clinical decisions must remain with qualified professionals.

5. Conclusions

This study demonstrates that few-shot prompting substantially enhances multiple dimensions of eating disorder content classification quality. The improvements in confidence (59.4%), focus (38.6%), and dominance (76.1%) represent not merely incremental enhancements but fundamental improvements in classificatory reliability and informativeness. These findings have significant implications for the development of automated mental health content analysis systems and suggest promising directions for human–AI collaboration in clinical contexts.
The systematic nature of improvements across nearly 90% of individual predictions, combined with the multi-dimensional enhancement across confidence, decisiveness, and dominance, provides strong evidence for the value of few-shot prompting in specialized domain classification tasks. As computational mental health tools become increasingly prevalent, the enhanced reliability demonstrated in this study could contribute to safer and more effective automated support systems for individuals with eating disorders and other mental health conditions.
Future research should focus on validating these technical improvements against clinical outcomes and expert assessments, while developing best practices for few-shot example selection and prompt engineering in mental health contexts. The ultimate goal of computational mental health research is not merely technical advancement but meaningful improvement in care accessibility, quality, and outcomes for individuals experiencing mental health challenges.

Author Contributions

Conceptualization, A.P., S.A.S., L.H., Y.W. and S.M.; methodology, A.P., L.H., Y.W. and S.M.; software, A.P.; validation, A.P.; formal analysis, A.P. and Y.W.; investigation, A.P.; resources, A.P.; data curation, A.P. and S.A.S.; writing—original draft preparation, A.P. and S.M.; writing—review and editing, A.P., L.H., Y.W. and S.M.; visualization, A.P.; supervision, S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study because it did not meet the criteria for human subjects research (no intervention, interaction, or use of identifiable or private information).

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to prasada@uwm.edu.

Acknowledgments

The authors are grateful to the University of Wisconsin for its support for faculty and student research.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Annotation Instructions.
Figure 2. Example description for a theme to be annotated.
Figure 3. Zero-shot vs. Few-shot Category Distribution Analysis.
Table 1. Zero-Shot Prompt Template.
Component | Content | Purpose
Role Definition | “You are an expert in classifying posts about eating disorders according to specific themes.” | Establishes domain expertise and task context.
Theme Constraints | Numbered list of 14 predefined themes (Eating Disorder Symptoms and Behaviors, Weight, Body Image, etc.) | Constrains model vocabulary to valid themes.
Target Input | “Classify this post: [POST_TEXT]” | Presents classification target.
Output Format | Score assignment (0.0–1.0) with structured formatting requirements | Ensures parseable, consistent responses.
Table 2. Few-Shot Prompt Template.
Component | Description | Content Example | Word Count
Role Definition | Similar expert framing as zero-shot with enhanced emphasis | “You are an expert in classifying posts about eating disorders according to specific themes. Your task is to ONLY identify themes that are EXPLICITLY mentioned in the text, no implied or inferred themes.” | ~30 words
Annotation Guidelines | Full theme definitions with keywords and contextual information | “ANNOTATION GUIDELINES: [Full guidelines content loaded from annotation_guidelines.txt]” | ~2000 words
Additional Disambiguation Notes | Detailed guidance on distinguishing between overlapping themes | “ADDITIONAL NOTES ON CONFLICTING THEMES: [Additional notes content loaded from annotation_additional_notes.txt]” | ~2000 words
Theme List | Same 14 themes with emphasis on exact matching | “IMPORTANT: Only use EXACTLY these theme labels: 1. Eating Disorder Symptoms and Behaviors 2. Weight 3. Body Image […continues for all 14 themes]” | ~60 words
Annotated Examples | Up to 100 randomly selected annotated examples | “Here are [N] examples of how to classify posts with EXPLICIT themes only: Example 1: Post: ‘[EXAMPLE_POST_1]’ Classification: Theme_A: 0.6 Theme_B: 0.4 [..continues for up to 100 examples]” | ~20,000 words
Target Post | The post to be classified | “Now, please classify this new post based ONLY on what is explicitly stated: ‘[POST_TEXT]’” | Variable
Enhanced Instructions | Original instructions with additional anti-hallucination rules | “CRITICAL RULES: 1. Do NOT include themes that are merely implied. 2. Only include themes with direct textual evidence. 3. If uncertain about a theme, DO NOT include it. Only assign scores to themes genuinely and explicitly present in the text.” | ~250 words
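The component structure of the few-shot prompt (Table 2) can be assembled programmatically. A sketch in which the guideline text, theme list, and example records are placeholders for the actual guideline documents, and the function name is ours:

```python
def build_few_shot_prompt(post, guidelines, notes, themes, examples):
    """Concatenate the Table 2 components into a single prompt string.
    `guidelines` and `notes` stand in for the contents of the annotation
    guideline files; `examples` is a list of annotated records."""
    theme_list = "\n".join(f"{i}. {t}" for i, t in enumerate(themes, 1))
    example_block = "\n".join(
        f"Example {i}:\nPost: '{ex['post']}'\nClassification:\n"
        + "\n".join(f"{t}: {s}" for t, s in ex["scores"].items())
        for i, ex in enumerate(examples, 1)
    )
    return (
        "You are an expert in classifying posts about eating disorders "
        "according to specific themes. Only identify themes that are "
        "EXPLICITLY mentioned in the text.\n\n"
        f"ANNOTATION GUIDELINES:\n{guidelines}\n\n"
        f"ADDITIONAL NOTES ON CONFLICTING THEMES:\n{notes}\n\n"
        f"IMPORTANT: Only use EXACTLY these theme labels:\n{theme_list}\n\n"
        f"Here are {len(examples)} examples of how to classify posts:\n"
        f"{example_block}\n\n"
        "Now, please classify this new post based ONLY on what is "
        f"explicitly stated: '{post}'"
    )
```

With ~2000-word guideline files and up to 100 annotated examples, the assembled string reaches the ~20,000–25,000-word context usage summarized in Table 3.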
Table 3. Comparison of zero-shot and few-shot approaches.
Aspect | Zero-Shot | Few-Shot
Total Context Usage | ~300–500 words | ~20,000–25,000 words
Guidance Provided | Basic role + theme list | Comprehensive guidelines + disambiguation + annotated examples
Anti-Hallucination Measures | Simple instructions | Enhanced rules with explicit textual evidence emphasis
Examples for Context | None | Up to 100 annotated demonstrations
Table 4. Classification Concordance Between Zero-Shot and Few-Shot Approaches.
Analysis | Value
Total Posts Analyzed | 4458
Concordant Classifications | 1372
Discordant Classifications | 3086
Concordance Rate (%) | 30.78%
Discordance Rate (%) | 69.22%
Table 5. Confidence Scores by Prompting Method.
Metric | Zero-Shot | Few-Shot | Difference | % Change
Mean | 0.266 | 0.424 | +0.158 | +59.4%
Median | 0.235 | 0.410 | +0.175 | +74.5%
Standard Deviation | 0.147 | 0.153 | +0.006 | +4.1%
Comparative Outcomes:
Few-Shot > Zero-Shot: 87.26% of posts
Few-Shot = Zero-Shot: 2.09% of posts
Few-Shot < Zero-Shot: 10.66% of posts
Table 6. Distribution of Confidence Scores by Range.
Confidence Category | Zero-Shot | Few-Shot | Change | Interpretation
Low Confidence (0.0–0.3) | 71.4% | 18.5% | −52.9% | Substantial reduction in uncertain predictions
Medium Confidence (0.3–0.6) | 24.8% | 69.0% | +44.2% | Major increase in moderate certainty
High Confidence (0.6–1.0) | 3.9% | 12.5% | +8.6% | Notable increase in high-certainty predictions
Table 7. Focus Scores by Prompting Method.
Metric | Zero-Shot | Few-Shot | Difference | % Change
Mean | −1.715 | −1.053 | +0.662 | +38.6%
Median | −1.782 | −1.071 | +0.711 | +39.9%
Standard Deviation | 0.563 | 0.379 | −0.184 | −32.7%
Comparative Outcomes:
Few-Shot > Zero-Shot: 89.61% of posts
Few-Shot = Zero-Shot: 1.05% of posts
Few-Shot < Zero-Shot: 9.33% of posts
Table 8. Distribution of Focus Scores by Range.
Focus Category | Range | Interpretation | Zero-Shot | Few-Shot | Change | Effect
Very Diffuse | −3.00 to −2.10 | High uncertainty across many themes | 29.4% | 0.8% | −28.6% | Dramatic Reduction
Moderately Diffuse | −2.10 to −1.50 | Moderate spread across themes | 43.3% | 11.6% | −31.7% | Substantial Reduction
Moderately Focused | −1.50 to −0.90 | Emerging clarity on primary themes | 19.3% | 57.5% | +38.2% | Major Increase
Highly Focused | −0.90 to 0.00 | Clear confidence in specific themes | 8.0% | 30.2% | +22.2% | Substantial Increase
Table 9. Dominance Scores by Prompting Method.
Metric | Zero-Shot | Few-Shot | Difference | % Change
Mean | 0.46 | 0.81 | +0.35 | +76.1%
Median | 0.30 | 0.67 | +0.37 | +123.3%
Standard Deviation | 0.52 | 0.68 | +0.16 | +30.8%
Comparative Outcomes:
Few-Shot > Zero-Shot: 86.3% of posts
Few-Shot = Zero-Shot: 4.1% of posts
Few-Shot < Zero-Shot: 9.6% of posts
Table 10. Distribution of Dominance Scores by Range.
Dominance Category | Range | Threshold Meaning | Zero-Shot | Few-Shot | Change | Strategic Implication
Weak Primary Theme | 0.0–0.5 | Top < all others combined | 78.5% | 30.9% | −47.6% | Reduced ambiguous classifications
Emerging Primary | 0.5–1.0 | Top ~ half of others | 12.8% | 38.0% | +25.2% | Increased moderate clarity
Clear Primary | 1.0–1.5 | Top > all others combined | 5.0% | 20.0% | +15.0% | Enhanced decisive classification
Strong Dominance | 1.5+ | Top >> others | 3.7% | 11.1% | +7.4% | Improved high-confidence decisions

Share and Cite

MDPI and ACS Style

Prasad, A.; Shalmani, S.A.; He, L.; Wang, Y.; McRoy, S. Identifying Themes in Social Media Discussions of Eating Disorders: A Quantitative Analysis of How Meaningful Guidance and Examples Improve LLM Classification. BioMedInformatics 2025, 5, 40. https://doi.org/10.3390/biomedinformatics5030040

