Peer-Review Record

Who Speaks to Whom? An LLM-Based Social Network Analysis of Tragic Plays

Electronics 2025, 14(19), 3847; https://doi.org/10.3390/electronics14193847
by Aura Cristina Udrea 1,2, Stefan Ruseti 1,2, Laurentiu-Marian Neagu 1,2, Ovio Olaru 2, Andrei Terian 2 and Mihai Dascalu 1,2,3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 15 August 2025 / Revised: 21 September 2025 / Accepted: 26 September 2025 / Published: 28 September 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper “Who Speaks to Whom? An LLM-Based Social Network Analysis of Tragic Plays” presents an innovative application of large language models (LLMs) to the analysis of tragic plays, moving beyond traditional co-occurrence heuristics toward directed speech act modeling. The contribution is timely and significant, bridging digital humanities and computational linguistics in a way that promises practical impact through both methodological advancement and future platform development. However, I have the following comments:

1. Lines 47-49 are packed with multiple ideas, which can obscure the main message. I suggest that the authors break the statements down for easy readability and comprehension.
2. The introduction could benefit from a stronger narrative on why literary scholars and computational scientists alike care about addressee detection in tragic plays.
3. The phrase “nuancing our understanding” in line 50 is slightly awkward; consider rephrasing.
4. Much of the related work is descriptive, summarizing who did what (Moretti, Elsner & Charniak, Joty, etc.), without critically assessing the gaps, weaknesses, or relevance of each approach.
5. While computational and NLP-based works are well covered, there is limited engagement with literary or dramaturgical theories beyond Moretti and Trilcke.
6. Section 2.4 mentions scalability vs. accuracy, but the discussion remains high-level. The authors should provide a more systematic critique of trade-offs, e.g., annotation cost vs. transferability, interpretability vs. performance, etc.
7. Furthermore, in Section 2.4, emphasize the novelty gap by highlighting that most prior LLM studies are on narrative fiction and online dialogue, not tragic plays. This makes the paper’s contribution clearer.
8. Contrary to the statement made by the authors in line 202, Cohen’s κ = 0.63 suggests moderate rather than high agreement.
9. While the controlled overlap is technically sound, the rationale for choosing step size = k−2 is not clearly explained.
10. There is no quantitative comparison of performance across window sizes presented in this section; this leaves readers unsure which setting was optimal.
11. LLMs may show cultural or linguistic bias in addressee detection, especially across underrepresented languages (Romanian, Polish). This risk is not acknowledged.
12. The section lacks statistical significance testing (e.g., t-tests, ANOVA, or confidence intervals) to demonstrate whether performance differences between models or window sizes are meaningful rather than incidental.
13. The discussion notes the 9.66% gap between partial and exact matches, but the explanation is brief.

 

Comments on the Quality of English Language

The English language quality is competent, clear, and professional.

Author Response

Comment 0: The paper “Who Speaks to Whom? An LLM-Based Social Network Analysis of Tragic Plays” presents an innovative application of large language models (LLMs) to the analysis of tragic plays, moving beyond traditional co-occurrence heuristics toward directed speech act modeling. The contribution is timely and significant, bridging digital humanities and computational linguistics in a way that promises practical impact through both methodological advancement and future platform development. However, I have the following comments:
Response 0: Thank you kindly for your thorough review and suggestions.


Comment 1: Lines 47-49 are packed with multiple ideas, which can obscure the main message. I suggest that the authors break the statements down for easy readability and comprehension.
Response 1: Thank you for the suggestion; we have amended that paragraph.


Comment 2: The introduction could benefit from a stronger narrative on why literary scholars and computational scientists alike care about addressee detection in tragic plays.
Response 2: We agree with this and we have added a paragraph after line 27.

Comment 3: The phrase “nuancing our understanding” in line 50 is slightly awkward; consider rephrasing
Response 3: We have rephrased the paragraph.

Comment 4: Much of the related work is descriptive, summarizing who did what (Moretti, Elsner & Charniak, Joty, etc.),  without critically assessing the gaps, weaknesses, or relevance of each approach.
Response 4: We have added more critical assessments.

Comment 5: While computational and NLP-based works are well covered, there is limited engagement with literary or dramaturgical theories beyond Moretti and Trilcke.
Response 5: We have added 2 paragraphs covering more literary/dramaturgical theories.

Comment 6: Section 2.4 mentions scalability vs. accuracy, but the discussion remains high-level. The authors should provide a more systematic critique of trade-offs: e.g., annotation cost vs. transferability, interpretability vs. performance etc.
Response 6: We have added more details on the suggested trade-offs. 

Comment 7: Furthermore, in Section 2.4, emphasize the novelty gap by highlighting that most prior LLM studies are on narrative fiction and online dialogue, not tragic plays. This makes the paper’s contribution clearer.
Response 7: Thank you for the suggestion, we have added a paragraph at the end of the section regarding this.


Comment 8: Contrary to the statement made by the authors in line 202, Cohen’s κ = 0.63 suggests moderate rather than high agreement.
Response 8: We have revised our statement.


Comment 9: While the controlled overlap is technically sound, the rationale for choosing step size = k−2 is not clearly explained.
Response 9: We have added a short explanation: “This 2-line overlap preserves speaker–addressee continuity across windows while preventing excessive redundancy that could bias model predictions toward repeated patterns.”
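For illustration, a minimal Python sketch of how such overlapping windows could be constructed under this step-size choice; the window size, the utterance list, and the helper function itself are hypothetical, and the released pipeline may implement this differently:

```python
def sliding_windows(lines, k):
    """Split a scene into windows of k dialogue lines with a 2-line overlap,
    i.e., a step size of k - 2, so speaker-addressee continuity is preserved
    across window boundaries without excessive redundancy."""
    step = k - 2
    windows = []
    for start in range(0, len(lines), step):
        windows.append(lines[start:start + k])
        if start + k >= len(lines):  # last window already reaches the scene end
            break
    return windows

# Hypothetical usage: 12 utterances, window size 5 -> windows start at 0, 3, 6, 9
scene = [f"utterance_{i}" for i in range(12)]
for w in sliding_windows(scene, k=5):
    print(w)
```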


Comment 10: There is no quantitative comparison of performance across window sizes presented in this section; this leaves readers unsure which setting was optimal.
Response 10: Indeed, this raises an important concern about potential outliers inflating whole-scene performance (which appears to be the optimal setting). We conducted a comprehensive robustness analysis revealing that 81.8% of individual scene measurements (36/44) showed improvement from WS5 to whole-scene context. This systematic improvement pattern held across all three models (78.6-87.5% of scenes per model). Outlier detection using IQR methods found minimal outlier impact (≤2.84 percentage points change when outliers were removed). Statistical significance testing confirmed that the improvements were not due to chance (all p < 0.05). We have added Subsection 4.2 with all these details.
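A minimal sketch of this kind of robustness check, assuming per-scene scores are available as paired arrays; the numbers below are placeholders, not the values reported in the paper:

```python
import numpy as np
from scipy import stats

# Placeholder per-scene exact-match scores (%) for one model under two context
# settings; illustrative numbers only, not the paper's data.
ws5 = np.array([62.0, 70.5, 55.0, 81.0, 68.0, 74.0, 59.5, 77.0])
whole_scene = np.array([68.0, 75.0, 60.0, 83.5, 74.0, 78.0, 58.0, 80.0])

# Share of scenes that improve when moving from WS5 to whole-scene context.
improved = np.mean(whole_scene > ws5)

# Simple IQR-based outlier flagging on the per-scene differences.
diff = whole_scene - ws5
q1, q3 = np.percentile(diff, [25, 75])
iqr = q3 - q1
mask = (diff >= q1 - 1.5 * iqr) & (diff <= q3 + 1.5 * iqr)

# Paired test on the differences, with and without flagged outliers.
t_all = stats.ttest_rel(whole_scene, ws5)
t_trim = stats.ttest_rel(whole_scene[mask], ws5[mask])
print(f"improved scenes: {improved:.1%}, p(all) = {t_all.pvalue:.3f}, p(trimmed) = {t_trim.pvalue:.3f}")
```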


Comment 11: LLMs may show cultural or linguistic bias in addressee detection, especially across underrepresented languages (Romanian, Polish). This risk is not acknowledged.
Response 11: We have added an acknowledgment of this issue in the Discussion section. We have also added a table comparing the performance by language, and it seems that Romanian and Polish performed better than German, but we believe a future study with a larger corpus would be more conclusive.
 
Comment 12: The section lacks statistical significance testing (e.g., t-tests, ANOVA, or confidence intervals) to demonstrate whether performance differences between models or window sizes are meaningful rather than incidental.
Response 12: We have added an ANOVA test in Section 4.2, which was quite insightful, so thank you for the suggestion.
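A sketch of the kind of test added in Section 4.2, assuming per-scene scores grouped by window size; the values are illustrative, not the paper's results:

```python
from scipy import stats

# Illustrative per-scene exact-match scores (%) grouped by window size;
# not the values reported in the paper.
ws3 = [55.0, 60.5, 48.0, 70.0, 62.0, 58.5]
ws5 = [62.0, 70.5, 55.0, 81.0, 68.0, 74.0]
whole_scene = [68.0, 75.0, 60.0, 83.5, 74.0, 78.0]

# One-way ANOVA: does mean performance differ across window-size settings?
f_stat, p_value = stats.f_oneway(ws3, ws5, whole_scene)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```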


Comment 13: The discussion notes the 9.66% gap between partial and exact matches, but the explanation is brief.
Response 13: We added a more extensive explanation.

Reviewer 2 Report

Comments and Suggestions for Authors

The authors present an LLM-based pipeline to identify addressees (“who speaks to whom”) in tragic plays and then build directed, weighted social networks from those speech acts. Evaluated on 9 plays across 4 languages, their best model (Llama-3.3-70B) reaches 77.31% exact match and 86.97% partial match against expert annotations, improving over common co-occurrence/adjacency heuristics used in literary SNA. They visualize the resulting networks and analyze density and centrality patterns to surface power dynamics among characters. Code (vLLM inference) and data sources (DraCor) are provided. Overall, this is a timely and well-structured bridge between NLP and computational literary studies, with clear methodological transparency and substantive literary utility.

Weakness:

  • Only 14 “complex” scenes from 9 plays in 4 languages were used; scenes were pre-filtered to have ≥5 participants and specific dialogic features, which limits generalization to simpler dialogues and other genres.
  • For model/window comparisons (Table 4, Figures 2–3), add CIs or paired tests across scenes to support claims that whole-scene context outperforms smaller windows.
  • Collapsing AUDIENCE → ALL raised exact match from 25% to 75% in one scene, and treating LEUTE (“people”) as equivalent to ALL boosted exact/partial to 70%/78% in another—this makes headline numbers sensitive to annotation policy rather than model ability.
  • Complement per-scene Table 5 with a per-language summary (EN/DE/RO/PL).
  • Provide decoding parameters (temperature, top-p/top-k), random seeds, hardware/runtime, Git commit hash, and precise access dates for DraCor and the repo.
  • Add the corpus size (9 plays/14 scenes/4 languages), κ=0.63, and a one-clause definition of partial match.
  • The related work section needs to engage with the state of the art: "Beyond Word-Based Model Embeddings: Contextualized Representations for Enhanced Social Media Spam Detection" argues that contextualized, transformer-based representations outperform surface/word-level heuristics on noisy, dialogic text. That’s tightly analogous to this manuscript’s claim that LLMs beat adjacency/co-occurrence heuristics for “who-speaks-to-whom” in plays. "Predictive Analytics in Mental Health Leveraging LLM Embeddings and ML for Social Media Analysis" demonstrates a practical LLM-embedding pipeline with downstream ML on short, informal, multilingual-leaning text—useful to justify representational choices and pipeline transparency (prompts/embeddings → evaluation). "A Student-Centric Evaluation Survey to Explore the Impact of LLMs on UML Modeling" models structured LLM evaluation (rubrics, reliability, clearer reporting), which can strengthen the current paper’s evaluation and uncertainty reporting (e.g., adding CIs, per-subset breakdowns).
  • Authors acknowledge the model “tends to over-generate potential receivers,” evidenced by a ~9.66% gap between partial and exact match, indicating inflated partial scores and unclear precision.
  • Expand the discussion of over-generation with an error taxonomy (e.g., over-broad collective receivers; missed aside), a few qualitative examples, and a mention of potential calibrations (list-size penalties, re-ranker).
  • Consider adding n-labels and uncertainty overlays for readability.
  • Briefly justify the 0.8 coverage threshold for “dominant participants” or provide a sensitivity note.
  • Inter-annotator agreement is κ = 0.63; while acceptable for a hard task, the moderate ceiling makes small model differences hard to interpret without deeper error analysis.
  • It presents compelling graph visualizations and qualitative readings, but there is no quantitative validation linking network metrics to independent literary judgments or downstream tasks.
Comments on the Quality of English Language

Prefer precise verbs over “significant/robust/strong” unless backed by statistics; avoid hedging (“somewhat,” “rather”) and use consistent tense.

Author Response

Comment 0: The authors present an LLM-based pipeline to identify addressees (“who speaks to whom”) in tragic plays and then build directed, weighted social networks from those speech acts. Evaluated on 9 plays across 4 languages, their best model (Llama-3.3-70B) reaches 77.31% exact match and 86.97% partial match against expert annotations, improving over common co-occurrence/adjacency heuristics used in literary SNA. They visualize the resulting networks and analyze density and centrality patterns to surface power dynamics among characters. Code (vLLM inference) and data sources (DraCor) are provided. Overall, this is a timely and well-structured bridge between NLP and computational literary studies, with clear methodological transparency and substantive literary utility. 
Response 0: Thank you kindly for your thorough review and suggestions.


Comment 1: Only 14 “complex” scenes from 9 plays in 4 languages were used; scenes were pre-filtered to have ≥5 participants and specific dialogic features, which limits generalization to simpler dialogues and other genres.
Response 1: Our assumption was that if the models perform well on those complex scenes, they will perform even better on easier ones. We tested a couple of simpler scenes from Phèdre by Racine (in French), and indeed the accuracy was even better (the percentages for partial and exact match were near 100%). We did not include those in the paper because we wanted to keep the evaluation limited to the complex scenes selected by the literary historians in our team. The project that this article is part of focuses on tragedy, which is why we chose this genre, but we agree that a future study could benefit from a comparison with other genres as well.


Comment 2: For model/window comparisons (Table 4, Figures 2–3), add CIs or paired tests across scenes to support claims that whole-scene context outperforms smaller windows.
Response 2: We have added the subsection 4.2 with more details on this point.


Comment 3: Collapsing AUDIENCE → ALL raised exact match from 25% to 75% in one scene, and treating LEUTE (“people”) as equivalent to ALL boosted exact/partial to 70%/78% in another—this makes headline numbers sensitive to annotation policy rather than model ability. 
Response 3: The case of AUDIENCE -> ALL was specific to that play (Anarchia domowa); the annotators admitted that the receiver was unclear and could be considered either ALL or AUDIENCE, which is why we kept the higher percentage (75%) as the final result in the evaluation. For all other plays, AUDIENCE and ALL were treated as two separate receivers. In the case of “LEUTE”, we kept the smaller percentage in the evaluation because we generally consider the mob to be a different receiver from ALL. We also want to note that this is an important point we would like to take into consideration for future studies: how to correctly demarcate between the mob, AUDIENCE, and ALL, and how to resolve the reference of ALL (does ALL refer to all characters in the scene, in the play, etc.? This is a question we will explore in a future study).


Comment 4: Complement per-scene Table 5 with a per-language summary (EN/DE/RO/PL).
Response 4: This is a great suggestion, thank you; we’ve added the table.


Comment 5: Provide decoding parameters (temperature, top-p/top-k), random seeds, hardware/runtime, Git commit hash, and precise access dates for DraCor and the repo.
Response 5: We’ve added a table for the parameters and more details on the hardware. We have also added the GitHub version in the Contributions part, as well as the access date for DraCor in the Dataset section. The Contributions part already includes the dates for both the repository and DraCor.
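For readers reproducing the setup, a hedged sketch of where such decoding parameters are specified in a vLLM pipeline; the model identifier, prompt, and parameter values below are placeholders, and the actual settings are those reported in the manuscript's table:

```python
from vllm import LLM, SamplingParams

# Placeholder decoding configuration; the actual values are reported in the
# parameters table added to the manuscript.
sampling = SamplingParams(
    temperature=0.0,   # deterministic decoding aids reproducibility
    top_p=1.0,
    max_tokens=512,
    seed=42,
)

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")  # hypothetical model id
outputs = llm.generate(["<addressee-detection prompt for one window>"], sampling)
print(outputs[0].outputs[0].text)
```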


Comment 6: Add the corpus size (9 plays/14 scenes/4 languages), κ=0.63, and a one-clause definition of partial match.
Response 6: We have added this info in the abstract.


Comment 7: The related work section needs to engage with the state of the art: "Beyond Word-Based Model Embeddings: Contextualized Representations for Enhanced Social Media Spam Detection" argues that contextualized, transformer-based representations outperform surface/word-level heuristics on noisy, dialogic text. That’s tightly analogous to this manuscript’s claim that LLMs beat adjacency/co-occurrence heuristics for “who-speaks-to-whom” in plays. "Predictive Analytics in Mental Health Leveraging LLM Embeddings and ML for Social Media Analysis" demonstrates a practical LLM-embedding pipeline with downstream ML on short, informal, multilingual-leaning text—useful to justify representational choices and pipeline transparency (prompts/embeddings → evaluation). "A Student-Centric Evaluation Survey to Explore the Impact of LLMs on UML Modeling" models structured LLM evaluation (rubrics, reliability, clearer reporting), which can strengthen the current paper’s evaluation and uncertainty reporting (e.g., adding CIs, per-subset breakdowns).
Response 7: Thank you for your suggestions of papers; we have added them in the newly added Section 2.4 (the previous one is now 2.5).


Comment 8: Authors acknowledge the model “tends to over-generate potential receivers,” evidenced by a ~9.66% gap between partial and exact match, indicating inflated partial scores and unclear precision.
Expand the discussion of over-generation with an error taxonomy (e.g., over-broad collective receivers; missed aside), a few qualitative examples, and a mention of potential calibrations (list-size penalties, re-ranker).
Response 8: We’ve chosen to categorize the errors into 4 different types and we added examples for each of them. We also added a mention of potential calibrations.


Comment 9: Consider adding n-labels and uncertainty overlays for readability.
Response 9: We haven’t added the n-labels, as the number remains constant across all models and window sizes for the performance plots. As for the uncertainty overlays, the plots focus on average performance comparisons.


Comment 10: Briefly justify the 0.8 coverage threshold for “dominant participants” or provide a sensitivity note.
Response 10: Thank you, we added an explanation.


Comment 11: Inter-annotator agreement is κ = 0.63; while acceptable for a hard task, the moderate ceiling makes small model differences hard to interpret without deeper error analysis. It presents compelling graph visualizations and qualitative readings, but there is no quantitative validation linking network metrics to independent literary judgments or downstream tasks.
Response 11: We recognize the need for external validation of our network metrics. We are already planning a future study that will address this, as well as the agreement gap. 


Comments on the Quality of English Language
Comment: Prefer precise verbs over “significant/robust/strong” unless backed by statistics; avoid hedging (“somewhat,” “rather”) and use consistent tense.
Response: We have added the necessary corrections. 

Reviewer 3 Report

Comments and Suggestions for Authors

At several points, the paper reads as if LLM-based addressee detection is a solved problem (e.g., calling the method “superior” to conventional SNA). But your own results show weaknesses, especially in scenes like Dantons Tod, where performance is very low. This needs to be toned down and contextualized.

 

The dataset is small (9 plays, 14 scenes). While I understand the annotation burden, the limited sample size makes it hard to claim strong generalizability. For example, can we really extrapolate to all of DraCor from this? I encourage you to either (a) extend the dataset, or (b) more explicitly acknowledge that results are preliminary and not representative of all dramatic texts.

The distinction between “exact match” and “partial match” is useful, but the interpretation of partial matches is quite forgiving. For instance, if a model lists multiple receivers and one is correct, it counts as a hit. That inflates performance. Readers need to see a precision–recall trade-off or F1 scores to better understand the balance between over-generation and accuracy.

The literary analysis feels underdeveloped. The discussion mentions power dynamics and isolation, but the actual interpretive insights are limited to describing a few graphs (Hamlet, Penthesilea, Don Carlos). To strengthen the paper, the authors should show at least one deeper literary finding that emerges only thanks to this LLM-based method.

The reference list has some redundancies ([2] and [9] look like duplicates). Please clean this up.

It’s not always obvious which model outputs are open-sourced—are the annotations and model prompts also released, or only the pipeline?

Author Response

Comment 1: At several points, the paper reads as if LLM-based addressee detection is a solved problem (e.g., calling the method “superior” to conventional SNA). But your own results show weaknesses, especially in scenes like Dantons Tod, where performance is very low. This needs to be toned down and contextualized.
Response 1: We agree and we have changed our statement.


Comment 2: The dataset is small (9 plays, 14 scenes). While I understand the annotation burden, the limited sample size makes it hard to claim strong generalizability. For example, can we really extrapolate to all of DraCor from this? I encourage you to either (a) extend the dataset, or (b) more explicitly acknowledge that results are preliminary and not representative of all dramatic texts.
Response 2: We have added explicit acknowledgments that the dataset is preliminary and, in consequence, the results as well. We plan to extend the dataset in a future study.


Comment 3: The distinction between “exact match” and “partial match” is useful, but the interpretation of partial matches is quite forgiving. For instance, if a model lists multiple receivers and one is correct, it counts as a hit. That inflates performance. Readers need to see a precision–recall trade-off or F1 scores to better understand the balance between over-generation and accuracy.
Response 3: We’ve added precision, recall, and F1 scores.
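A sketch of how set-based precision, recall, and F1 could be computed per utterance for this task; this is a hypothetical helper under our reading of the evaluation, not the authors' released code:

```python
def prf_for_utterance(predicted: set, gold: set):
    """Set-based precision/recall/F1 for one utterance: precision penalizes
    over-generated receivers, recall penalizes missed ones."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical example: the model over-generates one extra receiver.
print(prf_for_utterance({"HAMLET", "HORATIO", "ALL"}, {"HAMLET", "HORATIO"}))
# -> precision 0.67, recall 1.0, F1 0.8
```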


Comment 4: The literary analysis feels underdeveloped. The discussion mentions power dynamics and isolation, but the actual interpretive insights are limited to describing a few graphs (Hamlet, Penthesilea, Don Carlos). To strengthen the paper, the authors should show at least one deeper literary finding that emerges only thanks to this LLM-based method.
Response 4: We’ve added a literary insight regarding scene density in Section 4.3. We plan on expanding this research in a future study focused on visualizations and the literary insights that can be extracted from them, which is why this study keeps its focus on addressee detection. Additionally, we can infer a strained relationship between characters who in close readings are known to have a close relation, even if it is negatively charged, e.g., between Hamlet and the king. These one-sided exchanges, notwithstanding the symbolic link between the characters, are indicative of a disruption in the relationship.
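A minimal sketch of how such one-sided exchanges could be surfaced from the directed, weighted network; the character pairs, counts, and asymmetry threshold are illustrative, not figures extracted from Hamlet:

```python
import networkx as nx

# Illustrative directed speech-act counts (speaker -> addressee, weight = count);
# not the actual figures from the analyzed plays.
edges = [("HAMLET", "KING", 9), ("KING", "HAMLET", 1),
         ("HAMLET", "HORATIO", 6), ("HORATIO", "HAMLET", 5)]

G = nx.DiGraph()
G.add_weighted_edges_from(edges)

print("density:", nx.density(G))

# One-sided exchanges: pairs where speech flows far more in one direction.
for u, v, w in G.edges(data="weight"):
    back = G[v][u]["weight"] if G.has_edge(v, u) else 0
    if w >= 3 * max(back, 1):  # arbitrary asymmetry threshold for illustration
        print(f"one-sided: {u} -> {v} ({w} vs {back})")
```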


Comment 5: The reference list has some redundancies ([2] and [9] look like duplicates). Please clean this up.
Response 5: We have fixed the issue.


Comment 6: It’s not always obvious which model outputs are open-sourced—are the annotations and model prompts also released, or only the pipeline?
Response 6: The model prompts are included in the open-source code. We’ve now also made public the human annotations and added the url in the paper.

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors have done a commendable job responding comprehensively to the comments:

  • Comments 1 and 3 were handled well, with rephrasing and restructuring to enhance flow.

  • Comment 2 was key, and the added paragraph strengthens the motivation for both literary scholars and computational scientists.

  • Comments 4 and 5 were particularly critical, and I am pleased they deepened the critique while broadening engagement with dramaturgical theory. This now provides a balanced foundation.

  • For Comments 6, 7, 9, 10, and 12, the authors have provided thorough, quantitative, and statistically significant analyses. The robustness checks, systematic trade-off discussions, and ANOVA testing substantially raise the methodological rigor.

  • The correction regarding Cohen’s κ (Comment 8) and the more detailed treatment of partial vs. exact matches (Comment 13) show careful scholarly integrity.

  • Comment 11’s inclusion of bias acknowledgment and a cross-language performance table strengthens the study’s transparency and sets up fruitful future work.

Overall, the revisions elevate the paper from being innovative but slightly descriptive to a methodologically rigorous and critically engaged piece. I find the manuscript now well-prepared for publication.

Author Response

Thank you kindly.

Reviewer 2 Report

Comments and Suggestions for Authors
  • The authors addressed my comments very well.
  • I'm satisfied with the response, and I see improvement. 

Author Response

Thank you kindly.

Reviewer 3 Report

Comments and Suggestions for Authors

The literary framing is nice, but reviewers will want more on the computational/NLP contribution. At times the text reads like a literary theory paper with NLP sprinkled in. You need to emphasize the engineering/method more than the literary analysis.

The metrics are mostly exact vs partial match. While you do mention precision/recall/F1 later, this should be more central. Otherwise, partial match (where “any overlap counts”) can be criticized as artificially inflating results.

Some results (like German Dantons Tod) are very low. You acknowledge this but don’t fully explain why. Is it due to model limitations, language-specific data scarcity, or complexity of the play? Needs deeper error analysis.

LLMs are black-box models. Since you use them for interpretation-heavy tasks (literary analysis), readers may expect a short section on interpretability and reproducibility limits.

No discussion of bias- plays are culturally loaded texts, and models trained on web corpora may skew certain interpretations.

The related work section is thorough, but the “gap” is diluted across many sub-areas (literary SNA, addressee detection, LLM pragmatics). You should sharpen the gap: “No one has done directed addressee detection in multilingual drama corpora with LLMs.”

Generally clear, but some sections are too long-winded and theoretical (especially 2.1 and 2.5). Could tighten for readability and to better fit MDPI’s technical style.

Abstract: Already strong, but could mention precision/recall results to make performance more transparent.

Method Section: Explain why you picked Llama, Gemma, Qwen specifically-were they chosen for size, multilingual strength, or availability?

Results Section: Move the F1/precision/recall results earlier, not buried deep.

Discussion: Add a clear Limitations paragraph (e.g., model over-generates, some genres harder, annotation agreement only κ=0.63).

Figures/Tables: Very good, but maybe add one “side-by-side” comparison of heuristic vs LLM-based network graphs to make the improvement visually obvious.

References: Solid coverage, but check balance between NLP/AI venues vs literary studies (lean it slightly more toward technical).

 

Author Response

Comment 0: The literary framing is nice, but reviewers will want more on the computational/NLP contribution. At times the text reads like a literary theory paper with NLP sprinkled in. You need to emphasize the engineering/method more than the literary analysis.

Response 0: Thank you kindly for your thorough review. We have addressed all your comments and believe that the presentation has now considerably improved.

 

Comment 1: The metrics are mostly exact vs partial match. While you do mention precision/recall/F1 later, this should be more central. Otherwise, partial match (where “any overlap counts”) can be criticized as artificially inflating results.

Response 1: Thank you for pointing this out. We added a short argument why we think this metric is relevant: “The moderate agreement between the annotators illustrates the subjectivity of this task. As such, the partial match metric was introduced for a more realistic evaluation scenario, considering that an overlap with at least one human annotator could be seen as a plausible answer.”
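A sketch of how exact and partial match might be scored against multiple annotators under the definition paraphrased above; this is a hypothetical helper, not the released evaluation script:

```python
def match_scores(predicted: set, annotator_labels: list[set]):
    """Exact match: the prediction equals at least one annotator's receiver set.
    Partial match: the prediction overlaps with at least one annotator's set."""
    exact = any(predicted == gold for gold in annotator_labels)
    partial = any(predicted & gold for gold in annotator_labels)
    return exact, partial

# Hypothetical example: two annotators disagree on whether the line is an aside.
print(match_scores({"OPHELIA"}, [{"OPHELIA", "POLONIUS"}, {"AUDIENCE"}]))
# -> (False, True): no exact match, but a partial one
```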

 

Comment 2: Some results (like German Dantons Tod) are very low. You acknowledge this but don’t fully explain why. Is it due to model limitations, language-specific data scarcity, or complexity of the play? Needs deeper error analysis.

Response 2: An explanation is provided: “A telling instance is in Dantons Tod, Scene 2, where the annotators’ varying conceptions of the collective addressees impacted the agreement scores. Baseline evaluation treats LEUTE ("people") and ALL as distinct labels, yielding baseline exact/partial matches of 40.35%/49.12%. However, the semantic equivalence of these terms (based upon their functional equivalence as collective addressees) boosts agreement to 70.00% (exact) and 78.00% (partial).”

 

Comment 3: The related work section is thorough, but the “gap” is diluted across many sub-areas (literary SNA, addressee detection, LLM pragmatics). You should sharpen the gap: “No one has done directed addressee detection in multilingual drama corpora with LLMs.”

Response 3: We have added a paragraph at the end of the Related Works section: “Overall, to the best of our knowledge, while previous studies have examined literary social network analysis, addressee detection, and discourse modeling with LLMs, none have addressed the task of directed addressee detection in multilingual drama corpora using LLMs. As such, our study is designed to address this gap.”

 

Comment 4: Generally clear, but some sections are too long-winded and theoretical (especially 2.1 and 2.5). Could tighten for readability and to better fit MDPI’s technical style.

Response 4: We agree, these 2 subsections had become over-descriptive based on previously received reviews. We have now rearranged them accordingly.

 

Comment 5: Abstract: Already strong, but could mention precision/recall results to make performance more transparent.

Response 5: We have changed the abstract accordingly.

 

Comment 6: Method Section: Explain why you picked Llama, Gemma, and Qwen specifically: were they chosen for size, multilingual strength, or availability?

Response 6: Great point; we have added the following in the Methods section: “Llama 3.3:70B was included for its open availability and strong general-purpose performance at scale, supporting reproducibility. Gemma 3:27B was chosen for its efficient architecture and competitive English performance. Qwen 3:8B was selected for its documented multilingual coverage, essential for our cross-linguistic setting. Together, these open models provide a balance of efficiency and multilingual strength, while covering a broad parameter range (from 8B to 70B) that enables a robust evaluation.”

 

Comment 7: Results Section: Move the F1/precision/recall results earlier, not buried deep.

Response 7: Done. You are perfectly right; we moved them immediately after introducing the exact/partial match scores.

 

Comment 8: LLMs are black-box models. Since you use them for interpretation-heavy tasks (literary analysis), readers may expect a short section on interpretability and reproducibility limits.

Response 8: Great point, we have added a Limitations section. 

 

Comment 9: No discussion of bias- plays are culturally loaded texts, and models trained on web corpora may skew certain interpretations.

Response 9: We have added this in the Limitations section.

 

Comment 10: Discussion: Add a clear Limitations paragraph (e.g., model over-generates, some genres harder, annotation agreement only κ=0.63).

Response 10: All points (7-9) have been introduced in the Limitations section.

 

Comment 11: Figures/Tables: Very good, but maybe add one “side-by-side” comparison of heuristic vs LLM-based network graphs to make the improvement visually obvious.

Response 11: We have added a note: “It is important to note the improved readability of the network graphs compared to views without addressee identification, which become considerably more cluttered due to multiple unnecessary connections. For illustration, see the web view of Hamlet on DraCor (https://dracor.org/gersh/hamlet-prinz-von-daenemark, accessed on 21 September 2025). We chose not to reproduce these views in the manuscript itself, as doing so would make the presentation more difficult to follow and too lengthy.”

 

Comment 12: References: Solid coverage, but check balance between NLP/AI venues vs literary studies (lean it slightly more toward technical).

Response 12: Thank you for this suggestion. We have briefly expanded the Related Work section by incorporating recent NLP/AI contributions on addressee recognition, thereby strengthening the technical balance of the references.

Round 3

Reviewer 3 Report

Comments and Suggestions for Authors

I accept it in its current form.
