Article
Peer-Review Record

Assessing Questionable and Responsible Research Practices in Psychology Master’s Theses

Behav. Sci. 2026, 16(1), 110; https://doi.org/10.3390/bs16010110
by Hilde E. M. Augusteijn 1, Jelte M. Wicherts 1, Klaas Sijtsma 1 and Marcel A. L. M. van Assen 1,2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 17 October 2025 / Revised: 5 January 2026 / Accepted: 8 January 2026 / Published: 13 January 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

See attached file. 

Comments for author File: Comments.pdf

Author Response

We thank the editor and reviewers for their constructive feedback. Below we respond to all comments in detail; our responses appear in italics. When we changed the text of our manuscript, we typically added the changed text in our response letter.

We believe that two comments from the reviewers in particular allowed us to improve our manuscript considerably. First, we added the results of multiple regression analyses to explain thesis grade. Second, our main interest is in the prevalence of RRPs and QRPs among recent psychology master’s students’ theses, as stated in the abstract and in the main research question, and not in the comparison with ‘the’ published literature. We made this clear(er) and pointed out that a clean comparison with (some of the) published literature requires a different study design than the one we used.

Here is our detailed response to the comments of the reviewers:

Reviewer 1

The authors present the results of research looking into the prevalence of good and questionable research practices in master’s theses at Tilburg University. They applied a thorough coding scheme to record the characteristics of the theses and then correlated various characteristics with the grades awarded for the theses. Overall this work contains a well-written introduction that outlines the key discussions in the area. The paper demonstrates methodological rigour, which is why I think it could warrant eventual publication.

Thank you for your positive evaluation of our manuscript.

My major reservations concern the statistical analysis and the framing of the discussion. I expand on these points below. First, I just wanted to highlight the high quality of the writing in the initial sections of the paper. The first paragraph introduces the topic beautifully and concisely. The whole section is well structured and the primary research presented here has strong justification - I enjoyed reading it. Second, the methodological rigour in this paper is outstanding, including the procedure of double coding and blind coding.

Thanks again.

My first major concern is about the statistical analysis. As it stands, the analysis doesn’t really show any useful information. I understand that your analysis is exploratory, but this doesn’t warrant running correlations on everything without further thought. Unfortunately the outcome variable is not present in the shared data (which I imagine is due to GDPR and that’s fair), so it’s hard to make recommendations for improvement without being able to poke around the data. But some practical considerations include:

Move Table 3 to the appendix. The vast majority of these correlations are not of interest given the research question of the present paper.

As it is normal practice to provide the descriptive statistics (including means of and correlations between the variables) in the manuscript, we decided to keep Table 3 in the main text. Table 3 is also not exceptionally large, with merely 16 variables in it. Finally, and most importantly, Table 3 is insightful, showing which manuscript characteristics are associated with each other.

This leaves us with Table 4. Based on the scripts, all the correlations in this table are simple Pearson correlations (or point-biserials for 0/1 coded categorical variables). However, based on the description of the variables, I doubt the majority of them have a distribution suitable for Pearson’s correlation. The outcome variable itself has a distribution ranging from 1 to 10, truncated at 6. If a correlation analysis is to be used, consider Spearman or robust alternatives.

The outcome variable (thesis grade) is continuous and assumed to be of interval scale by universities, as the thesis grade is averaged with course grades to determine if a student deserves graduation with distinction or cum laude. All the other predictors in Table 4 are either continuous and of ratio scale, or dichotomous. In the first case, Pearson’s correlation is the standard measure of association. Spearman’s rank-order correlation is only recommended in the case of ordinal variables (we do not have them) or when one of the variables has extreme outliers (we do not have them). When the predictor is dichotomous, Pearson’s correlation (which we indeed compute as point-biserial correlations) is directly related to Cohen’s d, the standard measure of association when comparing two groups. To summarize, using Pearson’s correlation in our case is just standard practice.
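To make the equivalence above concrete, here is a minimal R sketch with simulated data (the variable names are illustrative and not taken from our scripts): the Pearson correlation between a 0/1-coded group indicator and a continuous outcome is the point-biserial correlation, and for roughly equal group sizes it maps onto Cohen’s d via r ≈ d / sqrt(d^2 + 4).

# Minimal R sketch (simulated data; illustrative variable names).
set.seed(1)
n     <- 300
group <- rbinom(n, 1, 0.5)            # dichotomous predictor (0/1)
grade <- 7 + 0.4 * group + rnorm(n)   # continuous outcome

r_pb <- cor(grade, group)             # Pearson on 0/1 coding = point-biserial

# Cohen's d from the two group means and the pooled standard deviation
m1 <- mean(grade[group == 1]); m0 <- mean(grade[group == 0])
s_pooled <- sqrt(((sum(group == 1) - 1) * var(grade[group == 1]) +
                  (sum(group == 0) - 1) * var(grade[group == 0])) / (n - 2))
d <- (m1 - m0) / s_pooled

c(r_pb = r_pb, d = d, r_from_d = d / sqrt(d^2 + 4))   # r_pb and r_from_d agree closely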

Furthermore, the thesis grade distribution is not truncated at 6; a grade of 6 simply occurs infrequently. Thesis grades lie in the interval 6-10, with possible scores 6, 6.5, 7, 7.5, …, 9.5, 10.

To conclude, we do not see any opportunity to improve the assessment and testing of associations between thesis grade and the other predictors in Table 4.

The correlations themselves should really only be the starting point, because a correlation coefficient is an extremely uninformative effect size.

We disagree that the correlation is an uninformative effect size. Actually, the correlation coefficient (or its unstandardized form, the covariance) is the basis for most statistical analyses, including regression analysis and SEM. We do agree, however, that we can do multivariate statistical analyses that use these correlations as building blocks, such as multiple regression (see below).

Knowing that thesis length correlates with the grade at r = .188 is not particularly useful. The next steps in the analysis can concern linear (or non-linear) modelling that allows you to account for multiple variables in the same model - these variables relate to each other and don’t exist in a vacuum, but simple correlations treat them as if they do. Is there a more complex model that can explain the relationships? For example, if we hold thesis length constant, what is the relationship between RRP, QRP and the grade? It’s a quick way to expand the models in a way that is more representative of reality. Linear models would also provide you with parameter estimates that can be interpreted more intuitively. For categorical predictors, the beta estimates will tell you the difference in average grade (while holding other predictors constant), whereas for continuous predictors, it will tell you the change in the outcome associated with a change of one unit in the predictor, which is much more informative than a correlation coefficient.

True, we did not include a multiple regression analysis in our manuscript. We agree that the results of a multiple regression are a meaningful addition. Hence, we carried out multiple regression analyses and added their results to the manuscript. To follow up on the suggestion of Reviewer 3, we also assessed the effects of predictors of the same type, categorizing the thesis characteristics into ‘neutral’, ‘QRP’, ‘RRP’ and ‘hypotheses’. We included the following text in our manuscript, just above Table 4 when describing the results predicting thesis grade:

“A multiple regression showed that all 18 predictors together did not explain thesis grade (R2 = .094, adj. R2 = .036, F(18,279) = 1.62, p = .056). Additionally, each type of thesis characteristics did not contribute to the explanation of thesis grade after controlling for the effect of predictors of other types: RRP (F(7,279) = 0.93, p = .48), neutral (F(3,279) = 2.29, p = .08), QRP (F(3,279) = 2.12, p = .10), hypotheses (F(5,279) = 1.02, p = .40). See Table 4 for the categorization of the predictors into the four types.”
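For transparency, below is a minimal R sketch of how joint tests of this kind can be obtained by comparing the full regression model with a model that omits one block of predictors. The column names are hypothetical placeholders and the data frame 'dat' is assumed to hold thesis grade and the 18 predictors; this is a sketch of the approach, not our actual analysis script.

# Hedged R sketch of block-wise joint tests (hypothetical column names).
rrp_vars     <- c("power_analysis", "preregistered", "data_shared",
                  "code_shared", "effect_sizes", "ci_reported", "limitations")
neutral_vars <- c("n_pages", "sample_size", "n_citations")
qrp_vars     <- c("marginal_significance", "optional_stopping", "harking")
hyp_vars     <- c("hyp_listed", "hyp_directional", "hyp_supported",
                  "hyp_count", "hyp_exploratory")
all_vars     <- c(rrp_vars, neutral_vars, qrp_vars, hyp_vars)   # 18 predictors

full <- lm(reformulate(all_vars, response = "grade"), data = dat)
summary(full)        # overall F-test and R^2 for all 18 predictors together

# Joint contribution of one block (here the QRP block), controlling for the
# other blocks: compare the full model with the model that drops that block.
no_qrp <- lm(reformulate(setdiff(all_vars, qrp_vars), response = "grade"),
             data = dat)
anova(no_qrp, full)  # F-test for the QRP block given all other predictors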

I appreciate that the authors decided to go with a stringent alpha level, but given the exploratory nature of this work, this choice seems arbitrary. I’d argue that p-values in this scenario are not the appropriate tool for the job and they actively undermine the utility of the analysis. Instead, consider using Bayes Factors - even with uninformative priors, they can give you an indication of whether the hypothesis (null or alternative) is supported, not supported, or whether you simply lack the data to draw conclusions. These are just some suggestions based on reading the manuscript and looking through your scripts. I’m sure there are many more sophisticated ways in which this could be modelled.

Alpha levels are arbitrary in any case. But given the number of analyses we ran we wanted to limit the number of false positives, and as generally recommended in the literature we accomplished that by choosing a more stringent alpha level (akin to Bonferroni).

We decided not to implement Bayes Factors. Bayes Factors are not as well known as p-values, and for finding support for a nonzero bivariate association, p-values and Bayes Factors provide highly comparable results; see, for example:

Johnson, V. E. (2013). Revised standards for statistical evidence. Proceedings of the National Academy of Sciences, 110(48), 19313-19317.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E. J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6(3), 291-298.

As our sample size is not small, our statistical power to detect medium to large associations is very high (.998 for a point-biserial correlation of .3 [medium effect size] with 300 theses and alpha-level = .005, according to G*Power), whether we use p-values or Bayes Factors. To conclude, using Bayes Factors or p-values will not affect our conclusions on practically meaningful correlations.
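For readers who wish to verify such a power figure without G*Power, a minimal R sketch using the pwr package is given below (the exact value depends on details such as one- versus two-sided testing, so this is an approximate check rather than an exact reproduction of the .998 reported above).

# Approximate check of the power figure with the 'pwr' package
# (the value reported above was obtained with G*Power).
library(pwr)

# Power to detect a correlation of .3 with n = 300 at alpha = .005
pwr.r.test(n = 300, r = 0.3, sig.level = 0.005, alternative = "two.sided")
# power comes out well above .99, consistent with the conclusion above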

My second point concerns the framing of the discussion. The discussion jumps straight into comparing master students with published researchers. Of course, it is important to situate findings in existing literature, but the way it is presented now makes it sound as if you applied the same coding scheme to published papers, collected your own data and then conducted an analysis to run this comparison. It’s a little misleading, and this impression is emphasised by the fact that the comparison is also highlighted in the abstract. The discussion should first highlight the main findings from your analysis before drawing links to existing literature.

We understand the confusion. We now open the discussion with “Relating our findings to what we know about published manuscripts in the social sciences in general…”. Because immediately relating our findings to what we know about published manuscripts is more efficient, we did not restructure the discussion further.

If our primary goal was to systematically compare QRPs and RRPs in master theses to those in published manuscripts, we indeed also would have applied the same coding scheme to a selection of this literature. But this was not our goal. We added a limitation in the discussion section addressing this point, see also our response to the first main comment of Reviewer 2 below.

We did not adapt the abstract, as it already makes clear that we only examined 300 psychology master’s theses.

Minor things:

Abbreviations:

Please try to limit the use of abbreviations unless necessary. FFP can just be termed “research misconduct”, while “SP” and “CP” can use the original terms (social psychology and clinical psychology). None of these terms are used often enough in the main text (figure is fine) to warrant an abbreviation. Conversely, NHST is not defined.

Thank you, we agree with all your suggestions and modified the manuscript accordingly.

Line 90: preregistration of the research proposal is a relatively simple procedure, especially since students often write down their research plan for their supervisors, including many of the details that would also be included in the preregistration (Pownall, 2020).

I wouldn’t describe preregistration as a “relatively simple procedure”. A good preregistration will be a detailed protocol that contains a full description of the methodology, a data processing script, and an analysis plan, as well as contingency plans for when the original analysis is not feasible. Given the short time scale of master’s projects, it’s understandable that most students don’t undertake it.

We agree with you that a ‘good preregistration’ is not at all that simple. This is also clear from research examining the quality of preregistrations (e.g., the work of Olmo van den Akker and colleagues), which shows that very few preregistrations may qualify as ‘good’. On the other hand, preregistration using the standard format of AsPredicted.org is easy and not time-consuming, particularly when realizing that ‘students often write down their research plan for their supervisors’.

Hence we changed the text into “ … preregistration of the research proposal at, for instance, AsPredicted is a relatively simple procedure, …” (lines 94-98).

Line 195: The term “researchers’ degrees of freedom” is not really explained anywhere.

We added its definition between brackets, the first time we use it in the text: “(i.e., the many choices researchers make during a study, such as which variables to analyze, which participants to exclude, or when to stop data collection (Simmons et al., 2011))” (lines 82-86).

Line 198: Information about testing is only transparent and trustworthy when an analysis plan was preregistered, and the reported results concerning the planned hypotheses exactly match the preregistration or emphasizes any differences. This statement is too sweeping and needs nuance. Preregistration is not the ultimate stamp of trustworthiness. There are far too many poorly written preregistrations for this to be the case. While I personally think preregistration is good practice, it is far too easy to abuse for the purposes of performative transparency.

We agree that preregistration is not the ultimate stamp of trustworthiness, that’s why we added the second part of the sentence “… and the reported results concerning the planned hypotheses exactly match the preregistration or emphasizes any differences.” (lines 204-208).

We still adhere to this statement. We do agree that most preregistrations are not good (in line with Van den Akker’s findings), but that is another matter; if preregistrations are good and at the same time the hypotheses, measurement, and analyses in the paper match those in the preregistration, then the information about testing is transparent and trustworthy.

To follow up on your suggestion, the only change we implemented is that we omitted the word “only” in our sentence.

Line 217: …inaccurate reporting or unfortunate rounding off a p-value… “inaccurate rounding” is more appropriate than “unfortunate” here.

Thanks, we deleted “unfortunate”.

Line 446: For both items, coders needed to rely heavily on their own interpretation, thus introducing subjectivity. Can you add details about how any discrepancies were resolved?

By discussion. See Section 2.3 for the description of the development and implementation of the coding protocol.

Line 465: Please add R version and relevant R packages with citations if relevant.

We added that we used R version 4.4.3.

Line 657: If researchers teach their students how to conduct research according to textbook rules, why do they forgo RRPs and engage in QRPs when publishing a manuscript? This suggests that this is not just ignorance, but that researchers believe that the rules of the game are different for master’s theses than for their own manuscripts. I think this perspective is a little simplistic. First, it is very likely that students are not taught research methods and statistics by the same people that supervise them. Often, statistics lecturers will be members of the faculty dedicated to research methods, who are therefore more likely to have up-to-date knowledge about good statistical practice compared to an average supervisor. Second, this likely has less to do with beliefs about “the rules of the game” than with the pressures under which researchers operate, which are very different for master’s students. As you pointed out earlier, the majority of QRPs are likely not malicious but may occur as a result of biases towards certain types of results. These biases arise due to the pressure to publish papers and secure grants. As you also highlighted, the majority of master’s theses do not get published, so the supervisors will naturally treat them differently than they treat work intended for publication.

We agree with you that the final sentences of the paragraph were too short and simplistic. We kept “If researchers teach their students how to conduct research according to textbook rules, why do they forgo RRPs and engage in QRPs when publishing a manuscript?” but replaced the short and simplistic answer with:

“Many factors may contribute to the explanation, including different rules for publishing (e.g., word limits), evaluation (e.g., different emphasis on statistically significant results), performance pressure, etc. More research is needed into possible explanations of the difference in use of RRPs and QRPs in theses and published manuscripts.” (lines 684-687).

Line 698: You mention that resits were included in your sample as a limitation but you don’t mention that your sample is quite biased towards good theses, given that they all passed. It adds a complication to your interpretation of the central research question, given that the theses included in your sample may have been lower on QRPs in the first place - there’s simply no way to know. It’s not a deal breaker and you can still run a meaningful analysis based on the passing grades, but this should be highlighted as a limitation.

Very few theses fail, and failing theses usually pass in the resit. We do not have accurate data on the percentage of theses that fail, either before or after the resit, but this percentage after the resit is very likely below 5%. That is, it is unlikely that much bias can result from excluding at most 5% of the data. We added the sentence:

“Although the omission of (initially or finally) insufficient theses may have led to bias in our results, potential bias is likely small as the large majority of theses is rated as sufficient.” (lines 730-732).

Finally, it appears that there were two research assistants who seemed to have done a substantial amount of work, including developing the coding scheme and double coding the studies. Giving them the opportunity to review the final version of the manuscript would have qualified them for authorship, yet they are not listed as authors. It’s important that individuals involved in research are given appropriate credit, especially ECRs and trainees.

The coding scheme was developed solely by the first and last author. The assistants provided input on the coding scheme and carried out a large part of the coding, but did nothing more than this. See Section 2.3 for the description of the development and implementation of the coding protocol.

We agree that all individuals involved in research should be given appropriate credit, and the assistants in our study get all the credit they deserve based on their work.

Reviewer 2 Report

Comments and Suggestions for Authors

See attached report. Thank you for the opportunity to review this manuscript.

Comments for author File: Comments.pdf

Author Response

We thank the editor and reviewers for their constructive feedback. Below we respond to all comments in detail; our responses appear in italics. When we changed the text of our manuscript, we typically added the changed text in our response letter.

We believe that two comments from the reviewers in particular allowed us to improve our manuscript considerably. First, we added the results of multiple regression analyses to explain thesis grade. Second, our main interest is in the prevalence of RRPs and QRPs among recent psychology master’s students’ theses, as stated in the abstract and in the main research question, and not in the comparison with ‘the’ published literature. We made this clear(er) and pointed out that a clean comparison with (some of the) published literature requires a different study design than the one we used.

Here is our detailed response to the comments of the reviewers:

Reviewer 2

Summary

This paper examines the prevalence of “responsible research practices” (RRPs) and “questionable research practices” (QRPs) in thesis research conducted by psychology Master’s students. The authors thoroughly review 300 theses submitted by psychology master’s students who successfully completed their degree at Tilburg University in the Netherlands between 2017 and 2020, coding up their characteristics, and analyzing the features of the thesis manuscripts, emphasizing features they categorize as RRPs and QRPs. The authors assess similarities and differences between the characteristics of their sample of 300 theses and past reviews of published research in psychology or graduate student work from other institutions. The authors interpret these trends, concluding that “Apparently, students were taught how to conduct research and how to write theses in a more responsible way than researchers behave when they write their own scientific manuscript” and suggest paths for future work on the topic.

I found the paper and the topic interesting, but a few elements could be strengthened or clarified to improve its contribution. I enjoyed reviewing this submission, and my comments are below, divided between main comments and minor points.

Thanks.

Main Comments

  1. Abstract; p.16; elsewhere — throughout the manuscript, there is a tension between the construct the authors are able to measure, versus the underlying construct they are trying to understand. Is the goal to draw conclusions about the relative prevalence of RRPs and QRPs among recent psychology master’s students against A) past cohorts of psychology master’s students or B) similar recent peer-reviewed published research or C) historical practices in psychological research? All three? The work the authors cite is better suited to answer some of these questions than others—and the authors should be clear about what we can learn from the different comparisons they make, without overgeneralizing. In the ideal case, the authors might want to compare the characteristics of theses produced by psychology Master’s students against some notion of “the same type of research published in peer-reviewed journals at the time.” None of the evidence they cite is well-suited to this task, but the 2024 update to Hardwicke et al. (2021) provides the closest “apples-to-apples” comparison (or better yet, a road map re: how the authors might generate such a base of evidence). The manuscript should be revised to be clear about the lessons from different comparisons (e.g., “compared with summaries of published work in psychology between 1990 and 2007...” is very different than “compared with a random sample of empirical psychology papers published in 2022” but right now, everything reads “compared with published psychology literature” or “compared with published psychology research”). In the manuscript, the authors compare their sample of 300 Tilburg University psychology theses from 2017-2020 versus:
  • a sample of 250 University of Vienna psychology theses from 2000-2016 (Olsen et al., 2019)
    ◦ A reasonable test of “how do the theses of recent psychology graduate students compare with past cohorts”
  • a sample of 207 psychology theses from 40 public German universities completed between 2014 and 2016 (Krishna & Peter, 2018)
    ◦ A reasonable test of “how do the theses of recent psychology graduate students in Tilburg compare with the theses of recent psychology graduate students from across Germany”
  • a random sample of 250 psychology articles published between 2014-17 (Hardwicke et al., 2021)
    ◦ A reasonable test of “how do theses of recent psychology graduate students from Tilburg compare with a random sample of recent published psychology articles”
    ◦ This is the closest comparison group to “similar recent peer-reviewed research” but there are a few caveats:
      ▪ Adoption of RRPs like pre-registration literally was growing exponentially during this period (see Figure 4 of Nosek et al., 2019 or the authors’ own citation of Bakker et al., 2020)
      ▪ Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Dreber, A., ... & Vazire, S. (2022). Replicability, robustness, and reproducibility in psychological science. Annual Review of Psychology, 73(1), 719-748.
      ▪ The sampling frame is all articles published on Scopus with an ASJC code related to psychology: 40/228 articles in the Hardwicke et al. (2021) studies were classified as “no empirical data,” making this an importantly different comparison group relative to the theses, which seemed to emphasize (require? See comment 2 below) some form of data analysis.
      ▪ Hardwicke and colleagues have updated this analysis using more recent data and limiting the sampling frame to “empirical articles” (and given publishing delays, one could argue that 2022 publications are a reasonable comparison group for research first drafted between 2017-2020). In my opinion, this update is closer to an “apples-to-apples” comparison than the 2021 article.
      ▪ Hardwicke, T. E., Thibault, R. T., Clarke, B., Moodie, N., Crüwell, S., Schiavone, S. R., ... & Vazire, S. (2024). Prevalence of transparent research practices in psychology: A cross-sectional study of empirical articles published in 2022. Advances in Methods and Practices in Psychological Science, 7(4), 25152459241283477.

Thanks for pointing out this paper. It is highly relevant to our paper, and we refer to it now.

  • a summary of meta-analyses published between 1990 and 2007 (Fritz et al., 2013)
    ◦ A reasonable test of “how do theses of recent psychology graduate students from Tilburg compare with historical research practices in psychology (before the replication crisis)”
  • a random sample of psychology articles published in 2007
    ◦ A reasonable test of “how do theses of recent psychology graduate students from Tilburg compare with historical research practices in psychology (before the replication crisis)”
  • a sample of 44 articles published in the Journal of Educational Psychology in 1963 (22 articles) and 1983 (22 articles) (Swales and Najjar, 1987)
    ◦ This is not a useful comparison/representation of any literature and should be replaced with a better source.

We thank the reviewer for pointing this out. We were interested in the prevalence of RRPs and QRPs among recent psychology master’s students’ theses, as stated in the abstract and in the main research question. And we discuss how this GENERALLY compares to the published literature. We had no specific reference group in mind. If we had a specific reference group in mind, we would have incorporated that in the design of our study. For instance, by also trying to tally the prevalence of RRPs and QRPs in this reference group of published studies (e.g., published papers in journals X in fields Y of psychology in time period Z). This was not our intention, and our project was already daunting as it was (i.e., extremely time-intensive).

Your comment stimulated us to be more modest with respect to our statements and conclusions, also because “the published literature” is very general and very large differences in the prevalence of RRPs and QRPs are likely present within it.

This led to the following changes:

  • No changes in the abstract, we keep the reference group implicit/general (“Compared to authors of published scientific manuscripts…”).
  • When interpreting our findings in the first paragraph of the discussion, we now state “Relating our findings to what we know about published manuscripts in the social sciences in general, …”
  • We added the following paragraph to the discussion section:

“A fifth limitation concerns the interpretation of our findings when comparing them to the published literature in the social sciences. Our comparisons include results of articles on student theses and on published manuscripts in different fields and different time-points further in the past. We realize that researcher and master student behavior differs across fields and may improve over the years. If comparing their manuscripts is the goal, then we recommend systematically comparing theses to published manuscripts in the same field and year(s).” (lines 739-745).

Abstract; p.16; elsewhere — throughout the manuscript, there is a tension between the construct the authors are able to measure, versus the underlying construct they are trying to understand. Is the goal to draw conclusions about the relative prevalence of RRPs and QRPs among recent psychology master’s students against A) past cohorts of psychology master’s students or B) similar recent peer-reviewed published research or C) historical practices in psychological research? All three?

Our main research question is stated in our paper as “What evidence of RRPs or QRPs do psychology master’s theses show, and what do their supervisors reward when they grade this final exam?” (lines 73-74). Only in our discussion do we compare it to “published research in general”. See our response to your previous comment.

  2. The manuscript would benefit from greater insight into the Master’s thesis assignment. Can the authors provide instructions that were given to students from representative programs/cohorts as appendix materials? It would be helpful to understand the extent to which RRPs were actively encouraged (or QRPs discouraged) in the assignment itself versus internalized via coursework. Were students provided with rubrics or other guidelines that might have promoted or discouraged certain research practices? These are relevant to the interpretation of findings.

Good point. We uploaded the English student guides of the master thesis of the social psychology and clinical psychology programs to OSF (https://osf.io/b4g32/). The clinical psychology master has four tracks and only one of them is in English, but the student guides of these four tracks are obviously almost identical with respect to contents, hence we uploaded all relevant student guides. The student guides mainly explain the process of writing and supervision, and do not hint at using QRPs when writing.

We added “We uploaded the English student guides for the master programs to the OSF page.” in Section 2.3 on the coding protocol where we also describe the types of master theses.

  3. Some of the RRPs the authors discuss (e.g., power analyses) might be found in a pre-analysis plan or pre-registration but omitted from a published manuscript.

Yes, but we examine theses of master students and not published manuscripts. Only very few theses were preregistered (11 or 3.7%, see Table 2).

  4. p.16, line 584; Table 4 would benefit from 1) organizing features as RRPs, QRPs, or Neutral and 2) running joint tests within groups (i.e., do RRPs jointly predict grades? What about QRPs or Neutral features?)

Thank you. To follow up on your and Reviewer 1’s suggestion, we also ran multiple regression analyses with all 18 predictors together as well as regression analyses assessing the (joint) effects of thesis characteristics of type ‘Neutral’, ‘QRP’, ‘RRP’, and ‘hypothesis’. The type ‘hypothesis’ includes five thesis characteristics that all involve hypotheses but could not be grouped into one of the other types. We included the following text in our manuscript, just above Table 4 when describing the results predicting thesis grade:

“A multiple regression showed that all 18 predictors together did not explain thesis grade (R2 = .094, adj. R2 = .036, F(18,279) = 1.62, p = .056). Additionally, each type of thesis characteristics did not contribute to the explanation of thesis grade after controlling for the effect of predictors of other types: RRP (F(7,279) = 0.93, p = .48), neutral (F(3,279) = 2.29, p = .08), QRP (F(3,279) = 2.12, p = .10), hypotheses (F(5,279) = 1.02, p = .40). See Table 4 for the categorization of the predictors into the four types.”

  5. The Fritz et al., 2013 and Kühberger et al., 2014 papers are a strange benchmark for “the published literature” without being clear that you are comparing current work to historical peer-reviewed work that was executed before the replication crisis. This is somewhat implied by the 2013 and 2014 publication dates, but “these master’s students’ work is different from what people were doing between 1990 and 2007 (Fritz et al., 2013)” or “these master’s students’ work is different from what people were doing in 2007 (Kühberger et al., 2014)” is subtly but importantly different from “these master’s students are more likely to adopt RRPs than similarly situated scholars submitting peer-reviewed work,” as is implied/suggested in the discussion (see, e.g., 606-614).

We removed one reference to Fritz et al. in the introduction. We added the reference to Szucs & Ioannidis (2017) when stating that:

“the statistical power of statistical tests in students’ master theses was higher than the power typically reported in published manuscripts, assuming that true effects, study designs and significance levels do not differ (Fritz et al., 2013; Kühberger et al., 2014; Szucs & Ioannidis, 2017).” (lines 615-618).

This literature is not a benchmark but merely shows that master’s theses fare better than published manuscripts, as evidenced by snapshots in time and fields. We address this as a limitation of these comparisons in the discussion section (the new paragraph discussing the fifth limitation).

  6. p.17, lines 640-645; This is a nice contribution that could be expanded upon/highlighted.

Yes, we agree. This finding baffled us. We expanded it by adding the sentence:

“Hence, a unidimensional construct of “responsible scientific behavior” likely does not exist, going against the intuition of having both scientifically irresponsible scientists on the one hand and responsible ones on the other hand.” (lines 666-670).

And to the abstract we added as a last sentence:

“We also found no relationships among practices, suggesting that there is no unidimensional construct of ‘responsible scientific behavior’.”

  7. pp.16-17, line 594, lines 623-628: “assuming true effects, study designs, and significance levels do not differ” is not a trivial assumption—in particular, the focal hypothesis in the modal thesis is either a test of moderation or mediation, not a main effect. How does this affect how we should think about the relevant effect size benchmarks?

As we wrote earlier in response to your main comment, we have no clear ‘reference group’. Depending on the reference group of published research, the true effects, study designs, and type of effects will differ. Hence, we cannot make general statements that always hold.

In the paragraph that includes the statement you refer to, we are in our opinion careful with our statements by always adding “assuming…” or “if…”. Hence, we did not make changes here.

  8. p.17, lines 649-651 — this should be revisited with an eye toward “apples-to-apples” comparisons—it is likely true from the evidence you cite that students today conduct more responsible research than their advisors were conducting in the 1990s. It is unclear how the research behaviors of students today compare with those of their advisors today.

For a systematic comparison, which was not our aim, other research is needed (see our earlier response, and the new paragraph in the discussion on the 5th limitation).

After reading Thelwall & Mas-Bleda (2020), however, we believe that the differences with respect to research practices between on the one hand student theses and doctoral dissertations and on the other hand published manuscripts have been substantial for a long time.

  9. Similarly, the text on lines 658-660 fully ignores the fact that the primary comparisons of published work pre-date the replication crisis—this is not an apples-to-apples comparison. If you want to make this kind of claim, back it up by analyzing the features of empirical psychology research today, not in 1990 or 2007. The Hardwicke et al. (2024) update provides a useful starting point.

Thanks for referring to Hardwicke et al. (2024), we incorporated it into our paper at several places.

  10. p.18, lines 678-680 – I do not see the logical connection here—revisit or revise this concluding sentence.

Lines 678-680 corresponded to “The attitude of the supervisor, and the type of mentoring have shown to affect the students’ attitude and behavior (Anderson et al., 2007; Gopalakrishna, ter Riet, et al., 2022; Gopalakrishna, Wicherts, et al., 2022; Krishna & Peter, 2018).” As these lines make sense in the context of both the preceding and subsequent text, we did not revise the text.

  11. p.16, lines 589-590; the Swales & Najjar, 1987 citation is indefensible and the sentence is an inaccurate summary of the article. The article summarizes the presence of hypotheses in the introductions of articles published in a single journal, “The Journal of Educational Psychology”, from 2 years, 1963 and 1983, which is in no meaningful way a reflection of ‘the field of educational psychology’ or a relevant benchmark for this sample. Revise.

We agree. We deleted the reference to this old work and replaced it with the reference to the meta-study of Thelwall and Mas-Bleda (2020). We now write:

“Students explicitly listed their hypotheses in 69% of the theses, which is much higher than the prevalence of the word ‘hypothesis’ (about 33%) or ‘research question’ (about 19%) in the main body of the text of papers in the social sciences (Thelwall & Mas-Bleda, 2020). Thelwall and Mas-Bleda (2020, p.744) also report that hypotheses and research questions were stated much more frequently in U.S. doctoral dissertations than in research articles, in several fields.” (lines 607-612).

Minor Comments

  1. Abstract: The first sentence notes that there are large, open questions about the substance of graduate training in psychology—“[…] it is unclear what psychology master’s students are taught[…]”—suggesting that this will feature in the paper, but there is no discussion of the graduate coursework or training at Tilburg University with respect to RRPs/QRPs in the manuscript. I suggest reworking the opening sentence to focus on the

Agreed. We changed the end of the first sentence to “… but it is unclear which research practices psychology students engage in when graduating their master’s program.”

  2. P.2, line 89; Typo: Hardwicke et al., 2021 is cited as Hardwicke et al., 2022

Thanks. Corrected.

  3. P.10, line 381; Are all the confidence intervals 95% CIs? If not, how did the coders handle studies that used other conventions? For example, this paper sets a much higher threshold for statistical significance for the tests in Table 4.

We do not know if all CIs are 95% and added ‘(of any width)’ in the text.

We believe that the reader can choose his/her own threshold, based on the values of the correlations together with the significance levels .005 and .05 in Table 3, and the p-values in Table 4.

  4. P.13, line 411; The authors say “most theses” when they are talking about fewer than half (47.3%)—I would suggest “nearly half of the theses” or “roughly half” as most generally implies >50%.

Oops, this was a clear error on our side. Thanks. We corrected it to “nearly half”.

  5. pp.14-15, Section 3.2; the discussion of “large” and “moderate” correlations in section 3.2 struck me as a bit odd. What benchmarks are you using to delineate whether a correlation is “large”, “moderate”, or “small”?

Common benchmarks are .1, .3, and .5 for small, medium, and large correlations, respectively. We added this to the text.

  6. pp. 15-16, Tables 3 and 4; the conventions for statistical significance seem arbitrary—why mark 0.05 and 0.005 in Table 3, but only 0.005 in Table 4? Why not mark 0.05, 0.01, and 0.001 in all three? That seems more conventional to me, but perhaps it varies by discipline or subdiscipline? An explanation would be helpful.

Yes, it is all arbitrary. Readers can pick their own significance level using the p-values in Table 4.

To be consistent, we also added the .05 significance level to Table 4. And to the text interpreting the findings of Table 4 we added: “Using α = .005 to adjust for multiple testing…”

  7. p.4, line 178 & p.17, line 635: NHST is not defined in the manuscript

Thanks. We no longer use the abbreviation in the paper.

Reviewer 3 Report

Comments and Suggestions for Authors

In this manuscript, the authors present a descriptive study of 300 Master's theses in psychology programs situated in a Dutch university. In general, I find that this is an important topic that is approached in a methodologically rigorous fashion. I applaud the work put in by the authors in conducting this research and in reporting it transparently. I have some minor to moderate concerns that I list in the following, but in general, I am positive about this manuscript and believe it will be a worthy addition to the literature.

POINTS OF CRITICISM (presented chronologically)

I'm slightly concerned about some of the implicit conclusions in the theory section, although these are mostly minor points. First, the citation of Hardwicke et al. (2022; ls. 87-89) seems to be used to support an implicit claim that preregistration is not common; however, the time period sampled (2014-2017) is likely unrepresentative of modern practices (given the speed of development of RRPs in psychology). Second, the conclusion that many studies may be underpowered (ls. 113-116) is likely premature, as a between-subjects design is not necessarily prototypical and the data underlying the statement are roughly a decade old. This is nitpicky, but I'm always quite sensitive to what I perceive as overly pessimistic readings of the state of the art, because I personally am convinced that there are negative psychological effects on researchers (see e.g. Motyl et al., 2017). Particularly the former assumption (of a between-subjects design) is carried forward at multiple points in the manuscript; I feel this should be tempered, as many studies are fully within-subjects and may thus be better powered than might be concluded from the manuscript.

I would appreciate a mention of the issues with power analysis that often preclude their use, such as the lack of a credible effect size estimate for the given paradigm or the use of statistical analyses for which power analysis is infeasible (e.g., multilevel models). The degree to which one accepts or denies the existence of these issues is important for the degree to which one considers power analysis reporting normative - while it is certainly an RRP, I would contend it is only an RRP when it is applicable to the study, and further that a large proportion of studies would NOT have an applicable power analysis. While I accept that the normativity of power analysis is an open question in the field, I think that this controversy should be briefly discussed. 

I wonder whether the choice to include all publicly available theses but not all confidential theses constitutes a selection bias. This might be the case if, for example, confidentiality is employed as a blanket "fig leaf" for not needing to share data in a research environment that is hostile to open practices, in which case one would expect higher QRP and lower RRP rates in the confidential theses. Similar considerations apply to the missing theses; is it possible that theses employing worse practices are systematically more likely to be missing? Without deeper knowledge of the background of how theses are normally made public at Tilburg University, this is impossible to judge. I therefore encourage the authors to discuss this point in somewhat more detail and potentially explicitly acknowledge possible selection biases. I also want to briefly emphasize that I in NO WAY want to accuse researchers who habitually work with confidential data of any malfeasance; I simply wanted to give a strong (hypothetical) example of how bias might result from this selection rule.

In the discussion, comparing to Swales & Najjar without acknowledging that their data is roughly 40 years old (!!!) strikes me as inappropriate. I'm not sure whether that comparison should be made at all; if it is, it certainly must be noted that this is a "best guess" number in the absence of more up-to-date data.

I wonder whether the authors feel that an analysis of the RRP/QRP rate as a function of cohort year might be of interest. One could argue that the question of RRP/QRP use in student theses in particular was coming into stronger focus in the time period covered by the sampled theses (2017-2020), considering the publication dates of many of the authors' cited works. I think it might be quite interesting to identify any hints of a trend, both from that perspective and for the descriptive question "are things getting better?". However, if the authors feel that characteristics of their dataset preclude such an analysis, I would be interested to hear their reasoning.

 

Signed,

 

Author Response

We thank the editor and reviewers for their constructive feedback. Below we respond to all comments in detail; our responses appear in italics. When we changed the text of our manuscript, we typically added the changed text in our response letter.

We believe that two comments from the reviewers in particular allowed us to improve our manuscript considerably. First, we added the results of multiple regression analyses to explain thesis grade. Second, our main interest is in the prevalence of RRPs and QRPs among recent psychology master’s students’ theses, as stated in the abstract and in the main research question, and not in the comparison with ‘the’ published literature. We made this clear(er) and pointed out that a clean comparison with (some of the) published literature requires a different study design than the one we used.

Here is our detailed response to the comments of the reviewers:

In this manuscript, the authors present a descriptive study of 300 Master's theses in psychology programs situated in a Dutch university. In general, I find that this is an important topic that is approached in a methodologically rigorous fashion. I applaud the work put in by the authors in conducting this research and in reporting it transparently. I have some minor to moderate concerns that I list in the following, but in general, I am positive about this manuscript and believe it will be a worthy addition to the literature.

We thank the reviewer for the positive evaluation of our manuscript.

POINTS OF CRITICISM (presented chronologically)

I'm slightly concerned about some of the implicit conclusions in the theory section, although these are mostly minor points. First, the citation of Hardwicke et al. (2022; ls. 87-89) seems to be used to support an implicit claim that preregistration is not common; however, the time period sampled (2014-2017) is likely unrepresentative of modern practices (given the speed of development of RRPs in psychology).

We corrected the reference to Hardwicke et al., which was published in 2021 rather than 2022. And we added the reference to Hardwicke et al. (2024), who found that the rate of preregistration has indeed increased over time until 2022, but is still relatively low (about 7% in the field of psychology as a whole).

Second, the conclusion that many studies may be underpowered (ls. 113-116) is likely premature, as a between-subjects design is not necessarily prototypical and the data underlying the statement are roughly a decade old. This is nitpicky, but I'm always quite sensitive to what I perceive as overly pessimistic readings of the state of the art, because I personally am convinced that there are negative psychological effects on researchers (see e.g. Motyl et al., 2017). Particularly the former assumption (of a between-subjects design) is carried forward at multiple points in the manuscript; I feel this should be tempered, as many studies are fully within-subjects and may thus be better powered than might be concluded from the manuscript.

Thanks. We always add the context of the power analysis, which is the independent sample t-test and a between-subjects design. We changed the sentence to:

“However, the median sample size reported in published articles in psychology is only 62 (Hartgerink et al., 2017), which corresponds to a median power of .39, using the same independent t-test, suggesting that many studies would have been severely underpowered were they using a between-subjects design (Bakker et al., 2012; Szucs & Ioannidis, 2017).” (lines 118-122).
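As an illustration of how such a median power figure follows from a median sample size, here is a hedged R sketch using the pwr package. The effect size below is an assumption chosen for illustration; the precise value of .39 depends on the effect size used in the manuscript’s own calculation.

# Hedged illustration: power of a two-sample t-test at the median published
# sample size of 62 (31 per group), for an assumed effect size d.
library(pwr)

pwr.t.test(n = 62 / 2,          # participants per group
           d = 0.43,            # assumed 'typical' effect size (illustration)
           sig.level = 0.05,
           type = "two.sample",
           alternative = "two.sided")
# yields power of roughly .4 under this assumed d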

I would appreciate a mention of the issues with power analysis that often preclude their use, such as the lack of a credible effect size estimate for the given paradigm or the use of statistical analyses for which power analysis is infeasible (e.g., multilevel models). The degree to which one accepts or denies the existence of these issues is important for the degree to which one considers power analysis reporting normative - while it is certainly an RRP, I would contend it is only an RRP when it is applicable to the study, and further that a large proportion of studies would NOT have an applicable power analysis. While I accept that the normativity of power analysis is an open question in the field, I think that this controversy should be briefly discussed. 

Power analysis for multilevel models is quite feasible, either using the effective sample size and the ICC, or using many assumptions in combination with statistical software like PiNT.
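A minimal sketch of the effective-sample-size shortcut mentioned here, with illustrative numbers (the cluster count, cluster size, and ICC below are assumptions, not values from any thesis): the total N is deflated by the design effect 1 + (m - 1) * ICC, where m is the cluster size, and the result is plugged into an ordinary single-level power routine.

# Hedged R sketch of an approximate power analysis for a two-level design
# via the effective sample size (all numbers below are illustrative).
library(pwr)

n_clusters   <- 30      # e.g., classrooms
cluster_size <- 20      # observations per cluster (m)
icc          <- 0.10    # intraclass correlation

design_effect <- 1 + (cluster_size - 1) * icc        # Kish design effect
n_effective   <- n_clusters * cluster_size / design_effect

# Approximate power to detect a correlation of .2 at the effective N
pwr.r.test(n = n_effective, r = 0.2, sig.level = 0.05)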

Yes, good point - we also agree that including a power analysis is not an RRP per se; the power analysis should also be appropriate for the research. At the start of the paragraph discussing the RRP power analysis, we added “appropriate” to “power analysis”, and added as a last sentence:

“Note, however, that the research cited here and ours did not check whether the reported power analysis was appropriate for the research.” (lines 139-141).

I wonder whether the choice to include all publicly available theses but not all confidential theses constitutes a selection bias. This might be the case if, for example, confidentiality is employed as a blanket "fig leaf" for not needing to share data in a research environment that is hostile to open practices, in which case one would expect higher QRP and lower RRP rates in the confidential theses. Similar considerations apply to the missing theses; is it possible that theses employing worse practices are systematically more likely to be missing? Without deeper knowledge of the background of how theses are normally made public at Tilburg University, this is impossible to judge. I therefore encourage the authors to discuss this point in somewhat more detail and potentially explicitly acknowledge possible selection biases. I also want to briefly emphasize that I in NO WAY want to accuse researchers who habitually work with confidential data of any malfeasance; I simply wanted to give a strong (hypothetical) example of how bias might result from this selection rule.

Yes, we understand your reasoning. We do not believe there is a difference, because – we are embarrassed to admit – it is mostly the department secretary or thesis coordinator who decided not to publish a full batch of theses, although they do not have the right to do so (i.e., the student is the owner of his/her own thesis, not the university). Anyway, we added the following text as the third limitation (as it remains a possibility, in theory):

“Thirdly, our results may be biased because of theses excluded. Many theses were ‘confidential’, and in theory they could differ with respect to QRPs and RRPs from available theses.” (lines 725-727).

In the discussion, comparing to Swales & Najjar without acknowledging that their data is roughly 40 years old (!!!) strikes me as inappropriate. I'm not sure whether that comparison should be made at all; if it is, it certainly must be noted that this is a "best guess" number in the absence of more up-to-date data.

We agree. We deleted the reference to this old work and replaced it with the reference to the meta-study of Thelwall and Mas-Bleda (2020). We now write:

“Students explicitly listed their hypotheses in 69% of the theses, which is much higher than the prevalence of the word ‘hypothesis’ (about 33%) or ‘research question’ (about 19%) in the main body of the text of papers in the social sciences (Thelwall & Mas-Bleda, 2020). Thelwall and Mas-Bleda (2020, p.744) also report that hypotheses and research questions were stated much more frequently in U.S. doctoral dissertations than in research articles, in several fields.” (lines 607-612).

I wonder whether the authors feel that an analysis of the RRP/QRP rate as a function of cohort year might be of interest. One could argue that the question of RRP/QRP use in student theses in particular was coming into stronger focus in the time period covered by the sampled theses (2017-2020), considering the publication dates of many of the authors' cited works. I think it might be quite interesting to identify any hints of a trend, both from that perspective and for the descriptive question "are things getting better?". However, if the authors feel that characteristics of their dataset preclude such an analysis, I would be interested to hear their reasoning.

Thanks for the suggestion. We think the time span is very short for expecting and finding a trend. That is, even if there is a trend, we believe the effect will be small for such a short time span and we would have little statistical power to detect such a small effect.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

See attached report. Thank you for the opportunity to review this revision.

Comments for author File: Comments.pdf

Author Response

Response to reviewers

We thank Reviewer 2 for their second round of constructive feedback. Below we respond to all comments in detail; our responses appear in italics. When we changed the text of our manuscript, we typically added the changed text in our response letter.

Reviewer 2

This paper examines the prevalence of “responsible research practices” (RRPs) and “questionable research practices” (QRPs) in thesis research conducted by psychology Master’s students at Tilburg University. The authors’ revisions are largely responsive to the referee reviews, but a few small points merit further attention.

I remain concerned that in some sections of the manuscript, the construct the authors are able to measure (the extent of RRP adoption among current Master’s students relative to historical norms in Psychology) is not reflected in the authors’ language. Looseness about comparisons with “published manuscripts” or “the published literature” leads the authors to overreach in some of their claims and minimize or ignore conflicting contemporary evidence. I would prefer that the manuscript was clearer about the fact that the comparisons in rates of RRP adoption vs. published work are primarily relative to historical work (i.e., prior to the replication crisis).

This is not true. When interpreting our findings by relating them to previous work, which we only do in our discussion section (until about line 715), we refer to studies from 2010, 2012, 2014, 2015, 2016 (twice), 2017 (twice), 2018 (twice), 2019, 2020, 2021, and 2022 (twice). Many researchers label 2012 as the start of the reproducibility crisis in psychology. All references are to recent studies, most of them after 2012. We consider the work of Thelwall & Mas-Bleda (2020) the most relevant, and note that this study is also well after the onset of the replication crisis.

To reiterate my prior comment, “The work the authors cite (or use as benchmarks) are better suited to answer some […] questions than others—and the authors should be clear about what we can learn from the different comparisons they make, without overgeneralizing.”

Our research question is “What evidence of RRPs or QRPs do psychology master’s theses show, and what do their supervisors reward when they grade this final exam?” Comparisons with ‘published manuscripts’ or ‘the published literature’ only serve the purpose of interpreting our findings, and we only do so in our discussion section. In our first revision we also added:

“A fifth limitation concerns the interpretation of our findings when comparing them to the published literature in the social sciences. Our comparisons include results of articles on student theses and on published manuscripts in different fields and different time-points further in the past. We realize that researcher and master student behavior differs across fields and may improve over the years. If comparing their manuscripts is the goal, then we recommend systematically comparing theses to published manuscripts in the same field and year(s).”

Hence, we respectfully disagree with Reviewer 2 that we are overgeneralizing.

1.

The authors remain analytically loose in their discussion of the implications of their finding that current Master’s students make greater use of RRPs than historical research practices in psychology (mostly before the replication crisis). For example, on p. 18 (lines 749-785), the authors conclude that the fact that Tilburg Master’s students make greater use of RRPs implies “[…]there is no or just a weak association between a researcher’s behavior and what this researcher values in theses.” This is a strong claim that, in my opinion, is outside the scope of what the data support. To test this hypothesis, the appropriate benchmark would be a measure of RRPs in the graders’ own contemporaneous research, not the historical behavior of the entire field of psychology. I strongly suggest that the authors revisit this paragraph (beginning with “Master’s theses showed more evidence of RRPs[…]” and ending with “[…]what this research values.”) and revise/qualify the claims. The thought experiment (“It would therefore be interesting to see[…]”) is fine, but unless I am mistaken, the concluding sentence suggests evidence that the authors simply do not have.

Here we agree with Reviewer 2, and we thank them for the suggestion. We deleted the final sentence entirely.

a.

Furthermore, Christensen and coauthors’ 2020 survey of PhD students and published authors in psychology shows limited differences in attitudes toward Open Science practices between published authors and PhD students (this can be gleaned from a comparison of the “Psychology” plots in Appendix Figures 6 and 7), casting further doubt on the notion that the absence of RRPs in historical scholarship is a reflection of active scholars refusing to adopt Open Science principles in their published work at higher rates than graduate students.

i.

Christensen, G., Wang, Z., Levy Paluck, E., Swanson, N., Birke, D., Miguel, E., & Littman, R. (2020). Open science practices are on the rise: The state of social science (3S) survey. https://escholarship.org/content/qt0hx0207r/qt0hx0207r.pdf

We thank Reviewer 2 for the reference. Consistent with this, one of our PhD students very recently found no relationship between the research experience of corresponding authors and their adoption of open science practices in simulation studies.

b.

I would prefer that even in the abstract and introduction, the authors qualify their claim that “[…]compared to authors of published scientific manuscripts[…]” by adding something like “[…]compared to historical samples of published scientific manuscripts in applied psychology[…]” or “[…]compared to historical samples of published scientific manuscripts in applied psychology from [YEAR] to [YEAR][…]” or even “[…]compared to samples of published psychological science manuscripts[…]”, as someone who reads the abstract could walk away believing that the article IS making comparisons to a representative set of published science or contemporaneous published work, a claim the authors agree is not justified by their data and comparison groups.

We did not change the abstract. With this sentence:

“In this study, we documented the prevalence of responsible- and questionable research practices in 300 psychology master’s theses from Tilburg University, the Netherlands, and associated these practices to supervisor’s grading of the theses.”

the abstract is clear about the methodology of our study, and it is also clear that we do not empirically compare theses to published manuscripts. Only a careless reading would lead someone to believe that we make such an empirical comparison.

The introduction explains our research question (“What evidence of RRPs or QRPs do psychology master’s theses show, and what do their supervisors reward when they grade this final exam?”), which is not about comparing theses to published manuscripts. The introduction also explains several RRPs and QRPs. Only in our discussion do we compare theses to published manuscripts.

To summarize, we did not feel the need to change the abstract or introduction in response to this comment of Reviewer 2.

2.

p. 16, line 656: the Thelwall and Mas-Bleda (2020) article is a considerable improvement on the Swales and Najjar (1987) piece, but one could argue that the relevant comparison group for Tilburg University Psychology Master’s students is not “Social Sciences” but “Psychology and Cognitive Sciences,” where, during the 2014-2018 period that Thelwall and Mas-Bleda study, over 60% of articles mention a hypothesis (see Figure 5). The exact number cannot be deduced from the figure, but the data are posted on FigShare: https://figshare.com/articles/dataset/Research_Questions_and_Explicit_Purposes_in_Journal_Articles/10274012

We fear that Reviewer 2 misinterpreted Thelwall and Mas-Bleda (2020). The articles did not state a hypothesis, as Reviewer 2 argues; rather (citing Thelwall and Mas-Bleda, first sentence of section 4.2 on p. 739): “The terms “hypothesis” and “hypotheses” are common in Psychology and Cognitive Science.” Indeed, many articles in psychology use the term hypothesis, for instance in sentences like ‘we reject the null hypothesis’ or ‘we support the hypothesis’, but these articles generally do not explicitly list their hypotheses. This is also stated explicitly by Thelwall and Mas-Bleda, immediately below the previously cited sentence:

“The terms can be used to discuss statistical results from other papers and in philosophy and mathematics they can be used to frame arguments, so not all matches relate to an article’s main purpose, and only 28% of the random sample checked used the terms to refer to the articles’ main hypothesis or hypotheses.” (2020, p.739).

Note that 28% of, say, 65% is merely about 18%, which is much lower than the corresponding rate in the master’s theses.
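As a back-of-the-envelope check, with 65% taken as an assumed approximate reading of their Figure 5 for Psychology and Cognitive Sciences: $0.28 \times 0.65 \approx 0.18$, i.e., roughly 18% of articles.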

Of course, we must be careful when interpreting this finding, and we are careful (see also our fifth limitation).

Yes, we looked at the data; the supplement of Thelwall and Mas-Bleda is a great resource.

a.

Given that the Thelwall and Mas-Bleda review provides an opportunity to make more of an “apples-to-apples” comparison with a contemporary sample of research in a similar domain, I would suggest the authors revisit this comparison and engage with the implications (i.e., the contrast between Master’s students and published work in the adoption of basic RRPs, like stating a hypothesis, is not as stark today in Psychology as in the past). It is fine, and would enhance the manuscript, to acknowledge this fact and wrestle with the implications.

We did not change our comparison, as we believe Reviewer 2 misinterpreted the results of Thelwall and Mas-Bleda. See our response to the previous comment.

3.

A very minor style suggestion: in the title, “Assessing Questionable- and Responsible Research Practices in Psychology Master’s Theses”, why is there a dash after “Questionable”? I think neither “Questionable” nor “Responsible” should be followed by a dash; in any case, a dash should follow either neither adjective or both adjectives.

Yes, you are right, thanks for spotting this.

4.

Typo: Line 686, missing period after “fields”.

Thanks again, we added the period.

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

Thank you for the opportunity to review this second revision. The authors' latest revision is largely responsive to the second round of referee reports. The authors engaged with my constructive criticism/suggestions in good faith, and while disagreements remain, I have no further major comments and congratulate the authors on a fine contribution.

I maintain that the fact that most of the *citations* are post-2012 does not resolve concerns related to identifying an “apples-to-apples” comparison group of published articles, as many of these publications examine historical rather than contemporaneous scholarship. I would prefer that the authors further qualify some of their claims, but I do not see the manuscript’s claims as misleading in their current form.

I revisited the Thelwall and Mas-Bleda (2020) article in light of the authors’ comments, and I do not see a dissonance between my assertion that “Psychology & Cognitive Science” is a better benchmark than “Social Sciences” regarding the prevalence of RRPs and QRPs in their sample and the content of the article. The authors already use the raw reported prevalence from Figure 5 (their reference to “about 33%” on line 657) in their article. If the authors believe that the prevalence of the word “hypothesis” is a poor benchmark for RRPs in the Psychology & Cognitive Sciences (for the reasons they describe), I do not understand why it is an appropriate benchmark for the Social Sciences, but this is a minor point in the article. I encourage the authors to carefully consider 1) whether the measures in Thelwall & Mas-Bleda’s (2020) article (and Figure 5 in particular) are indeed useful heuristics for the reader and 2) whether it is more appropriate to compare the practices of their sample to all Social Science articles considered by Thelwall and Mas-Bleda versus just the Psychology and Cognitive Science articles.

Again, I don't think that grappling with the fact that standards for psychological research have changed over the past 15 years is a threat to the project—rather I think that engaging further with this reality and its implications would strengthen the piece—but I am comfortable with the authors’ considered view that the current manuscript accurately expresses their views on this topic in sufficient depth.

Author Response

Please see the attached file. 

Author Response File: Author Response.pdf
