# dexter: An R Package to Manage and Analyze Test Data



## 1. Introduction

- Conditional maximum likelihood (CML) estimation is very fast and stable, and does not need any distributional assumptions about the trait being measured;
- We have access to powerful diagnostic tools of item fit based on observable data; this avoids the circularity of judgement that may arise if we test the fit of a model based on ability estimates from the model;
- Multi-stage tests with routing rules based on sum scores can be estimated with CML as in the companion package, dexterMST, with no or very little item exposure prior to the actual testing; this makes it possible to introduce adaptivity in high stakes situations.

## 2. Using the Package

- Functions to start, open, or close a dexter project, input data, define person and item properties, or get information on booklets, items, persons, test design, scoring rules, etc.;
- Functions to evaluate item quality and test reliability. These are typically applied per booklet and include `tia_tables`, `distractor_plot`, `fit_inter`, and `fit_domains`;
- A single function, `fit_enorm`, to ‘calibrate’ the test, i.e., estimate item parameters for the test as a whole;
- Functions to estimate person proficiency. These fall into two groups: functions such as `ability` and `ability_tables` will be more useful when dealing with high stakes tests, while functions such as `plausible_values` and `plausible_scores` are more adapted for (large scale) survey research. Function `individual_differences`, which provides a formal test against the hypothesis that all persons have the same latent ability, also belongs in this group;
- Functions that deal with interactions between person and/or item properties, e.g., `DIF`, `profiles`, `latent_cor`;
- A variety of functions grouped under the name ‘functions of theta’, which compute expected scores (`expected_score`) or test and item information functions (`information`), simulate responses (`r_score`), and so on;
- All other functions, for example those providing support for standard setting.

#### 2.1. Data Entry

A new project is created with function `start_new_project`, which requires the user to provide an exhaustive list of all items in the test, all admissible responses, and the score that will be assigned to each response. This is simply a data frame with three columns: `item_id`, `response`, and `item_score` (as in several other functions, the column names matter but the order is arbitrary). Creating it can be tedious when the test is large, but the information will usually be available in some form and can be reshaped as necessary.
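For illustration, a minimal rules table for a single four-option multiple-choice item, in which the (hypothetical) third option is the correct one, could be built as:

```r
# Hypothetical scoring rules for one multiple-choice item:
# option 3 earns a point, all other options score zero.
rules = data.frame(
  item_id    = 'item01',
  response   = 1:4,
  item_score = c(0, 0, 1, 0)
)
```

In a real test, one row is needed for every admissible response to every item.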

Once defined, scoring rules can be amended with function `touch_rules`.

As an example, the `Unscored` matrix in package irtoys contains the original responses to 18 multiple choice items from 472 persons. To create a new database, we will turn the matrix into a data frame, apply the `keys_to_rules` helper function to generate the scoring rules table, and pass that to function `start_new_project`. The R function `sprintf` is useful in creating sortable variable names. The database will be held in memory, which is specified with the special syntax `:memory:`; for permanent storage, provide a file name instead. Once the project has been created, simply add the data with function `add_booklet`, where `'one'` is an arbitrary booklet ID that can be referenced in later function calls.

```r
data = as.data.frame(irtoys::Unscored)
names(data) = sprintf('item%02d', 1:ncol(data))
keys = data.frame(
  item_id = names(data),
  noptions = 4,
  key = c(2,3,1,1,4,1,2,1,2,3,3,4,3,4,2,2,4,3))
rules = dexter::keys_to_rules(keys, TRUE)
db = dexter::start_new_project(rules, ':memory:')
dexter::add_booklet(db, data, 'one')
```

Person properties are declared in the `start_new_project` function; any variables in the data frame that are not known items or declared person properties will be ignored quietly. Person properties can also be added later with the `add_person_properties` function (an occasion where the user’s own person IDs might be handy), and there is a similar function to add item properties.

When the test comprises several booklets, simply call `add_booklet` repeatedly. dexter will deduce the test design, and functions `get_design` and `design_info` will provide information about it. In particular, the latter function will check whether the design is connected. In today’s practice, and especially with computer-based testing, the data may contain a large number of distinct booklets or already be in long shape; function `add_response_data` (see below) will be more convenient in such cases.

#### 2.2. Tidy Data Structures, Querying, and Subsetting Data

To illustrate, we retrieve the design of the project with the `get_design` function, show how it looks, pivot the original data to long shape with the appropriate function from the tidyr package, and input the result into a new project.

```r
ds = dexter::get_design(db)
head(ds)
#   booklet_id item_id item_position
# 1        one  item01             1
# 2        one  item02             2
# 3        one  item03             3
data$person_id = 1:nrow(data)  # must have person IDs for pivoting
data = tidyr::pivot_longer(data, cols=1:18,
  names_to='item_id', values_to='response')
data$booklet_id = 'one'
head(data)
# A tibble: 6 x 4
#   person_id item_id response booklet_id
#       <int> <chr>      <int> <chr>
# 1         1 item01         2 one
# 2         1 item02         3 one
# 3         1 item03         1 one
db = dexter::start_new_project(rules, ':memory:')
dexter::add_response_data(db, data, design=ds, auto=TRUE)
```

Various functions whose names start with `get_` return information about the booklets, the items and their properties (if any have been supplied), the persons and their properties, the scoring rules, or simply query the data base for responses or test scores. Of particular interest is the function `get_variables`, which returns the list of all variables available for analysis, whether technical, item properties, or person properties. A great advantage of having the proper (tidy) data structures is that all variables, whether on the person or on the item side, are treated on an equal basis and can be freely combined.

#### 2.3. Item and Test Diagnostics

Function `tia_tables` returns a list of data frames containing the well-known classical test theory (CTT) statistics at item and booklet level; these can be easily prettified with packages such as huxtable [18] or used in programming. We highly value visual tools to explore how items ‘behave’ and how well the calibration model fits the data. Basically, there are two of them: the distractor plot, which can be produced with the `distractor_plot` function, and the item-total regressions obtained by applying the generic `plot` function to the output of function `fit_inter`. This is a phase of analysis where the companion package, dextergui, can be very handy because it combines tabular and graphical output into an easy-to-navigate whole.
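As a minimal sketch of both tools, using the verbal aggression data bundled with dexter (the same data analyzed in Section 2.6; the exact argument names may vary slightly between dexter versions):

```r
library(dexter)
# set up a small project from the bundled verbal aggression data
db = start_new_project(verbAggrRules, ":memory:")
add_booklet(db, verbAggrData, "agg")
tt = tia_tables(db)         # CTT statistics per item and per booklet
distractor_plot(db, 'S1DoCurse')  # distractor plot for one item
```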

If the distractor plot reveals a miskeyed item, the scoring rules can be corrected with the `touch_rules` function. Furthermore, we expect all distractors to ‘work’, i.e., have some plausibility for examinees who do not know the correct response. The item on Figure 1 is badly written, because responses 3 and 4 are too obviously wrong for all examinees, effectively turning the item into a toss-a-coin affair for those of low ability.

To produce the item-total regressions, we apply `fit_inter` to the data for a specific booklet, and then we pass its output to the generic plot function. The plot compares three item-total regressions. The observed one, shown with light pink dots, is merely the proportion of correct responses to the item among persons with a given total score on the booklet. The thin black line represents the regression predicted by the enorm, the model that will be used to estimate the item parameters for the test as a whole, but here applied locally to just the data for the specific booklet. The thick gray line depicts a similar regression predicted by Haberman’s interaction model [19]. The two models are discussed in some mathematical detail in Section 3; here we provide a brief and intuitive description.
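A sketch of this workflow on the bundled verbal aggression data (`show.observed` is assumed to be the argument that toggles the observed regression):

```r
library(dexter)
db = start_new_project(verbAggrRules, ":memory:")
add_booklet(db, verbAggrData, "agg")
f = fit_inter(db)   # fit the enorm and the interaction model to one booklet
plot(f, items = 'S1DoCurse', show.observed = TRUE)  # item-total regressions
```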

#### 2.4. IRT Analysis: Estimating the Calibration Model

In the simplest case, calibration is reduced to typing `fit_enorm(db)`, pressing Enter, and done. Additional flexibility can be achieved by using a predicate. We have already mentioned one possible use: to exclude items not reached from the calibration, but one can think of many others. By popular demand, there is the possibility to fix the parameters for some items to prespecified values. Last but not least, there is a choice between two estimation methods: CML [20] or Bayesian estimation through a Gibbs sampler [21]. The function returns an object that is best passed directly and as a whole to other functions, notably the ones that estimate person parameters. Its structure depends slightly on the choice of estimation technique. When CML estimation is used, there is a single set of parameter estimates; if the user chooses Bayesian estimation instead, there will be a set of samples from the posterior distribution of the item parameters. Applied to the output object, the generic coef function will show the parameter estimates and their standard errors; in the case of Bayesian estimation, the output contains the posterior means, the posterior standard deviations, and the 95% highest posterior density intervals. Other statistics can be produced easily: for example, the posterior medians and the median absolute deviations (MAD) can be computed as:

```r
m = dexter::fit_enorm(db, method="Bayes")
pmed = coef(m, what='posterior') |> apply(2, median)
pmad = coef(m, what='posterior') |> apply(2, mad)
```

Applied to the same output object, the generic function `plot` will produce familiar-looking plots of item fit with the latent variable on the horizontal axis.

#### 2.5. IRT Analysis: Estimating Student Ability

For high stakes tests, person proficiency can be estimated with the `ability` and the `ability_tables` functions, and for research one can use the `plausible_values` and `plausible_scores` functions. However, before looking at them, why not ask the more basic question: what if there are no true individual differences in ability? This amounts to an IRT-based test of the hypothesis that reliability is zero, which can be performed with function `individual_differences`:

```r
dexter::individual_differences(db)
# Chi-Square Test for the hypothesis that all respondents
# have the same ability:
# Chi-squared test for given probabilities with simulated p-value
# (based on 2000 replicates)
# X-squared = 328433, df = NA, p-value = 0.0004998
```

Function `ability` takes as minimum arguments a data source and a set of item parameters, and returns a data frame of four variables: the booklet ID, the person ID, the sum score, and the ability estimate. The default estimation method is MLE; the other choices are EAP (expected a posteriori) [23] or Warm’s weighted maximum likelihood estimator (WLE) [24].
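For example, with the bundled verbal aggression data (the column names of the result may vary slightly between dexter versions):

```r
library(dexter)
db = start_new_project(verbAggrRules, ":memory:")
add_booklet(db, verbAggrData, "agg")
parms = fit_enorm(db)                    # CML calibration
abl = ability(db, parms, method = "MLE") # person ID, sum score, theta
head(abl)
```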

Function `ability_tables` returns a data frame with the booklet ID, the (unequated) booklet sum score, the corresponding (equated) ability estimate, and the standard error of the latter. If grade reporting is done on equated scores, the correspondence can be established easily from the table.

For tests with a defined pass-fail score, function `probability_to_pass` estimates the equivalent score in a target test, based on ideas from ROC analysis [25,26]. Use the generic coef method to extract the probability to pass for each booklet and score, and the generic plot function to display plots of the probabilities, sensitivity and specificity, and the ROC. This is another method unique to dexter and of interest beyond psychometrics because it extends the popular ROC analysis to situations where a large part of the data is missing by design.

The `plausible_values` function offers a choice of priors:

- A common normal prior for all persons;
- A mixture of two normal distributions not related to any background variable but expected to accommodate many situations (asymmetry, heavy tails, etc.) more flexibly than a normal prior;
- A hierarchical normal prior with a group for each category of one or more nominal background variables.
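The third option can be sketched as follows (verbal aggression data again; the `covariates` argument is assumed to select the hierarchical prior):

```r
library(dexter)
db = start_new_project(verbAggrRules, ":memory:",
                       person_properties = list(gender = "unknown"))
add_booklet(db, verbAggrData, "agg")
parms = fit_enorm(db)
# one normal prior per gender group, ten plausible values per person
pv = plausible_values(db, parms, covariates = 'gender', nPV = 10)
```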

Like `ability` and `ability_tables`, the `plausible_values` function can take the object returned by `fit_enorm` as one of its parameters. If item parameters have been estimated by CML, the single set of estimates is held constant. A special feature in dexter that, to the best of our knowledge, is not available in any other package, is the ability to condition each draw from the posterior distribution of ability (i.e., each PV) on a different draw from the posterior distribution of item parameters. We believe that this approach better captures all sources of uncertainty inherent to the measurement.

Closely related is the `plausible_scores` function. In a design with planned missingness, it produces plausible scores for all items: the person’s actual scores on the administered items are retained, and scores for the ones not administered are predicted based on plausible values. Ref. [31] showed how plausible scores can be used to relax the IRT model in international surveys by applying the market basket approach.
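The call itself is simple. With a complete design such as the verbal aggression data, the plausible scores simply reproduce the observed scores, but the same call works in an incomplete design (a sketch; `nPS` sets the number of plausible scores per person):

```r
library(dexter)
db = start_new_project(verbAggrRules, ":memory:")
add_booklet(db, verbAggrData, "agg")
parms = fit_enorm(db)
ps = plausible_scores(db, parms, nPS = 1)
```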

#### 2.6. Measurement Invariance

dexter offers two functions to study measurement invariance: `DIF` is more exploratory and useful when there are known groups (defined by a categorical person property) but no a priori hypotheses on the item side, while `profile_plot` is applicable when both a person property (groups) and an item property (domains) are known in advance and of interest. Both functions are rather different from the techniques found in other packages, so we need to describe them in some detail.

`gender="unknown"`defines a default value that will be overwritten by the actual values. The output is an object that contains an overall DIF statistic and two square, skew-symmetric matrices,

`Delta_R`and

`DIF_pair`.

`Delta_R`is computed as the difference between the two square matrices of pairwise differences between the item difficulties, as estimated for men and women separately.

`DIF_pair`is obtained by dividing each element of

`Delta_R`by its estimated variance; thus, it is the matrix of standardized differences, or effect sizes.

```r
dich = dexter::verbAggrRules             # provided with the package
dich$item_score[dich$item_score==2] = 1  # dichotomize the items
db = dexter::start_new_project(dich, ":memory:",
  person_properties=list(gender="unknown"))
dexter::add_booklet(db, verbAggrData, "agg")
d = dexter::DIF(db, 'gender')
str(d)
# List of 5
#  $ DIF_overall :List of 3
#   ..$ stat: num 68.8
#   ..$ df  : num 23
#   ..$ p   : num 1.86e-06
#  $ DIF_pair    : num [1:24, 1:24] 0 -0.819 1.005 1.43 1.347 ...
#  $ Delta_R     : num [1:24, 1:24] 0 -0.391 0.484 0.688 0.632 ...
#  $ group_labels: chr [1:2] "Female" "Male"
#  $ items       :'data.frame': 24 obs. of 2 variables:
#   ..$ item_id   : chr [1:24] "S1DoCurse" "S1DoScold" ...
#   ..$ item_score: int [1:24] 1 1 1 1 1 1 1 1 1 1 ...
#  - attr(*, "class")= chr [1:2] "DIF_stats" "list"
```

Figure 3 shows a clustered heatmap of the unstandardized differences (`Delta_R`) produced with the code below. The default clustering of rows and columns readily reveals two groups of items defined primarily by the mode of behavior: ‘Do’ vs. ‘Want’. The relative difficulty of Do and Want items is different for men and women.

```r
rownames(d$Delta_R) = colnames(d$Delta_R) = d$items$item_id
library(pheatmap)
pheatmap::pheatmap(d$Delta_R)
```

Applied to the same object, the generic `plot` function produces a somewhat different display, as shown on Figure 4. This is based on the absolute values of the effect sizes (the skew-symmetric matrix becomes symmetric), and the color scheme is calibrated such that values between 0 and 1.96 are shown in shades of blue while those exceeding the critical value progress from yellow to red (the critical value can be changed with parameter `alpha`). The plot is not clustered, but it can be rearranged by passing the item IDs in any desired order: alphabetically, by some item property, sorted by cluster analysis, etc.

```r
print(d)
# Test for DIF: Chi-square = 68.798, df = 23, p = < 0.0006
```

Profile analysis is available through the `profiles` and `profile_plot` functions. These follow the logic of profile analysis as proposed by Verhelst [35], although we do not compute all the statistics therein. Profile analysis is an intuitive and robust diagnostic test, and it can be very helpful when the test is not perfectly unidimensional. For example, if we have items on algebra, geometry, and probability, it is our choice (and responsibility) whether to produce three unidimensional tests, a multidimensional test, or at least perform profile analysis as a kind of residuals analysis within the univariate test covering the three domains. Conditional on the overall sum score gained by the person, profile analysis estimates expected domain scores, which can be compared with the observed domain scores. The vector of observed domain scores is called the observed profile, the vector of expected domain scores is called the expected profile, and the vector of their differences is called the deviation profile. If the profiles are purely individual, the deviations can be expected to cancel when aggregating over teachers, schools, or countries; otherwise, they can provide useful information on systematic effects due to differences in teaching quality or policy.

```r
dexter::add_item_properties(db, verbAggrProperties)
f = dexter::fit_enorm(db)
p = dexter::profiles(db, f, 'behavior')
```

```r
dexter::profiles(db, f, 'mode') |>
  dplyr::inner_join(get_persons(db)) |>
  dplyr::group_by(gender, mode) |>
  dplyr::summarize(os=mean(domain_score),
    es=mean(expected_domain_score), dv=os-es) |>
  tidyr::pivot_wider(id_cols='gender', names_from='mode', values_from='dv')
#   gender     Do   Want
#   <chr>   <dbl>  <dbl>
# 1 Female -0.191  0.191
# 2 Male    0.635 -0.635
```

Alternatively, we can use the `profile_plot` function to build a profile plot similar to the one shown on Figure 5. The two axes show the two domain scores, while the gray lines join the points where the two domain scores add up to the same sum scores. For each person group, there is a stairlike line joining the modal (most frequent) combination of domain scores for the person group at each test score. This may sound convoluted, but a single glance at our example plot immediately reveals that, at any overall level of verbal aggressiveness, women “do” less than they “want” as compared to men.
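A plot of this kind can be produced along the following lines (a sketch; the argument names `item_property` and `covariate` are assumed):

```r
library(dexter)
db = start_new_project(verbAggrRules, ":memory:",
                       person_properties = list(gender = "unknown"))
add_booklet(db, verbAggrData, "agg")
add_item_properties(db, verbAggrProperties)
# domain scores by mode ('Do' vs. 'Want'), separate lines per gender
profile_plot(db, item_property = 'mode', covariate = 'gender')
```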

## 3. Theory and Implementation

#### 3.1. The Extended Normal Response Model (enorm)

#### 3.2. Estimation

#### 3.3. Goodness of Fit and the Interaction Model

#### 3.4. Ability Estimation

**repeat**

- draw an ability $\theta^{*}$ from the prior distribution
- simulate a response pattern $\mathbf{y}$ for someone with ability $\theta^{*}$

**until** $y_{+} = x_{+}$; **return** $\theta^{*}$
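A toy version of this rejection algorithm for the dichotomous Rasch model can be written in a few lines of base R (a didactic sketch, not dexter's optimized implementation; a standard normal prior and hypothetical item difficulties are assumed):

```r
# Draw one plausible value by rejection: repeat until the simulated
# sum score equals the observed sum score x_plus.
sample_pv = function(delta, x_plus) {
  repeat {
    theta = rnorm(1)                      # ability drawn from the N(0,1) prior
    p = 1 / (1 + exp(-(theta - delta)))   # Rasch success probabilities
    y = rbinom(length(delta), 1, p)       # simulated response pattern
    if (sum(y) == x_plus) return(theta)   # accept when y_+ = x_+
  }
}
set.seed(42)
pv = replicate(5, sample_pv(delta = c(-1, 0, 1), x_plus = 2))
```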

```r
# pv holds the latest m plausible values; update the normal prior
m = length(pv)
pvv = var(pv)
sigma = sqrt(1/rgamma(1, shape=(m-1)/2, rate=((m-1)/2)*pvv))
pvm = mean(pv)
mu = rnorm(1, pvm, sigma/sqrt(m))
```

where `pv` is the latest set of m PVs generated. In practice, we update the priors following the multilevel approach in Ref. [44], Chapter 18 (see also the code on p. 399f). Updating the mixture prior follows the logic explained, e.g., in Chapter 6 of [45].

#### 3.5. Other Features and Innovations

- Our specific approach to DIF, which focuses on item pairs rather than individual items, was introduced in Ref. [33]. We have illustrated it in some detail in the preceding section, so we direct the reader to the original paper for a more formal discussion;
- The use of plausible scores to relax the reliance on common IRT models in comparative research is discussed in Ref. [31] and demonstrated in the next section;
- There is a formal test of individual differences against the null hypothesis that all persons have the same latent ability; it is explained in a package vignette (see the previous section for an example);
- For tests with a defined pass-fail score, there is a novel equating method based on ROC analysis [25]; again, there is a detailed vignette;
- Function `latent_cor` estimates correlations between latent traits within a dexter project; use an item property to specify the items belonging to each scale;
- A particularly promising application of the models in dexter concerns multi-stage tests with predefined routing rules on observed scores. The theoretical foundations are discussed by Zwitser and Maris [46], and the implementation is in the companion package, dexterMST [47], which also contains a detailed vignette. The major advantage of this approach to adaptivity is that it circumvents many inherent properties of computerized adaptive testing (CAT) that are less attractive in high stakes situations—in particular, unwanted item exposure in both pre-calibration and administration. In contrast, the multi-stage tests implemented in dexterMST are calibrated similar to linear tests, requiring minimal pre-testing.
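The `latent_cor` entry above can be illustrated with the verbal aggression data, where the item property `mode` partitions the items into a 'Do' and a 'Want' scale (a sketch; the argument name `item_property` is assumed):

```r
library(dexter)
db = start_new_project(verbAggrRules, ":memory:")
add_booklet(db, verbAggrData, "agg")
add_item_properties(db, verbAggrProperties)
lc = latent_cor(db, item_property = 'mode')  # correlation between the scales
```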

## 4. A More Extensive Example

```r
library(dexter)
library(dplyr)
library(tidyr)
library(readr)
library(SAScii)
oecd = "http://www.oecd.org/pisa/pisaproducts/"
url1 = paste0(oecd, "INT_COG12_S_DEC03.zip")
url2 = paste0(oecd, "PISA2012_SAS_scored_cognitive_item.sas")
zipfile = tempfile(fileext='.zip')
utils::download.file(url1, zipfile)
fname = utils::unzip(zipfile, list=TRUE)$Name[1]
utils::unzip(zipfile, files = fname, overwrite=TRUE)
unlink(zipfile)  # erase from disk
dict_scored = SAScii::parse.SAScii(sas_ri = url2)
```

```r
data_scored = readr::read_fwf(
    file = fname,
    col_positions = fwf_widths(dict_scored$width,
      col_names = dict_scored$varname)) |>
  dplyr::select(CNT, SCHOOLID, STIDSTD, BOOKID, starts_with('PM'))
unlink(fname)  # erase from disk
data_scored$BOOKID = sprintf('B%02d', data_scored$BOOKID)
data_scored[data_scored==7] = NA
data_scored[data_scored==8] = NA
```

The scoring rules are derived from the data itself and collected in a `rules` object, which is passed to the `start_new_project` function.

```r
rules = tidyr::gather(data_scored, key='item_id', value='response',
    starts_with('PM')) |>
  dplyr::distinct(item_id, response) |>
  dplyr::mutate(item_score = ifelse(is.na(response), 0, response))
db = dexter::start_new_project(rules, "pisa2012.db",
  person_properties=list(
    cnt = '<unknown country>',
    schoolid = '<unknown school>',
    stidstd = '<unknown student>'
  )
)
```

```r
for(bkdata in split(data_scored, data_scored$BOOKID)) {
  # remove columns that only have NA values
  bkrsp = bkdata[, apply(bkdata, 2, function(x) !all(is.na(x)))]
  dexter::add_booklet(db, bkrsp, booklet_id = bkdata$BOOKID[1])
}
rm(data_scored)
```

```r
item_by_cnt = dexter::get_responses(db, columns=c('item_id', 'cnt')) |>
  dplyr::distinct()
market_basket = Reduce(intersect,
  split(item_by_cnt$item_id, item_by_cnt$cnt))
dexter::add_item_properties(db,
  dplyr::tibble(item_id = market_basket, in_basket = 1),
  default_values = list(in_basket = 0L))
```

```r
system.time({item_parms = dexter::fit_enorm(db)})
#    user  system elapsed
#   10.80    0.63   11.07
```

```r
system.time({pv = dexter::plausible_values(db, parms=item_parms, nPV=5)})
#    user  system elapsed
#   11.44    0.50   11.93
dim(pv)
# [1] 485490      8
```

The student questionnaire file is much larger, so we raise R’s `timeout` option, without which the download may time out. From the large file, we retain only the country identifier and the plausible values.

```r
url3 = paste0(oecd, 'INT_STU12_DEC03.zip')
url4 = paste0(oecd, 'PISA2012_SAS_student.sas')
zipfile = tempfile(fileext='.zip')
options(timeout = max(300, getOption("timeout")))
utils::download.file(url3, zipfile)
fname = utils::unzip(zipfile, list=TRUE)$Name[1]
utils::unzip(zipfile, files = fname, overwrite=TRUE)
dict_quest = SAScii::parse.SAScii(sas_ri = url4) |>
  dplyr::mutate(end = cumsum(width), beg = end - width + 1) |>
  dplyr::filter(grepl('CNT|PV.MATH', varname))
data_quest = readr::read_fwf(file = fname,
  fwf_positions(dict_quest$beg, dict_quest$end, dict_quest$varname))
unlink(zipfile)
unlink(fname)
```

Country-specific calibration combined with the market basket approach is easy with the `plausible_scores` function. The following code selects responses to items that have been asked in all participating countries, fits a different IRT model in each country, and computes plausible scores (there are also some trivial adjustments for items that did not have any correct responses in a few countries). As explained in Section 1, plausible scores are sum scores over all items, where item scores for the items actually asked to the person under the incomplete design are retained ‘as is’, while item scores for the remaining items are simulated from plausible values.

```r
basket = dexter::get_items(db) |>
  dplyr::filter(in_basket == 1) |>
  dplyr::pull(item_id)
basket = setdiff(basket, c('PM828Q01', 'PM909Q01', 'PM985Q03'))
resp = dexter::get_responses(db, predicate = item_id %in% basket,
  columns=c('person_id', 'item_id', 'item_score', 'cnt'))
ps = resp |>
  dplyr::filter(item_id %in% basket) |>
  dplyr::group_by(cnt) |>
  dplyr::do({dexter::plausible_scores(., dexter::fit_enorm(.), nPS=1)})
```

## 5. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2005; ISBN 3-900051-07-0.
2. Verhelst, N.D.; Glas, C.A.W.; Verstralen, H.H.F.M. OPLM: One Parameter Logistic Model. Computer Program and Manual; Cito: Arnhem, The Netherlands, 1993.
3. Kiefer, T.; Robitzsch, A.; Wu, M. TAM: Test Analysis Modules, R Package Version 1.99-6. 2016. Available online: https://CRAN.R-project.org/package=TAM (accessed on 28 March 2023).
4. Chalmers, R.P. mirt: A Multidimensional Item Response Theory Package for the R Environment. J. Stat. Softw. **2012**, 48, 1–29.
5. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking, 2nd ed.; Springer: New York, NY, USA, 2004.
6. von Davier, A. (Ed.) Statistical Models for Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2011.
7. von Davier, A.A.; Holland, P.W.; Thayer, D.T. The Kernel Method of Test Equating; Springer: New York, NY, USA, 2004.
8. González, J.; Wiberg, M. Applying Test Equating Methods; Springer International Publishing: Cham, Switzerland, 2017.
9. Bock, R.D. Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika **1972**, 37, 29–51.
10. Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests; University of Chicago Press: Chicago, IL, USA, 1960.
11. Masters, G.N. A Rasch Model for Partial Credit Scoring. Psychometrika **1982**, 47, 149–174.
12. Holland, P.W. Measurements or contests? Comments on Zwick, Bond, and Allen/Donoghue. In Proceedings of the American Statistical Association: 1994 Proceedings of the Social Statistics Section, Toronto, Canada, 13–18 August 1994; American Statistical Association: Alexandria, VA, USA, 1995.
13. International Association of Athletics Federations. IAAF Competition Rules 2018–2019, in Force from 1 November 2017; International Association of Athletics Federations: Monaco, 2017.
14. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F., Novick, M., Eds.; Addison-Wesley: Reading, MA, USA, 1968; Chapter 13; pp. 397–479.
15. Robitzsch, A.; Lüdtke, O. Some thoughts on analytical choices in the scaling model for test scores in international large-scale assessment studies. Meas. Instrum. Soc. Sci. **2022**, 4, 1–20.
16. Partchev, I.; Maris, G. irtoys: A Collection of Functions Related to Item Response Theory (IRT), R Package Version 0.2.2. 2022. Available online: https://CRAN.R-project.org/package=irtoys (accessed on 28 March 2023).
17. Wickham, H. Tidy Data. J. Stat. Softw. **2014**, 59.
18. Hugh-Jones, D. huxtable: Easily Create and Style Tables for LaTeX, HTML and Other Formats, R Package Version 5.5.2. 2022. Available online: https://CRAN.R-project.org/package=huxtable (accessed on 28 March 2023).
19. Haberman, S.J. The Interaction Model. In Multivariate and Mixture Distribution Rasch Models: Extensions and Applications; von Davier, M., Carstensen, C., Eds.; Springer: New York, NY, USA, 2007; Chapter 13; pp. 201–216.
20. Andersen, E.B. The Numerical Solution of a Set of Conditional Estimation Equations. J. R. Stat. Soc. Ser. B (Methodological) **1972**, 34, 42–54.
21. Maris, G.; Bechger, T.; San Martin, E. A Gibbs Sampler for the (Extended) Marginal Rasch Model. Psychometrika **2015**, 80, 859–879.
22. Marsman, M.; Maris, G.; Bechger, T.M.; Glas, C.A.W. What can we learn from plausible values? Psychometrika **2016**, 81, 274–289.
23. Bock, R.D.; Mislevy, R.J. Adaptive EAP Estimation of Ability in a Microcomputer Environment. Appl. Psychol. Meas. **1982**, 6, 431–444.
24. Warm, T.A. Weighted Likelihood Estimation of Ability in Item Response Theory. Psychometrika **1989**, 54, 427–450.
25. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. **2006**, 27, 861–874.
26. Krzanowski, W.; Hand, D. ROC Curves for Continuous Data; CRC Press: Boca Raton, FL, USA, 2009.
27. Ashton, K.; Jones, N.; Maris, G.; Schouwstra, S.; Verhelst, N.; Partchev, I.; Koops, J.; Robinson, M.; Chattopadhyay, M.; Hideg, G.; et al. Technical Report for the First European Survey on Language Competences; Publications Office of the European Union: Luxembourg, 2012.
28. Keuning, J.; Straat, J.H.; Feskens, R.C.W. The Data-Driven Direct Consensus (3DC) Procedure: A New Approach to Standard Setting. In Theoretical and Practical Advances in Computer-Based Educational Measurement; Springer Nature: Cham, Switzerland, 2017; pp. 263–278.
29. von Davier, M.; Sinharay, S.; Oranje, A.; Beaton, A. The Statistical Procedures Used in National Assessment of Educational Progress: Recent Developments and Future Directions. In Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2007; Volume 26, Chapter 32; pp. 1039–1055.
30. Marsman, M.; Maris, G.; Bechger, T.; Glas, C. Turning simulation into estimation: Generalized exchange algorithms for exponential family models. PLoS ONE **2017**, 12, e0169787.
31. Zwitser, R.J.; Glaser, S.S.F.; Maris, G. Monitoring Countries in a Changing World: A New Look at DIF in International Surveys. Psychometrika **2017**, 82, 210–232.
32. Cuellar, E.; Partchev, I.; Zwitser, R.; Bechger, T. Making sense out of measurement non-invariance: How to explore differences among educational systems in international large-scale assessments. Educ. Assess. Eval. Account. **2021**, 33, 9–25.
33. Bechger, T.M.; Maris, G. A statistical test for Differential item pair functioning. Psychometrika **2015**, 80, 317–340.
34. Vansteelandt, K. Formal Methods for Contextualized Personality Psychology. Ph.D. Thesis, K.U. Leuven, Leuven, Belgium, 2000.
35. Verhelst, N.D. Profile Analysis: A Closer Look at the PISA 2000 Reading Data. Scand. J. Educ. Res. **2012**, 56, 315–332.
36. Cressie, N.; Holland, P.W. Characterizing the manifest probabilities of latent trait models. Psychometrika **1983**, 48, 129–141.
37. Verhelst, N.D.; Glas, C.A. The one parameter logistic model. In Rasch Models; Springer: New York, NY, USA, 1995; pp. 215–237.
38. Verhelst, N.D.; Glas, C.A.W.; Van der Sluis, A. Estimation problems in the Rasch model: The basic symmetric functions. Comput. Stat. Q. **1984**, 1, 245–262.
39. Koops, J.; Bechger, T.; Maris, G. Bayesian Inference for Multistage and other Incomplete Designs. 2020; submitted.
40. Maris, G.; Bechger, T.; Marsman, M. Bayesian Inference in Large-Scale Computational Psychometrics. In Computational Psychometrics: New Methodologies for a New Generation of Digital Learning and Assessment: With Examples in R and Python; Springer Nature: Cham, Switzerland, 2021; pp. 109–131.
41. Andersen, E.B. A goodness of fit test for the Rasch model. Psychometrika **1973**, 38, 123–140.
42. Mislevy, R.J. Randomization-based inference about latent variables from complex samples. Psychometrika **1991**, 56, 177–196.
43. Marsman, M.; Bechger, T.B.; Maris, G.K. Composition algorithms for conditional distributions. In Essays on Contemporary Psychometrics; Springer Nature: Cham, Switzerland, 2022; pp. 219–250.
44. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: Cambridge, UK, 2006.
45. Marin, J.M.; Robert, C.P. Bayesian Essentials with R; Springer: New York, NY, USA, 2014; Volume 48.
46. Zwitser, R.J.; Maris, G. Conditional Statistical Inference with Multistage Testing Designs. Psychometrika **2013**, 80, 65–84.
47. Bechger, T.; Koops, J.; Partchev, I.; Maris, G. dexterMST: CML and Bayesian Calibration of Multistage Tests, R Package Version 0.9.3. 2022. Available online: https://CRAN.R-project.org/package=dexterMST (accessed on 28 March 2023).
48. Damico, A.J. SAScii: Import ASCII Files Directly into R Using Only a SAS Input Script, R Package Version 1.0.1. 2022. Available online: https://CRAN.R-project.org/package=SAScii (accessed on 28 March 2023).
49. Kreiner, S.; Christensen, K.B. Analyses of Model Fit and Robustness. A New Look at the PISA Scaling Model Underlying Ranking of Countries According to Reading Literacy. Psychometrika **2013**, 79, 210–231.
50. Oliveri, M.E.; von Davier, M. Investigation of model fit and score scale comparability in international assessments. Psychol. Test Assess. Model. **2011**, 53, 315–333.

**Figure 3.**A clustered heatmap of the matrix of the inter-group differences in the pairwise differences in difficulty between items.

**Figure 6.** If we estimate item difficulties in groups of respondents with the same test scores, we commonly observe a linear relationship between item difficulty and test score. The **left** panel shows a Rasch item whose difficulty is independent of the sum score. The panel on the **right** shows an item conforming to the interaction model.

**Figure 8.**Country means of the first plausible value as estimated by PISA and plausible scores as estimated by dexter from country-specific IRT models.



**Citation:** Partchev, I.; Koops, J.; Bechger, T.; Feskens, R.; Maris, G. dexter: An R Package to Manage and Analyze Test Data. *Psych* **2023**, *5*, 350-375. https://doi.org/10.3390/psych5020024