# Tracking with (Un)Certainty


## Abstract


## 1. Introduction

#### 1.1. Elo Rating System

- ${R}_{n}$ is the new rating after the event.
- ${R}_{o}$ is the pre-event rating.
- $K$ is the rating point value of a single game score.
- $W$ is the actual game score, each win counting 1, each draw $1/2$.
- ${W}_{e}$ is the expected game score based on ${R}_{o}$.
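
The symbols above fit the standard Elo update, ${R}_{n}={R}_{o}+K(W-{W}_{e})$. A minimal Python sketch of that update follows. The function names are hypothetical, and the logistic form of the expected score (base 10, scale 400) is the common chess convention, assumed here rather than taken from this text.

```python
def expected_score(r_own: float, r_opp: float, scale: float = 400.0) -> float:
    """Expected score for one game (assumed logistic form, base 10, scale 400)."""
    return 1.0 / (1.0 + 10.0 ** ((r_opp - r_own) / scale))


def elo_update(r_old: float, k: float, w: float, w_e: float) -> float:
    """Return the new rating R_n = R_o + K * (W - W_e)."""
    return r_old + k * (w - w_e)


# Two equally rated players: the expected score is 0.5,
# so a win moves the winner's rating up by K / 2.
w_e = expected_score(1500.0, 1500.0)
new_rating = elo_update(1500.0, k=32.0, w=1.0, w_e=w_e)
```

Passing ${W}_{e}$ in as an argument keeps the update independent of how the expectation is computed, so a different expectation model could be swapped in.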

#### 1.2. Math Garden

#### 1.3. Research with Math Garden

#### 1.4. Challenges in Elo Rating Systems

#### 1.5. Alternatives to Elo Rating Systems

#### 1.6. Three Problems in Rating Systems

#### 1.7. Outline

## 2. Methods

#### 2.1. The Urnings Algorithm

**Algorithm 1:** Game of Chance

1. **repeat**
   - draw ${Y}_{p}\sim \mathrm{Bernoulli}\left({\pi}_{p}\right)$
   - draw ${Y}_{i}\sim \mathrm{Bernoulli}\left({\pi}_{i}\right)$
2. **until** ${Y}_{p}\ne {Y}_{i}$
3. **return** ${X}_{pi}={Y}_{p}$
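
Algorithm 1 can be sketched in Python as a rejection loop. The function name, the `rng` parameter, and the guard against inputs that can never produce unequal draws are illustrative assumptions, not part of the algorithm as stated.

```python
import random
from typing import Optional


def game_of_chance(pi_p: float, pi_i: float,
                   rng: Optional[random.Random] = None) -> int:
    """Draw Y_p and Y_i until they differ, then return X_pi = Y_p."""
    if not (0.0 <= pi_p <= 1.0 and 0.0 <= pi_i <= 1.0):
        raise ValueError("pi_p and pi_i must lie in [0, 1]")
    # Assumed guard: if both values are 0 or both are 1, the draws never differ.
    if pi_p * (1.0 - pi_i) + (1.0 - pi_p) * pi_i == 0.0:
        raise ValueError("the draws can never differ for these inputs")
    rng = rng or random.Random()
    while True:
        y_p = rng.random() < pi_p  # Y_p ~ Bernoulli(pi_p)
        y_i = rng.random() < pi_i  # Y_i ~ Bernoulli(pi_i)
        if y_p != y_i:
            return int(y_p)
```

Conditioning on ${Y}_{p}\ne {Y}_{i}$ gives $P({X}_{pi}=1)={\pi}_{p}(1-{\pi}_{i})/\left[{\pi}_{p}(1-{\pi}_{i})+(1-{\pi}_{p}){\pi}_{i}\right]$, so the loop is one way to sample from that distribution without computing it explicitly.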

**Algorithm 2:** Game of Chance with Urnings

1. **repeat**
   - draw ${Y}_{p}^{*}\sim \mathrm{Bernoulli}({u}_{p}/{n}_{p})$
   - draw ${Y}_{i}^{*}\sim \mathrm{Bernoulli}({u}_{i}/{n}_{i})$
2. **until** ${Y}_{p}^{*}\ne {Y}_{i}^{*}$
3. **return** ${X}_{pi}^{*}={Y}_{p}^{*}$
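
Algorithm 2 has the same structure, with the proportions ${u}_{p}/{n}_{p}$ and ${u}_{i}/{n}_{i}$ in place of ${\pi}_{p}$ and ${\pi}_{i}$. The Python sketch below covers only this simulated game, not any subsequent rating update. The function name, the `rng` parameter, the input checks, and the reading of each $u$ as a count between 0 and the urn size $n$ are assumptions made for illustration.

```python
import random
from typing import Optional


def game_of_chance_urnings(u_p: int, n_p: int, u_i: int, n_i: int,
                           rng: Optional[random.Random] = None) -> int:
    """Draw Y*_p and Y*_i from the urn proportions until they differ."""
    for u, n in ((u_p, n_p), (u_i, n_i)):
        # Assumed reading: u is a count inside an urn of size n.
        if n <= 0 or not 0 <= u <= n:
            raise ValueError("need n > 0 and 0 <= u <= n for both urns")
    prob_p = u_p / n_p
    prob_i = u_i / n_i
    # Assumed guard: both urns all-0 or all-1 would loop forever.
    if prob_p * (1.0 - prob_i) + (1.0 - prob_p) * prob_i == 0.0:
        raise ValueError("the draws can never differ for these urns")
    rng = rng or random.Random()
    while True:
        y_p = rng.random() < prob_p  # Y*_p ~ Bernoulli(u_p / n_p)
        y_i = rng.random() < prob_i  # Y*_i ~ Bernoulli(u_i / n_i)
        if y_p != y_i:
            return int(y_p)  # X*_pi = Y*_p
```

For example, `game_of_chance_urnings(10, 10, 0, 10)` always returns 1, because the person urn yields only ones and the item urn only zeros.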

#### 2.2. Simulation Setup

## 3. Results

#### 3.1. Simulation Results

#### 3.2. Real Data Example: Math Garden

#### 3.2.1. Description of the Data

#### 3.2.2. Results

## 4. Discussion

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| ERS | Elo Rating System |
| CAL | Computer Adaptive Learning |
| BKT | Bayesian Knowledge Tracing |
| IRT | Item Response Theory |

## Appendix A. Illustration of the MH-Step

**Figure A1.** Joint distribution of the current and the new states of the Markov chain and the invariant distribution of the chain (black curve on top) compared to the standard normal distribution (blue dashed curve). In (**a**) the transition kernel is selected randomly, in (**b**) the transition kernel is selected depending on the current state, and in (**c**) the transition kernel depends on the current state, but this dependence is corrected for in a Metropolis–Hastings step.

1. See Deonovic et al. (2018) for a description of the relation between IRT and BKT.

2. This is also recognized in international chess competitions: article 10 of the FIDE handbook describes how rating drift should be monitored, https://www.fide.com/fide/handbook.html?id=197&view=article.

3. The interested reader can find the simulation code in the following OSF project: https://osf.io/8wgvb/.

4. This is lower than for the persons, but not inconsistent with the 95% confidence interval, since there are only 100 items.

5. In comparing observed and expected rating distributions, the proper error distribution is added to the expected ratings; see, for example, Brinkhuis (2014).

6. The interested reader can find the code to estimate the Urnings algorithm in this OSF project: https://osf.io/8wgvb/. Access to the data can be requested by contacting the first author.

**Figure 1.** A screenshot of a single item in the Deductive Mastermind game (**left**) and in the Subtraction game (**right**). In the Deductive Mastermind game, the coloured circles refer to either a correctly placed flower (green), a correct flower at the wrong location (orange), or a wrong flower (red). For this item the orange flower is the correct solution.

**Figure 2.** Rating variance inflation in a simulated (**left**) and a real data example (**right**) using the Elo Rating System (ERS).

**Figure 3.** The left panel shows the true ${\pi}_{i}$ versus estimated ${u}_{i}/{n}_{i}$ item ratings, and the 95% confidence interval (CI) implied by the urn size. The right panel shows the SD of the item and person ratings throughout the simulation. The horizontal lines reflect the SD based on the samples from the true values.

**Figure 4.** The left panel depicts the estimates of two players (one with a low and one with a medium true value) throughout the simulation, with the grey bars indicating the 95% CI. The right panel shows the close correspondence between the cumulative distribution function based on the estimates (coloured dots) and the one based on the true values (black line).

**Figure 5.** The tracked rating follows the true value of a simulated player that showed a jump in ability at iteration 9800.

**Figure 6.** A visualisation of model fit for both analysed games, comparing observed (black dots) and expected (blue line) probabilities of a correct response for each of the binned differences between the logits of ${u}_{p}/{n}_{p}$ and ${u}_{i}/{n}_{i}$.

**Figure 7.** The average rating of players born in 2007, between 2015 and 2019 (Mastermind) and between 2014 and 2017 (Subtraction).

**Figure 8.** The rating development (and the CI in grey) of a player. The horizontal line at ${u}_{p}/{n}_{p}=0.7$ indicates an (arbitrary) reference point that could indicate sufficient ability in the subtraction domain.

