A COUNTERFACTUAL OFF-POLICY AUDIT OF ELO-DRIVEN ITEM SELECTION IN AN ADAPTIVE GEOGRAPHY TUTOR

Dr. Mustafa Hameed; Dr. Musarat Karim; Dr. Muhammad Nauman; Dr. Nadia Khan; Ms. Alisha Fida

Keywords:

Abstract

Adaptive learning systems make millions of item-selection decisions daily, yet learning-analytics practice still evaluates them almost entirely through retrospective A/B tests or by re-running the deployed policy on a held-out cohort. We pose a sharper question about an adaptive geography tutor whose interaction log is public: is the deployed Elo-driven item-selection policy operationally distinguishable from uniform random sampling on the reward the system optimises? To answer it we use classical counterfactual estimators from the contextual-bandits literature, which estimate, from logged interaction data alone, what would have happened under alternative policies. We run inverse propensity scoring (IPS), self-normalised IPS (SNIPS), doubly robust (DR), and Switch-DR estimators on the public Slepemapy.cz log (≈ 3.4 M answers, 36 947 students, 1 681 items, five top-volume geographic contexts), comparing the deployed adaptive policy against uniform-random, easiest-first, and hardest-first counterfactual policies under two reward signals: response correctness and log-response-time. The results cluster around three points. First, the headline: on immediate correctness, the deployed Elo-driven policy is indistinguishable from uniform-random sampling (DR uniform − deployed within ≈ ±0.02 across all five contexts), which means that whatever value the policy carries must live in a delayed-reward channel that the public log does not expose. Second, the size of the available headroom: a deterministic easiest-first target policy would raise DR-estimated correctness by +13 to +18 pp over the deployed policy, while hardest-first would lower it by 11 to 19 pp, with bootstrap 95 % CIs that exclude zero.
Third, a caution about the estimators themselves: IPS turns pathological under the deterministic targets, its sample mean exceeding 1.0 in three of five contexts (up to 1.47), whereas SNIPS and DR stay within [0, 1] and agree to within 0.04 of each other, a textbook illustration of the bias-variance trade-off that the off-policy evaluation (OPE) literature has long flagged. We report bootstrap 95 % CIs and effective sample size for every (estimator, policy, clip) cell, audit estimator variance under weight clipping, and stratify by novice versus expert sub-population.
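
The estimators named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ope_estimates`, its argument names, the toy data, and the particular Switch-DR form (dropping the importance-weighted correction wherever the weight exceeds a threshold `tau`) are all assumptions made for illustration. The sketch shows why a plain IPS mean can leave [0, 1] under a deterministic target policy while SNIPS, a weighted average of rewards, cannot when rewards lie in [0, 1]. DR carries no such guarantee in general.

```python
import numpy as np


def ope_estimates(r, mu, pi, q_logged, q_pi, clip=None, tau=None):
    """Off-policy value estimates from logged bandit feedback (illustrative).

    r        : observed rewards, one per logged interaction
    mu       : probability the logging policy gave to the logged action
    pi       : probability the target policy gives to that same action
    q_logged : reward-model prediction for the logged action
    q_pi     : reward-model prediction averaged over the target policy's actions
    clip     : optional upper bound applied to the importance weights
    tau      : optional Switch-DR threshold (assumed form, see lead-in)
    """
    r, mu, pi, q_logged, q_pi = (
        np.asarray(a, dtype=float) for a in (r, mu, pi, q_logged, q_pi)
    )
    w = pi / mu                      # importance weights
    if clip is not None:
        w = np.minimum(w, clip)      # weight clipping trades variance for bias

    w_sum = w.sum()
    ips = np.mean(w * r)             # unbiased without clipping, but unbounded
    # SNIPS is a weighted average of rewards, so it stays in the reward range.
    snips = np.sum(w * r) / w_sum if w_sum > 0 else float("nan")
    # DR: model-based baseline plus importance-weighted residual correction.
    dr = np.mean(q_pi + w * (r - q_logged))
    # Switch-DR (assumed form): keep the correction only where w <= tau.
    keep = (w <= tau) if tau is not None else np.ones_like(w, dtype=bool)
    switch_dr = np.mean(q_pi + keep * w * (r - q_logged))
    # Effective sample size: (sum w)^2 / sum w^2.
    ess = w_sum ** 2 / np.sum(w ** 2) if w_sum > 0 else 0.0

    return {"ips": ips, "snips": snips, "dr": dr,
            "switch_dr": switch_dr, "ess": ess}


# Toy log (invented numbers): a deterministic target that agrees with the
# logged action on only two of four interactions, one of them rarely logged.
est = ope_estimates(
    r=[1, 1, 1, 0],
    mu=[0.1, 0.5, 0.5, 0.5],
    pi=[1, 0, 0, 1],
    q_logged=[0.5] * 4,
    q_pi=[0.5] * 4,
    tau=5.0,
)
print(est)
```

On this toy log the single weight of 10 pushes the IPS mean to 2.5, far outside the range of a correctness reward, while SNIPS stays at about 0.83 and the effective sample size falls to roughly 1.4 of the 4 logged interactions. Passing `clip=2` brings IPS back to 0.5, which illustrates the variance reduction that clipping buys at the cost of bias.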