AN EXPLAINABLE CALIBRATION AND SHAP AUDIT OF HELP-SEEKING FEATURES IN CLASSICAL KNOWLEDGE TRACING

Dr. Mustafa Hameed; Dr. Musarat Karim; Dr. Muhammad Nauman; Ms. Alisha Fida; Dr. Nadia Khan

Abstract

Does richer feature engineering substitute for, add to, or simply lose out to model expressivity in classical knowledge tracing? We put this question to a direct test. On top of the usual outcome-only baseline of prior accuracy and prior attempt counts, we layer two further families of behavioural features: a help-seeking block (rolling hint count, bottom-hint usage, attempts-per-problem, first-action rate) and a response-time block (rolling mean log-first-response-time and log-overlap-time). On the canonical ASSISTments 2009-10 Skill Builder corpus (433 161 attempts, 4 163 students, 123 skills), we fit PFA, LR-KT, LightGBM, and XGBoost on three nested feature blocks (outcome-only, +help-seeking, +response-time) and report AUC, F1, ECE, and Brier score with bootstrap 95 % CIs on a user-grouped 80/20 split. The headline result contradicts the prior literature. A well-tuned LightGBM on the six-feature outcome-only block is the single best configuration we found (AUC 0.838 [0.836, 0.841]; ECE 0.0099 [0.0081, 0.0123]); adding the help-seeking features lowers AUC to 0.832 and raises ECE to 0.0140, with non-overlapping CIs. TreeSHAP pins down the mechanism: the outcome features carry roughly 5× the global attribution magnitude of the help-seeking and response-time blocks combined, and the per-skill SHAP analysis shows that help-seeking does its worst calibration damage precisely on the skills where the GBM is already miscalibrated. The linear LR-KT baseline moves in the opposite direction, showing a small ECE gain from the response-time history added in Block C, the third (+response-time) block. The pattern is consistent throughout: feature richness can stand in for model expressivity but does not compound with it, and on this corpus, expressivity wins. We release the pipeline, fitted models, per-skill reliability diagrams, and verified bibliography.
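For readers unfamiliar with the calibration metrics named above, the following is a minimal sketch of how ECE, the Brier score, and a percentile bootstrap CI can be computed. It is not taken from the released pipeline: the function names, the choice of 10 equal-width bins, the 1 000 resamples, and resampling at the attempt level (rather than by student) are all assumptions made for illustration.

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin.

    n_bins=10 is an illustrative assumption, not a value from the paper.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Map each probability to a bin index; clip so p == 1.0 lands in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def bootstrap_ci(metric, y_true, y_prob, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a metric.

    Assumption: rows (attempts) are resampled with replacement; a
    student-level resample would be an equally plausible design.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    n = len(y_true)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[i] = metric(y_true[idx], y_prob[idx])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

Calling `bootstrap_ci(expected_calibration_error, y_test, p_test)` on held-out labels and predicted probabilities (hypothetical names) would yield an interval of the same form as the bracketed ranges reported above, although the exact values depend on the binning and resampling choices.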