Investigating Serial Position Effects in Human and LLM-Based Raters' L2 Writing Scoring: A Comparative Study
Nurgül Bekdemir, School of Foreign Languages, Ordu University, Ordu, Turkiye (Turkey)
Turgay Han, Ordu University, Ordu, Turkiye (Turkey)
Abstract
Large Language Models (LLMs) are increasingly used in automated essay scoring; however, potential structural and cognitive biases in how rubrics are applied remain largely underexplored. This ongoing study investigates the serial position effect of rubric criterion order on English as a Foreign Language (EFL) writing scores assigned by human and LLM-based raters. To this end, it employs a counterbalanced within-subjects mixed-methods design (Corriero, 2017), approved by the Ordu University Ethics Committee (No. 2026-130). Ten human raters and one LLM (ChatGPT) serve as the two rater groups. Raters evaluate 24 essays, spanning high and low proficiency bands, using a departmental analytic rubric presented in three distinct formats: standard, reversed, and randomized criterion order. The quantitative phase compares the final scores assigned under each format to detect differences attributable to criterion order. The qualitative phase captures cognitive processes through Think-Aloud Protocols (TAPs) for human raters (Ericsson & Simon, 1993), and through Chain-of-Thought (CoT) prompting and detailed textual justifications for the LLM rater. Data analysis aims to reveal whether rubric layout induces a primacy or recency effect in either rater group, which would threaten scoring stability. The findings are expected to provide insights into the cognitive and algorithmic reliability of human and automated writing assessment, with implications for prompt engineering and rubric design.
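The three rubric formats and the counterbalancing of their presentation order can be illustrated with a minimal Python sketch. It is not the study's actual procedure: the abstract does not list the rubric's criteria or say how counterbalancing is implemented, so the criterion names, the seed, the helper names `criterion_order` and `format_sequence`, and the Latin-square rotation are all assumptions made for illustration.

```python
import random

# Hypothetical criterion names; the abstract does not list the rubric's criteria.
CRITERIA = ["content", "organization", "vocabulary", "grammar", "mechanics"]
FORMATS = ["standard", "reversed", "randomized"]


def criterion_order(fmt, criteria, seed=0):
    """Return the criteria in the order used by the given rubric format."""
    if len(criteria) < 3:
        # With fewer than 3 criteria every permutation is standard or reversed.
        raise ValueError("need at least 3 criteria for a distinct random order")
    standard = list(criteria)
    reverse = standard[::-1]
    if fmt == "standard":
        return standard
    if fmt == "reversed":
        return reverse
    if fmt == "randomized":
        rng = random.Random(seed)  # fixed seed keeps the order reproducible
        shuffled = list(standard)
        # Reshuffle until the order differs from both fixed formats.
        while shuffled == standard or shuffled == reverse:
            rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown format: {fmt}")


def format_sequence(rater_index, formats=FORMATS):
    """Assumed Latin-square rotation: rater i starts with format i mod 3."""
    k = rater_index % len(formats)
    return list(formats[k:]) + list(formats[:k])


if __name__ == "__main__":
    for rater in range(3):
        print(rater, format_sequence(rater))
    print(criterion_order("randomized", CRITERIA, seed=1))
```

Under this assumed rotation, each format appears once in every presentation position across any three consecutive raters. With ten raters the balance cannot be exact, because ten is not divisible by three: one sequence would be used four times and the other two three times each.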
Innovation in Language Learning