Department
Barowsky School of Business
Document Type
Podium Presentation
Presentation Date
2026
Sponsorship/Conference/Institution
Academy of Management Annual Conference
Location
Philadelphia, PA
Abstract/Presentation Excerpt
Constructed response assessments provide rich evidence of transfer and deep learning but require resource-intensive human scoring that creates practical barriers for management researchers. This study examines the reliability and validity of generative AI scoring as an alternative to traditional human scoring for research applications. Using an Evidence-Centered Design framework, we evaluated three commercial GenAI systems (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash) scoring constructed responses assessing transfer of learning in motivation and leadership content domains (N = 243). Results demonstrated that GenAI achieved interrater reliability comparable to trained human raters, excellent test-retest reliability (ICC > .90) across all items, strong convergent validity with both research assistant and subject matter expert scores, and equivalent construct validity in predicting retention performance. Item-level analyses revealed that scoring challenges stemmed from item characteristics rather than rater type. These findings provide substantial psychometric evidence supporting GenAI as a reliable, valid, and accessible alternative to human scoring for constructed response assessments in management research contexts.
Rights
The Authors. All rights reserved.