Department

Barowsky School of Business

Document Type

Podium Presentation

Presentation Date

2026

Sponsorship/Conference/Institution

Academy of Management Annual Conference

Location

Philadelphia, PA

Abstract/Presentation Excerpt

Constructed response assessments provide rich evidence of transfer and deep learning but require resource-intensive human scoring that creates practical barriers for management researchers. This study examines the reliability and validity of generative AI scoring as an alternative to traditional human scoring for research applications. Using an Evidence-Centered Design framework, we evaluated three commercial GenAI systems (GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Flash) scoring constructed responses assessing transfer of learning in motivation and leadership content domains (N = 243). Results demonstrated that GenAI achieved interrater reliability comparable to trained human raters, excellent test-retest reliability (ICC > .90) across all items, strong convergent validity with both research assistant and subject matter expert scores, and equivalent construct validity in predicting retention performance. Item-level analyses revealed that scoring challenges stemmed from item characteristics rather than rater type. These findings provide substantial psychometric evidence supporting GenAI as a reliable, valid, and accessible alternative to human scoring for constructed response assessments in management research contexts.

Rights

The Authors. All rights reserved.
