Summary: | To address the complexity and high costs of developing listening tests for test-takers of varying proficiency levels, this study investigates the capabilities of OpenAI's large language model, ChatGPT 4, in developing listening assessments. Employing prompt engineering and fine-tuning of prompts, the study focuses on creating listening scripts and test items using ChatGPT 4 for test-takers across a spectrum of proficiency levels (academic, low, intermediate, and advanced). For comparability, the 24 topics of these scripts were selected from topics found in academic listening tests. We conducted two types of analyses to evaluate the quality of the output. First, we performed linguistic analyses of the scripts using Coh-Metrix and Text Inspector to determine whether the scripts varied linguistically as required by the prompts. Second, we analyzed topic variation and the degree of overlap in the test items. Results indicated that while ChatGPT 4 reliably produced scripts with significant textual variations, the generated test items were often long and exhibited semantic overlap among options. These item-level effects were also influenced by the topic. We discuss the ethical complexities that arise from the use of generative artificial intelligence (GenAI) and how GenAI can potentially benefit practitioners and researchers in language assessment, while recognizing its limitations.
|