Prompting and Output Evaluation in ChatGPT for Teaching and Learning - A Review of Empirical Studies Using Machine Learning
Wenting Sun, Humboldt-Universität zu Berlin (Germany)
Jiangyue Liu, Suzhou University (China)
Xiaoling Wang, Zhejiang Normal University (China)
Abstract
Slight changes in prompting can produce large variations in the output of generative Artificial Intelligence (AI). Understanding how prompts are used in education can reduce the trial-and-error effort of educators and learners working with AI driven by Large Language Models (LLMs). From the perspective of human-computer interaction (HCI) research, the interactive use of prompting faces challenges such as a lack of guidance, difficulty representing tasks and effort, and limited generalization of prompts [1]. It is therefore important to glean practical experience and lessons from existing articles on prompt usage. However, existing reviews of prompt engineering are often couched in technical terms or lack a synthesis of output evaluation methods.
This review explores how non-AI experts construct prompts in education and which methods are used to evaluate the generated output. Using ChatGPT as an example, it synthesizes prompt engineering behaviors in education, combining Biggs's Presage-Process-Product (3P) model and the Turn, Expression, Level of Details, Role (TELeR) taxonomy as the analytical framework [2,3]. Data were sourced from the Web of Science and Scopus databases following PRISMA guidelines [4], yielding a dataset of 495 empirical articles on ChatGPT in education. The 102 articles that reported prompting details were analyzed in depth using thematic analysis, clustering analysis, and Machine Learning (ML) and Natural Language Processing (NLP) techniques, the latter to explore whether articles with and without prompting details can be classified automatically.
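As an illustration of the clustering step only, the following is a minimal sketch; it assumes the coded prompting features are stored as a binary article-by-code matrix and that a k-means algorithm (here scikit-learn's KMeans) is applied, neither of which is specified in the review.

```python
# Minimal sketch: clustering coded prompting features into groups.
# The binary article-by-code matrix, the example feature names, and the use
# of KMeans are illustrative assumptions, not the review's confirmed procedure.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Rows = the 102 articles with prompting details; columns = hypothetical
# coded features (e.g., "role specified", "multi-turn", "output format given").
coding_matrix = rng.integers(0, 2, size=(102, 12))

kmeans = KMeans(n_clusters=6, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(coding_matrix)
print(np.bincount(cluster_labels))  # number of articles per cluster
```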
Six groups emerged from clustering the coding results. For classifying articles with and without prompting details, the combinations of bigram features with the Naïve Bayes (NB) algorithm and of TF-IDF features with a Support Vector Machine (SVM) performed best. Findings suggest that domain knowledge can compensate for the limited prompting skills of non-AI experts. The study identifies specific features and patterns in prompt construction across three stages and suggests future directions for analyzing ChatGPT usage behaviors.
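The following is a minimal sketch of the two best-performing pipelines named above (bigram counts with Multinomial Naïve Bayes, and TF-IDF with a linear SVM), using scikit-learn; the placeholder texts and labels and the absence of preprocessing or hyperparameter tuning are assumptions, not the review's exact configuration.

```python
# Minimal sketch of the two reported pipelines for detecting whether an
# article documents its prompts: (a) bigrams + Naive Bayes, (b) TF-IDF + SVM.
# Texts and labels below are placeholders; preprocessing and tuning are omitted.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "the exact prompt wording, role instruction, and turns are reported",
    "students used chatgpt but no prompts were provided in the paper",
    "appendix a lists every prompt given to chatgpt during the study",
    "the article mentions chatgpt use without any prompt examples",
]
labels = [1, 0, 1, 0]  # 1 = reports prompting details, 0 = does not

bigram_nb = make_pipeline(CountVectorizer(ngram_range=(2, 2)), MultinomialNB())
tfidf_svm = make_pipeline(TfidfVectorizer(), LinearSVC())

for name, model in [("bigrams + NB", bigram_nb), ("TF-IDF + SVM", tfidf_svm)]:
    model.fit(texts, labels)
    pred = model.predict(["the full prompt text is provided in the appendix"])
    print(name, pred)
```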
Keywords: Prompt engineering, Generated output evaluation, Human-AI interaction, ChatGPT, Machine learning
REFERENCES
[1] Dang H., Mecke L., Lehmann F., Goller S., Buschek D., “How to prompt? Opportunities and challenges of zero- and few-shot learning for human-AI interaction in creative applications of generative models”, arXiv preprint arXiv:2209.01390, 2022.
[2] Biggs J., Kember D., Leung D. Y., “The revised two‐factor study process questionnaire: R‐SPQ‐2F”, British Journal of Educational Psychology, 71(1), 133-149, 2001.
[3] Santu S. K. K., Feng D., “TELeR: A general taxonomy of LLM prompts for benchmarking complex tasks”, arXiv preprint arXiv:2305.11430, 2023.
[4] Page M. J., McKenzie J. E., Bossuyt P. M., Boutron I., Hoffmann T. C., Mulrow C. D., Moher D., “The PRISMA 2020 statement: an updated guideline for reporting systematic reviews”, International Journal of Surgery, 88, 105906, 2021.