<resource xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://datacite.org/schema/kernel-4" xsi:schemaLocation="http://datacite.org/schema/kernel-4 http://schema.datacite.org/meta/kernel-4.1/metadata.xsd"><identifier identifierType="DOI">10.7910/DVN/OV2WAM</identifier><creators><creator><creatorName nameType="Personal">Milani, Alfredo</creatorName><givenName>Alfredo</givenName><familyName>Milani</familyName><nameIdentifier schemeURI="https://orcid.org/" nameIdentifierScheme="ORCID">0000-0003-4534-1805</nameIdentifier><affiliation>https://ror.org/035mh1293</affiliation></creator></creators><titles><title>Vulnerability of LLMs in Educational Assessment</title></titles><publisher>Harvard Dataverse</publisher><publicationYear>2025</publicationYear><subjects><subject>Computer and Information Science</subject><subject>Social Sciences</subject><subject>Large Language Models</subject><subject subjectScheme="Generative AI">Prompt Injection</subject><subject>Education Sciences</subject><subject>Education Evaluation</subject><subject>Trustworthy AI</subject><subject>Human-in-the-Loop AI</subject></subjects><contributors><contributor contributorType="ContactPerson"><contributorName nameType="Personal">Milani, Alfredo</contributorName><givenName>Alfredo</givenName><familyName>Milani</familyName><affiliation>Link Campus University, Rome, Italy</affiliation></contributor><contributor contributorType="ContactPerson"><contributorName nameType="Personal">Franzoni, Valentina</contributorName><givenName>Valentina</givenName><familyName>Franzoni</familyName><affiliation>University of Perugia, Italy</affiliation></contributor><contributor contributorType="ContactPerson"><contributorName nameType="Personal">Florindi, Emanuele</contributorName><givenName>Emanuele</givenName><familyName>Florindi</familyName><affiliation>University of Modena-Reggio Emilia</affiliation></contributor></contributors><dates><date dateType="Submitted">2025-09-12</date><date dateType="Updated">2025-09-12</date></dates><resourceType 
resourceTypeGeneral="Dataset"/><relatedIdentifiers><relatedIdentifier relationType="IsSupplementTo" relatedIdentifierType="ISSN">2227-7102</relatedIdentifier></relatedIdentifiers><sizes><size>4804924</size></sizes><formats><format>application/zip</format></formats><version>1.0</version><rightsList><rights rightsURI="info:eu-repo/semantics/openAccess"/><rights rightsURI="http://creativecommons.org/publicdomain/zero/1.0">CC0 1.0</rights></rightsList><descriptions><description descriptionType="Abstract">This dataset contains the output of experiments from a research project on the vulnerability of LLMs in educational assessment.

The dataset contains:
- the student assignment data, in both the normal form and the injected form
- the output produced by the tested LLMs (ChatGPT, Gemini, DeepSeek, Grok, Perplexity, and Copilot) in the experiments evaluating the assignments, both individually as single documents and collectively as groups of documents, denominated as follows:

- User Legitimate LLM Prompts
- Normal (no injection), providing the reference baseline evaluation
- Prompt Injection Pass, a type of injection experiment, called Fail-To-Top, that moves an assignment evaluated FAIL by the reference baseline evaluation to PASS, i.e. above 35% of the total points
- Prompt Injection to Top25, a type of injection experiment that moves an assignment with a lower reference baseline evaluation into the top 25%. This latter type of experiment comes in three versions, Fail-To-Top, Sat-To-Top, and Good-To-Top, where assignments whose reference baseline evaluation is respectively Fail (below 35%), Satisfactory (above 35% and below 50%), or Good (above 50% and below 75%) are considered for injection.

The names of the folders and of the output result files are accordingly self-explanatory.</description></descriptions><geoLocations/></resource>