Introduction: Why put AI to the test?
Exploration of AI in chemistry research has been ongoing for many years, but it has accelerated dramatically in the last decade. Notably, the 2024 Nobel Prize in Chemistry recognized work that enabled the AI tool AlphaFold to predict protein structures (Aqvist, 2024). Today, AI is a regular partner in many chemistry research processes, from protein structure prediction and materials discovery to synthesis planning and the interpretation of complex datasets. With the rapid spread of generative AI (GenAI) tools like ChatGPT, many students are also using these systems for brainstorming ideas, describing scientific processes, and browsing for information.
GenAI changes how students encounter information. For a common task like literature searching for a scientific review, or simply browsing for background information, a chatbot can produce a scientific-sounding explanation and plausible-looking citations in seconds. Yet the output may be incomplete, outdated, irrelevant, or simply wrong. In response, instructors have been asking a practical question: how do these tools fit into established coursework in ways that genuinely support learning?
Although much of the attention has focused on GenAI “doing the work for students” or enabling plagiarism, a more immediate risk is quieter: citation errors and fabricated references that slip into academic writing because they seem credible. A widely reported example outside academia illustrates the stakes: an attorney submitted AI-generated court citations that turned out not to exist, prompting scrutiny and potential sanctions (Bohannon, 2023). Chemistry students face a similar hazard when GenAI chatbots confidently cite journal articles that cannot be located.
To address this challenge, a GenAI-assisted writing assignment was designed through collaboration between a librarian and a chemistry instructor (Reddy et al., 2024). The assignment treats AI not as a shortcut, but as a prompt for critical evaluation: students used GenAI during the writing process and then verified each AI-suggested reference for accuracy and relevance. The goal was to make research integrity tangible and actionable, shifting from the familiar rule of “cite your sources” to “validate what you plan to rely on,” even (and especially) when those sources are generated by AI.
Process: What did we do?
Students in two chemistry courses were asked to write essays with the help of GenAI tools (U-M GPT, ChatGPT, and Claude) (Reddy et al., 2024). But there was a catch: every citation proposed by the chatbot had to be meticulously checked for factual accuracy and contextual relevance. An introductory library lecture set expectations, warning students about potential “hallucinations”: fake references that look legitimate but don’t exist or that contain incorrect bibliographic details. The lecture also addressed prompt literacy, ethical considerations for AI use, and verification workflows, such as confirming citation details in disciplinary databases and cross-checking article metadata. Students also shared works in progress through short in-class presentations, where peer feedback helped them refine AI-assisted drafts into coherent scientific essays and reflect on when (and when not) GenAI was useful. Finally, requiring submission of GenAI-assisted drafts gave instructors a checkpoint for feedback, supported oversight of citation practices, and provided material for analyzing patterns in AI output.
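The metadata cross-check described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the workflow used in the courses: the field list, the function names, and the 0.9 title-similarity threshold are all hypothetical, and the "trusted" record is assumed to have been copied by hand from a disciplinary database. The example record is the Reddy et al. entry from the reference list; the "claimed" version deliberately carries a wrong year and issue.

```python
from difflib import SequenceMatcher

# Hypothetical checklist of bibliographic fields; an actual rubric may differ.
FIELDS = ("title", "journal", "year", "volume", "issue", "pages", "doi")


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(str(value).lower().split())


def compare_citation(claimed, trusted, title_threshold=0.9):
    """Return the fields where a chatbot citation disagrees with a trusted record.

    `claimed` is the citation as the chatbot gave it; `trusted` is the record
    taken from a database. Both are plain dicts keyed by the names in FIELDS.
    """
    mismatches = []
    for field in FIELDS:
        a = normalize(claimed.get(field, ""))
        b = normalize(trusted.get(field, ""))
        if field == "title":
            # Titles are compared fuzzily (assumed threshold) to tolerate
            # small punctuation or capitalization differences.
            if SequenceMatcher(None, a, b).ratio() < title_threshold:
                mismatches.append(field)
        elif a != b:
            mismatches.append(field)
    return mismatches


# Trusted record taken from this article's reference list.
trusted = {
    "title": "Implementation and Evaluation of a ChatGPT-Assisted Special "
             "Topics Writing Assignment in Biochemistry",
    "journal": "J. Chem. Educ.",
    "year": 2024,
    "volume": 101,
    "issue": 7,
    "pages": "2740-2748",
    "doi": "10.1021/acs.jchemed.4c00226",
}

# Hypothetical chatbot output: same article, but wrong year and issue.
claimed = dict(trusted, year=2023, issue=5)

print(compare_citation(claimed, trusted))
```

A check like this only flags disagreements in bibliographic details; deciding whether a real article is relevant to the essay topic still requires reading the abstract.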
Findings: What surprised us?
The results were equal parts enlightening and exasperating.
Across 456 GenAI-generated citations, 47% contained at least one fabricated element, typically a bogus publication year, issue, or volume (Fig. 1).
Figure 1. Average frequency of inaccurate individual reference components in GenAI-assisted student essays (e.g., a value of 26.1% for "Title" means that publication titles were cited incorrectly in 26.1% of references per essay, on average).
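The per-essay averaging described in the Figure 1 caption differs from a pooled percentage over all references, and a small Python sketch may make the distinction concrete. The toy data, function names, and component labels below are purely illustrative assumptions and are not the study's data or analysis code.

```python
from statistics import mean


def component_error_rates(essays, components):
    """Per-essay average (in %) of references with a given wrong component.

    `essays` is a list of essays; each essay is a list of references; each
    reference is a set naming the components found to be wrong
    (an empty set means every component checked out).
    """
    rates = {}
    for comp in components:
        per_essay = [
            sum(comp in ref for ref in essay) / len(essay)
            for essay in essays
            if essay  # skip essays without references to avoid dividing by zero
        ]
        rates[comp] = 100 * mean(per_essay)
    return rates


def share_with_any_error(essays):
    """Pooled percentage of all references with at least one wrong component."""
    refs = [ref for essay in essays for ref in essay]
    return 100 * sum(bool(ref) for ref in refs) / len(refs)


# Illustrative toy data only (NOT the study's data).
essays = [
    [{"year"}, set(), {"title", "volume"}, set()],  # essay A: 4 references
    [{"title"}, set()],                             # essay B: 2 references
]

rates = component_error_rates(essays, ["title", "year", "volume"])
pooled = share_with_any_error(essays)
print(rates, pooled)
```

In this toy case the per-essay average for titles is 37.5% (the mean of 25% and 50%), whereas pooling all six references would give 2 of 6, or about 33%. The two summaries answer slightly different questions, which is why the caption spells out the per-essay definition.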
Several students complained that none of the AI-supplied references for their topic actually existed. One student summarized the experience this way:
“A particularly frustrating issue was the accuracy of sources. When I requested citations, the AI provided references that were not only irrelevant but often fabricated. All of the papers it cited did not exist, and the titles were entirely incorrect. I was eventually unable to find any similar articles and eventually had to find new articles to support the essay. This made me realize that AI-generated content cannot be trusted to provide reliable or verifiable citations. As a result, I spent a substantial amount of time fact-checking nearly every statement in the AI-generated essays and searching for relevant articles, which undermined the efficiency of using the tool.”
Our bibliometric analysis revealed an uneven pattern in what GenAI “gets right” (Figs. 2 and 3).
Citations to high-profile, multidisciplinary journals (e.g., Nature family journals) and to articles with stronger public visibility (e.g., higher Altmetric scores and more Wikipedia mentions) were more likely to be accurate.
By contrast, disciplinary chemistry journals that are central to students’ coursework showed higher rates of citation problems.
Figure 2. Normalized distribution of GenAI-generated references (Real vs. Fabricated) by journal scope.
Figure 3. Journal and article parameters (mean) for three different types of AI-generated citations. The results show that GenAI-generated references that are correct tend to be associated with the highest mean Altmetric scores, higher citation counts, and journals of greater impact, relative to fabricated or incomplete references.
Importantly, even “real” references weren’t necessarily useful. Of all citations students were able to verify as authentic, only 58% were actually relevant to the essay topic (Sevryugina & Vargas, 2026).
Students’ frustration with fake or irrelevant citations often gave way to curiosity as they began triangulating sources, searching disciplinary databases more strategically, and sharpening their judgment about what counts as credible and relevant evidence.
Conclusions & next steps: Building AI literacy for tomorrow’s scholars
Repeated encounters with fabricated citations nudged many students from passive reliance on polished AI output to active, critical engagement with scholarly material. In practice, that shift looked like checking records across platforms, comparing metadata, reading abstracts for fit, and learning, sometimes the hard way, that what “looks real” is not the same as “real.”
That learning came with real costs. Students reported spending substantial time (often more than an hour) validating references and integrating GenAI output with their own writing. For some, the verification workload generated frustration and reduced interest in using GenAI tools in future assignments. More concerning, despite explicit instructions to verify and correct bibliographies, we observed breakdowns in follow-through: in 49% of student reports, fabricated references still appeared in the final essay. Whether driven by time pressure, overconfidence, or disengagement, these cases underscore a key point for instruction: AI literacy requires not only awareness of limitations but also structures that support students in acting on that awareness.
Several implications stand out.
- More robust training on information validation, including efficient verification workflows and triangulation strategies
- Expanded discussion of ethical and legal dimensions, including bias, transparency, copyright, and appropriate disclosure
- Stronger guidance on using trusted disciplinary databases alongside (or instead of) chatbot-supplied references
- Continued emphasis on peer review and instructor oversight, with checkpoints that make verification visible and assessable
Through this experience, we learned that when structured well, each frustrating “fake” citation can become a teachable moment that builds durable skills in information literacy, research integrity, and responsible scholarship. Those skills matter because, despite AI’s promise, technology alone isn’t enough to maintain scholarly standards. A reference list isn’t just academic “housekeeping”; it documents the intellectual trail behind an argument and helps prevent misattribution and plagiarism. Citations influence how articles are indexed and discovered, and they shape the metrics used to assess scholarly impact. When references are inaccurate, the damage can extend beyond a single assignment by weakening the coherence of an argument, distorting evidence in reviews, and in some cases contributing to faulty conclusions. A central takeaway of this study is that authors remain responsible for the accuracy and relevance of their citations; algorithms cannot assume that responsibility.
ACKNOWLEDGMENTS
We would like to acknowledge the students enrolled in selected classes at the University of Michigan for their survey responses and engagement with the GenAI-assisted writing assignment. We are also grateful to Dr. Nils Walter and Dr. Nicolai Lehnert for providing access to student reports used in this study. Special thanks to Diego Vargas and Grace Allison for their assistance with reference validation and to the Undergraduate Research Opportunity Program (UROP) for supporting these students. We also thank the Center for Academic Innovation at the University of Michigan - Ann Arbor for project funding.
References
(Aqvist, 2024) Aqvist, J. (2024, October 9). Computational protein design and protein structure prediction. The Nobel Committee for Chemistry. Available online: https://www.nobelprize.org/uploads/2024/10/advanced-chemistryprize2024.pdf (accessed on 14 July 2025).
(Bohannon, 2023) Bohannon, M. (2023, June 8). Lawyer Used ChatGPT In Court - And Cited Fake Cases. A Judge Is Considering Sanctions. Forbes. Available online: https://www.forbes.com/sites/mollybohannon/2023/06/08/lawyer-used-chatgpt-in-court-and-cited-fake-cases-a-judge-is-considering-sanctions/ (accessed in September 2025).
(Reddy et al., 2024) Reddy, M. R.; Walter, N. G.; Sevryugina, Y. V. Implementation and Evaluation of a ChatGPT-Assisted Special Topics Writing Assignment in Biochemistry. J. Chem. Educ. 2024, 101 (7), 2740–2748. DOI: 10.1021/acs.jchemed.4c00226.
For full details about this research, see: Yulia V. Sevryugina, Diego Vargas, Teaching Research Integrity through Verification of AI-Generated References: An Activity for Upper-Level Chemistry Courses, Journal of Chemical Education 2026, 103(5), 2610–2620, DOI: 10.1021/acs.jchemed.5c01620
For student perceptions, see: Sevryugina, Y. V.; Collins-Thompson, K.; Walter, N. G. Integrating AI Literacy in Chemistry Graduate Education: Harnessing the Power of Transformer-Based Models. AI Edu. 2026, 2(2), 14, https://doi.org/10.3390/aieduc2020014.