Performance of Large Language Models on Nursing Licensure Examinations: A Systematic Review and Meta-analysis

Amankwaa, I; Odoom, A; Kasim, A; Kobiah, E; Diebieri, M; Boateng, EA; Gyamfi, S; Hales, C

doi:10.1016/j.nedt.2026.107154

aut.relation.articlenumber	107154
aut.relation.journal	Nurse Education Today
aut.relation.startpage	107154
aut.relation.volume	165
dc.contributor.author	Amankwaa, I
dc.contributor.author	Odoom, A
dc.contributor.author	Kasim, A
dc.contributor.author	Kobiah, E
dc.contributor.author	Diebieri, M
dc.contributor.author	Boateng, EA
dc.contributor.author	Gyamfi, S
dc.contributor.author	Hales, C
dc.date.accessioned	2026-05-25T01:58:40Z
dc.date.available	2026-05-25T01:58:40Z
dc.date.issued	2026-05-12
dc.description.abstract	Objectives: This systematic review and meta-analysis assessed the performance of large language models (LLMs) in nursing licensure examinations. Despite the increasing use of LLMs in healthcare education, their capabilities in nursing licensure examinations remain uncertain. This study provides evidence on the accuracy and limitations of LLMs to help guide their integration into nursing education and licensure. Design: The systematic review and meta-analysis adhered to PRISMA 2020 guidelines. Data sources: PubMed, CINAHL, PsycINFO, EMCARE, and ERIC were searched from April to June 2025. Eligibility criteria: Studies were eligible if they evaluated LLMs (e.g., GPT-4, ChatGPT, Qwen-2.5) using multiple-choice nursing licensure questions under exam-like conditions and reported quantitative accuracy. Open-ended items were excluded from the meta-analysis due to incompatible scoring methods, but were narratively synthesised. Review methods: Two reviewers independently screened, extracted data, and appraised the risk of bias. A random-effects meta-analysis estimated pooled accuracy; subgroup and meta-regression analyses explored heterogeneity. Results: Twelve studies assessed 13,870 MCQs across seven exam systems and ten LLMs. Pooled accuracy was 69.6% (95% CI: 65.6–73.6%) with substantial heterogeneity (I² = 98%). GPT-4 outperformed GPT-3.5 (77.2% vs. 60.4%); domain-customised and newer models reached 93.6%. LLMs excelled in general medicine and pharmacology but underperformed in ethics and psychosocial integrity. Accuracy did not differ significantly by exam system (p = 0.14), question difficulty (p = 0.90) or format (p = 0.96). In meta-regression, Custom GPT (p = 0.0006) and Qwen 2.5 (p = 0.026) were the only significant predictors of higher accuracy; no exam system, question format, or difficulty level reached significance. Methodological variability and underreporting of model parameters were common.
Conclusions: LLMs show promise for low-stakes educational applications, such as formative assessments within hybrid teaching models; however, they are unsuitable for unmoderated, high-stakes licensure decisions due to inconsistent performance. Regulatory guidelines, equitable access, and nursing-specific model development are needed to ensure fairness and validity. Research must prioritise standardised frameworks, error analysis, and broader geographic representation to address these limitations.
dc.identifier.citation	Nurse Education Today, ISSN: 0260-6917 (Print); 1532-2793 (Online), Elsevier BV, 165, 107154. doi: 10.1016/j.nedt.2026.107154
dc.identifier.doi	10.1016/j.nedt.2026.107154
dc.identifier.issn	0260-6917
dc.identifier.issn	1532-2793
dc.identifier.uri	http://hdl.handle.net/10292/21211
dc.language	eng
dc.publisher	Elsevier BV
dc.relation.uri	https://www.sciencedirect.com/science/article/pii/S0260691726001826
dc.rights	© 2026 The Author(s). Published by Elsevier Ltd. This is an open access article distributed under the terms of the Creative Commons CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. You are not required to obtain permission to reuse this article.
dc.rights.accessrights	OpenAccess
dc.rights.uri	https://creativecommons.org/licenses/by/4.0/
dc.subject	Artificial Intelligence
dc.subject	Educational measurement
dc.subject	Large language models
dc.subject	Licensure
dc.subject	Nursing
dc.subject	Systematic review
dc.subject	4203 Health Services and Systems
dc.subject	42 Health Sciences
dc.subject	1110 Nursing
dc.subject	1302 Curriculum and Pedagogy
dc.subject	3901 Curriculum and pedagogy
dc.subject	4204 Midwifery
dc.subject	4205 Nursing
dc.title	Performance of Large Language Models on Nursing Licensure Examinations: A Systematic Review and Meta-analysis
dc.type	Journal Article
pubs.elements-id	761826

Files

Original bundle

Name: Amankwaa et al._2026_Performance of large language models on nursing licensure examinations_.pdf
Size: 924.81 KB
Format: Adobe Portable Document Format
Description: Journal article

License bundle

Name: license.txt
Size: 1.37 KB
Format: Plain Text
Description:

Collections

School of Nursing