Performance of Large Language Models on Nursing Licensure Examinations: A Systematic Review and Meta-analysis

aut.relation.articlenumber: 107154
aut.relation.journal: Nurse Education Today
aut.relation.startpage: 107154
aut.relation.volume: 165
dc.contributor.author: Amankwaa, I
dc.contributor.author: Odoom, A
dc.contributor.author: Kasim, A
dc.contributor.author: Kobiah, E
dc.contributor.author: Diebieri, M
dc.contributor.author: Boateng, EA
dc.contributor.author: Gyamfi, S
dc.contributor.author: Hales, C
dc.date.accessioned: 2026-05-25T01:58:40Z
dc.date.available: 2026-05-25T01:58:40Z
dc.date.issued: 2026-05-12
dc.description.abstract: Objectives: This systematic review and meta-analysis assessed the performance of large language models (LLMs) in nursing licensure examinations. Despite the increasing use of LLMs in healthcare education, their capabilities in nursing licensure examinations remain uncertain. This study provides evidence on the accuracy and limitations of LLMs to help guide their integration into nursing education and licensure. Design: The systematic review and meta-analysis adhered to PRISMA 2020 guidelines. Data sources: PubMed, CINAHL, PsycINFO, EMCARE, and ERIC were searched from April to June 2025. Eligibility criteria: Studies were eligible if they evaluated LLMs (e.g., GPT-4, ChatGPT, Qwen-2.5) using multiple-choice nursing licensure questions under exam-like conditions and reported quantitative accuracy. Open-ended items were excluded from the meta-analysis due to incompatible scoring methods, but were narratively synthesised. Review methods: Two reviewers independently screened, extracted data, and appraised the risk of bias. A random-effects meta-analysis estimated pooled accuracy; subgroup and meta-regression analyses explored heterogeneity. Results: Twelve studies assessed 13,870 MCQs across seven exam systems and ten LLMs. Pooled accuracy was 69.6% (95% CI: 65.6–73.6%) with substantial heterogeneity (I² = 98%). GPT-4 outperformed GPT-3.5 (77.2% vs. 60.4%); domain-customised and newer models reached 93.6%. LLMs excelled in general medicine and pharmacology but underperformed in ethics and psychosocial integrity. Accuracy did not differ significantly by exam system (p = 0.14), question difficulty (p = 0.90) or format (p = 0.96). In meta-regression, Custom GPT (p = 0.0006) and Qwen 2.5 (p = 0.026) were the only significant predictors of higher accuracy; no exam system, question format, or difficulty level reached significance. Methodological variability and underreporting of model parameters were common. Conclusions: LLMs show promise for low-stakes educational applications, such as formative assessments within hybrid teaching models; however, they are unsuitable for unmoderated, high-stakes licensure decisions due to inconsistent performance. Regulatory guidelines, equitable access, and nursing-specific model development are needed to ensure fairness and validity. Research must prioritise standardised frameworks, error analysis, and broader geographic representation to address these limitations.
dc.identifier.citation: Nurse Education Today, ISSN: 0260-6917 (Print); 1532-2793 (Online), Elsevier BV, 165, 107154-. doi: 10.1016/j.nedt.2026.107154
dc.identifier.doi: 10.1016/j.nedt.2026.107154
dc.identifier.issn: 0260-6917
dc.identifier.issn: 1532-2793
dc.identifier.uri: http://hdl.handle.net/10292/21211
dc.language: eng
dc.publisher: Elsevier BV
dc.relation.uri: https://www.sciencedirect.com/science/article/pii/S0260691726001826
dc.rights: © 2026 The Author(s). Published by Elsevier Ltd. This is an open access article distributed under the terms of the Creative Commons CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. You are not required to obtain permission to reuse this article.
dc.rights.accessrights: OpenAccess
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Artificial Intelligence
dc.subject: Educational measurement
dc.subject: Large language models
dc.subject: Licensure
dc.subject: Nursing
dc.subject: Systematic review
dc.subject: 4203 Health Services and Systems
dc.subject: 42 Health Sciences
dc.subject: 1110 Nursing
dc.subject: 1302 Curriculum and Pedagogy
dc.subject: 3901 Curriculum and pedagogy
dc.subject: 4204 Midwifery
dc.subject: 4205 Nursing
dc.title: Performance of Large Language Models on Nursing Licensure Examinations: A Systematic Review and Meta-analysis
dc.type: Journal Article
pubs.elements-id: 761826
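
The abstract in this record reports a random-effects pooled accuracy with a 95% CI and an I² heterogeneity statistic. As a rough illustration of how such quantities can be computed, here is a minimal sketch in Python. It assumes a DerSimonian-Laird estimator on untransformed proportions; the record does not say which estimator or transformation the authors used, so treat that choice as an assumption. The function name and the study counts in the usage note are hypothetical and are not data from the review.

```python
import math


def dersimonian_laird(correct, total):
    """Pool per-study accuracies with a DerSimonian-Laird random-effects model.

    Assumption: raw (untransformed) proportions, each strictly between 0 and 1,
    because a proportion of exactly 0 or 1 has zero estimated variance.
    """
    # Per-study accuracy and its within-study (binomial) variance.
    p = [c / n for c, n in zip(correct, total)]
    v = [pi * (1 - pi) / n for pi, n in zip(p, total)]

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    w = [1 / vi for vi in v]
    fixed = sum(wi * pi for wi, pi in zip(w, p)) / sum(w)
    q = sum(wi * (pi - fixed) ** 2 for wi, pi in zip(w, p))
    df = len(p) - 1

    # Between-study variance tau^2, truncated at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)

    # Random-effects weights add tau^2 to each within-study variance.
    w_re = [1 / (vi + tau2) for vi in v]
    pooled = sum(wi * pi for wi, pi in zip(w_re, p)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    # I^2: share of total variation attributed to heterogeneity, in percent.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2": i2,
    }
```

For example, dersimonian_laird([50, 90, 70], [100, 100, 100]) pools three made-up studies that answered 50, 90, and 70 of 100 questions correctly. Published meta-analyses of proportions often work on a logit or arcsine scale instead, which changes the numbers but not the overall structure.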

Files

Original bundle

Name: Amankwaa et al._2026_Performance of large language models on nursing licensure examinations_.pdf
Size: 924.81 KB
Format: Adobe Portable Document Format
Description: Journal article

License bundle

Name: license.txt
Size: 1.37 KB
Format: Plain Text
Description: