Performance of Large Language Models on Nursing Licensure Examinations: A Systematic Review and Meta-analysis
| aut.relation.articlenumber | 107154 | |
| aut.relation.journal | Nurse Education Today | |
| aut.relation.startpage | 107154 | |
| aut.relation.volume | 165 | |
| dc.contributor.author | Amankwaa, I | |
| dc.contributor.author | Odoom, A | |
| dc.contributor.author | Kasim, A | |
| dc.contributor.author | Kobiah, E | |
| dc.contributor.author | Diebieri, M | |
| dc.contributor.author | Boateng, EA | |
| dc.contributor.author | Gyamfi, S | |
| dc.contributor.author | Hales, C | |
| dc.date.accessioned | 2026-05-25T01:58:40Z | |
| dc.date.available | 2026-05-25T01:58:40Z | |
| dc.date.issued | 2026-05-12 | |
| dc.description.abstract | Objectives: This systematic review and meta-analysis assessed the performance of large language models (LLMs) in nursing licensure examinations. Despite the increasing use of LLMs in healthcare education, their capabilities in nursing licensure examinations remain uncertain. This study provides evidence on the accuracy and limitations of LLMs to help guide their integration into nursing education and licensure. Design: The systematic review and meta-analysis adhered to PRISMA 2020 guidelines. Data sources: PubMed, CINAHL, PsycINFO, EMCARE, and ERIC were searched from April to June 2025. Eligibility criteria: Studies were eligible if they evaluated LLMs (e.g., GPT-4, ChatGPT, Qwen-2.5) using multiple-choice nursing licensure questions under exam-like conditions and reported quantitative accuracy. Open-ended items were excluded from the meta-analysis due to incompatible scoring methods, but were narratively synthesised. Review methods: Two reviewers independently screened, extracted data, and appraised the risk of bias. A random-effects meta-analysis estimated pooled accuracy; subgroup and meta-regression analyses explored heterogeneity. Results: Twelve studies assessed 13,870 MCQs across seven exam systems and ten LLMs. Pooled accuracy was 69.6% (95% CI: 65.6–73.6%) with substantial heterogeneity (I² = 98%). GPT-4 outperformed GPT-3.5 (77.2% vs. 60.4%); domain-customised and newer models reached 93.6%. LLMs excelled in general medicine and pharmacology but underperformed in ethics and psychosocial integrity. Accuracy did not differ significantly by exam system (p = 0.14), question difficulty (p = 0.90) or format (p = 0.96). In meta-regression, Custom GPT (p = 0.0006) and Qwen 2.5 (p = 0.026) were the only significant predictors of higher accuracy; no exam system, question format, or difficulty level reached significance. Methodological variability and underreporting of model parameters were common. Conclusions: LLMs show promise for low-stakes educational applications, such as formative assessments within hybrid teaching models; however, they are unsuitable for unmoderated, high-stakes licensure decisions due to inconsistent performance. Regulatory guidelines, equitable access, and nursing-specific model development are needed to ensure fairness and validity. Research must prioritise standardised frameworks, error analysis, and broader geographic representation to address these limitations. | |
| dc.identifier.citation | Nurse Education Today, ISSN: 0260-6917 (Print); 1532-2793 (Online), Elsevier BV, 165, 107154-. doi: 10.1016/j.nedt.2026.107154 | |
| dc.identifier.doi | 10.1016/j.nedt.2026.107154 | |
| dc.identifier.issn | 0260-6917 | |
| dc.identifier.issn | 1532-2793 | |
| dc.identifier.uri | http://hdl.handle.net/10292/21211 | |
| dc.language | eng | |
| dc.publisher | Elsevier BV | |
| dc.relation.uri | https://www.sciencedirect.com/science/article/pii/S0260691726001826 | |
| dc.rights | © 2026 The Author(s). Published by Elsevier Ltd. This is an open access article distributed under the terms of the Creative Commons CC-BY license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. You are not required to obtain permission to reuse this article. | |
| dc.rights.accessrights | OpenAccess | |
| dc.rights.uri | https://creativecommons.org/licenses/by/4.0/ | |
| dc.subject | Artificial Intelligence | |
| dc.subject | Educational measurement | |
| dc.subject | Large language models | |
| dc.subject | Licensure | |
| dc.subject | Nursing | |
| dc.subject | Systematic review | |
| dc.subject | 4203 Health Services and Systems | |
| dc.subject | 42 Health Sciences | |
| dc.subject | 1110 Nursing | |
| dc.subject | 1302 Curriculum and Pedagogy | |
| dc.subject | 3901 Curriculum and pedagogy | |
| dc.subject | 4204 Midwifery | |
| dc.subject | 4205 Nursing | |
| dc.title | Performance of Large Language Models on Nursing Licensure Examinations: A Systematic Review and Meta-analysis | |
| dc.type | Journal Article | |
| pubs.elements-id | 761826 |
Files
Original bundle
- Name: Amankwaa et al._2026_Performance of large language models on nursing licensure examinations_.pdf
- Size: 924.81 KB
- Format: Adobe Portable Document Format
- Description: Journal article
