Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.

Source: PubMed

Authors:

Nwachukwu BU, Varady NH, Allen AA, Dines JS, Altchek DW, Williams RJ 3rd, Kunze KN

Abstract:

Purpose: To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).

Methods: All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.

Results: Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 16.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.

Conclusions: Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.

Clinical Relevance: Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.
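
The proportions reported in the abstract can be checked with simple arithmetic. The sketch below is an illustration, not the authors' analysis code. It assumes each of the 4 LLMs answered all 48 guideline questions (33 + 15), giving 192 responses in total. These denominators are inferred from the counts and are not stated explicitly in the abstract. The names `pct`, `overall`, and `per_model` are hypothetical.

```python
# Counts are taken from the abstract; denominators are an inference, not a quote.
N_GUIDELINES = 33 + 15            # rotator cuff + ACL recommendations
N_MODELS = 4                      # ChatGPT-4, Gemini, Mistral-7B, Claude-3
TOTAL = N_GUIDELINES * N_MODELS   # assumed total number of graded responses


def pct(count, denom):
    """Return count/denom as a percentage rounded to one decimal place."""
    return round(100 * count / denom, 1)


# Overall grading counts across all models.
overall = {"concordant": 135, "indeterminate": 43, "discordant": 14}

# Per-model counts that the abstract reports (it does not give every cell).
per_model = {
    ("ChatGPT-4", "concordant"): 38,
    ("Mistral-7B", "concordant"): 28,
    ("Mistral-7B", "indeterminate"): 17,
    ("Claude-3", "indeterminate"): 8,
    ("Gemini", "discordant"): 6,
    ("ChatGPT-4", "discordant"): 1,
}

overall_pct = {grade: pct(n, TOTAL) for grade, n in overall.items()}
per_model_pct = {key: pct(n, N_GUIDELINES) for key, n in per_model.items()}

# Share of responses that were not concordant (indeterminate or discordant).
non_concordant_pct = pct(overall["indeterminate"] + overall["discordant"], TOTAL)

if __name__ == "__main__":
    for grade, value in overall_pct.items():
        print(f"overall {grade}: {value}%")
    for (model, grade), value in per_model_pct.items():
        print(f"{model} {grade}: {value}%")
    print(f"overall non-concordant: {non_concordant_pct}%")
```

The non-concordant share computed here is the figure behind the "more than 1-in-4" statement. The reported P value cannot be recomputed this way, because the abstract does not give the concordant counts for all four models.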

DOI:

10.1016/j.arthro.2024.07.040

Cited by:

4
