Performance of ChatGPT on Solving Orthopedic Board-Style Questions: A Comparative Analysis of ChatGPT 3.5 and ChatGPT 4.
Abstract:
The application of artificial intelligence and large language models in the medical field requires an evaluation of their accuracy in providing medical information. This study aimed to assess the performance of Chat Generative Pre-trained Transformer (ChatGPT) models 3.5 and 4 in solving orthopedic board-style questions. A total of 160 text-only questions from the Orthopedic Surgery Department at Seoul National University Hospital, conforming to the format of the Korean Orthopedic Association board certification examinations, were input into the ChatGPT 3.5 and ChatGPT 4 programs. The questions were divided into 11 subcategories. The accuracy rates of the initial answers provided by ChatGPT 3.5 and ChatGPT 4 were analyzed. In addition, inconsistency rates of answers were evaluated by regenerating the responses. ChatGPT 3.5 answered 37.5% of the questions correctly, while ChatGPT 4 showed an accuracy rate of 60.0% (p < 0.001). ChatGPT 4 demonstrated superior performance across most subcategories, except for tumor-related questions. The rates of inconsistency in answers were 47.5% for ChatGPT 3.5 and 9.4% for ChatGPT 4. ChatGPT 4 showed the ability to pass orthopedic board-style examinations, outperforming ChatGPT 3.5 in accuracy. However, inconsistencies in response generation and instances of incorrect answers with misleading explanations warrant caution when applying ChatGPT in clinical settings or for educational purposes.
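The abstract does not state which statistical test produced p < 0.001. As a rough plausibility check (a sketch, not the authors' method), a Pearson chi-square test on the implied 2×2 table, assuming unpaired counts of 60/160 vs. 96/160 correct answers, does reach significance well below that threshold:

```python
from math import erfc, sqrt

# Counts implied by the abstract: 160 questions per model,
# ChatGPT 3.5 correct on 37.5% (60), ChatGPT 4 on 60.0% (96).
n = 160
correct_35 = round(0.375 * n)  # 60
correct_4 = round(0.600 * n)   # 96

# 2x2 contingency table: rows = model, columns = correct / incorrect.
table = [[correct_35, n - correct_35],
         [correct_4, n - correct_4]]

# Pearson chi-square test of independence (df = 1), computed by hand.
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
total = sum(row)
chi2 = sum((table[i][j] - row[i] * col[j] / total) ** 2
           / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))

# For one degree of freedom, the p-value is erfc(sqrt(chi2 / 2)).
p = erfc(sqrt(chi2 / 2))
print(f"chi2 = {chi2:.2f}, p = {p:.1e}")
```

If the study instead used a paired test on the same 160 questions (e.g., McNemar's test), the exact statistic would depend on the discordant-pair counts, which the abstract does not report.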
DOI:
10.4055/cios23179

