Evaluation of responses generated by different artificial intelligence chatbots to clinical decision-making, case-based questions in oral and maxillofacial surgery.
This study aims to evaluate the correctness of the answers generated by the Google Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing chatbots to clinical decision-making questions in oral and maxillofacial surgery (OMFS).
A group of 3 board-certified oral and maxillofacial surgeons designed a questionnaire of 50 case-based questions in multiple-choice and open-ended formats. The chatbots' answers to the multiple-choice questions were checked against the option selected by the 3 referees, and their answers to the open-ended questions were evaluated with a modified global quality scale. A P-value below .05 was considered statistically significant.
Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing answered 34%, 36%, 38%, 38%, and 26% of the questions correctly, respectively. For the open-ended questions, GPT-4 received the most answers graded "4" or "5," and Bing received the most answers graded "1" or "2." There were no statistically significant differences among the 5 chatbots in responding to the open-ended (P = .275) or multiple-choice (P = .699) questions.
Considering the major inaccuracies in the chatbots' responses, despite their relatively good performance on the open-ended questions, this technology cannot yet be trusted as a consultant for clinicians in decision-making situations.
Azadi A, Gorjinejad F, Mohammad-Rahimi H, Tabrizi R, Alam M, Golkar M
Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology.
In this study, we assessed the responses of 6 different artificial intelligence (AI) chatbots (Bing, GPT-3.5, GPT-4, Google Bard, Claude, and Sage) to controversial and difficult questions in oral pathology, oral medicine, and oral radiology.
The chatbots' answers were evaluated by board-certified specialists using a modified version of the global quality score on a 5-point Likert scale. The quality and validity of the chatbots' citations were also evaluated.
In oral pathology and oral medicine, Claude had the highest mean score (4.341 ± 0.582) and Bing the lowest (3.447 ± 0.566). In oral radiology, GPT-4 had the highest mean score (3.621 ± 1.009) and Bing the lowest (2.379 ± 0.978). Across all disciplines, GPT-4 achieved the highest overall mean score (4.066 ± 0.825). Of the citations generated by the chatbots, 82 out of 349 (23.50%) were fabricated.
GPT-4 was the best-performing chatbot at providing high-quality information on controversial topics across the dental disciplines examined. Although most of the chatbots performed well, developers of medical AI chatbots should incorporate scientific citation authenticators to validate outputted citations, given the relatively high number of fabricated citations.
Mohammad-Rahimi H, Khoury ZH, Alamdari MI, Rokhshad R, Motie P, Parsa A, Tavares T, Sciubba JJ, Price JB, Sultan AS
Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia.
Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.
Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were each scored from 1 to 5 (1 representing worst, 5 representing best).
ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores: 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (scores of ≥4 from 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (scores of ≥4 from 42%, 30%, and 12% of experts, respectively), and safety (scores of ≥4 from 50%, 40%, and 28% of experts, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed significantly in comprehensiveness (ChatGPT4, 3 [2-4] vs Bing Chat, 2 [2-3], P<0.001; Bard, 3 [2-4] vs Bing Chat, 2 [2-3], P=0.002). All large language model chatbots performed well on the communication quality categories, with no statistically significant differences in understandability (P=0.24), empathy (P=0.032), or ethics (P=0.465).
In answering patients' frequently asked questions about anaesthesia, the chatbots performed well on communication metrics but were suboptimal on medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, and both outperformed Bing Chat.
Nguyen TP, Carvalho B, Sukhdeo H, Joudi K, Guo N, Chen M, Wolpaw JT, Kiefer JJ, Byrne M, Jamroz T, Mootz AA, Reale SC, Zou J, Sultan P