Evaluation of responses generated by different artificial intelligence chatbots to clinical decision-making, case-based questions in oral and maxillofacial surgery.
This study aims to evaluate the correctness of the answers generated by the Google Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing chatbots to clinical decision-making questions in oral and maxillofacial surgery (OMFS).
A group of 3 board-certified oral and maxillofacial surgeons designed a questionnaire of 50 case-based questions in multiple-choice and open-ended formats. The chatbots' answers to the multiple-choice questions were checked against the option selected by the 3 referees, and their answers to the open-ended questions were evaluated with a modified global quality scale. A P-value below .05 was considered statistically significant.
Bard, GPT-3.5, GPT-4, Claude-Instant, and Bing answered 34%, 36%, 38%, 38%, and 26% of the questions correctly, respectively. For the open-ended questions, GPT-4 received the most answers graded "4" or "5," and Bing received the most answers graded "1" or "2." There were no statistically significant differences among the 5 chatbots in responding to the open-ended (P = .275) or multiple-choice (P = .699) questions.
Considering the major inaccuracies in the chatbots' responses, despite their relatively good performance on the open-ended questions, this technology cannot yet be trusted as a consultant for clinicians in decision-making situations.
Azadi A, Gorjinejad F, Mohammad-Rahimi H, Tabrizi R, Alam M, Golkar M
Performance of AI chatbots on controversial topics in oral medicine, pathology, and radiology.
In this study, we assessed the responses of 6 different artificial intelligence (AI) chatbots (Bing, GPT-3.5, GPT-4, Google Bard, Claude, and Sage) to controversial and difficult questions in oral pathology, oral medicine, and oral radiology.
The chatbots' answers were evaluated by board-certified specialists using a modified version of the global quality score on a 5-point Likert scale. The quality and validity of the chatbots' citations were also evaluated.
In oral pathology and oral medicine, Claude had the highest mean score (4.341 ± 0.582) and Bing the lowest (3.447 ± 0.566). In oral radiology, GPT-4 had the highest mean score (3.621 ± 1.009) and Bing the lowest (2.379 ± 0.978). Across all disciplines, GPT-4 achieved the highest overall mean score (4.066 ± 0.825). Of the citations generated by the chatbots, 82 out of 349 (23.50%) were fabricated.
GPT-4 was the best-performing chatbot at providing high-quality information on controversial topics across the dental disciplines examined. Although most of the chatbots performed well, developers of medical AI chatbots should incorporate scientific citation authenticators to validate outputted citations, given the relatively high number of fabricated citations.
Mohammad-Rahimi H, Khoury ZH, Alamdari MI, Rokhshad R, Motie P, Parsa A, Tavares T, Sciubba JJ, Price JB, Sultan AS
Comparison of artificial intelligence large language model chatbots in answering frequently asked questions in anaesthesia.
Patients are increasingly using artificial intelligence (AI) chatbots to seek answers to medical queries.
Ten frequently asked questions in anaesthesia were posed to three AI chatbots: ChatGPT4 (OpenAI), Bard (Google), and Bing Chat (Microsoft). Each chatbot's answers were evaluated in a randomised, blinded order by five residency programme directors from 15 medical institutions in the USA. Three medical content quality categories (accuracy, comprehensiveness, safety) and three communication quality categories (understandability, empathy/respect, and ethics) were each scored from 1 to 5 (1 representing worst, 5 representing best).
ChatGPT4 and Bard outperformed Bing Chat (median [inter-quartile range] scores: 4 [3-4], 4 [3-4], and 3 [2-4], respectively; P<0.001 with all metrics combined). All AI chatbots performed poorly in accuracy (scores of ≥4 from 58%, 48%, and 36% of experts for ChatGPT4, Bard, and Bing Chat, respectively), comprehensiveness (scores of ≥4 from 42%, 30%, and 12% of experts, respectively), and safety (scores of ≥4 from 50%, 40%, and 28% of experts, respectively). Notably, answers from ChatGPT4, Bard, and Bing Chat differed significantly in comprehensiveness (ChatGPT4, 3 [2-4] vs Bing Chat, 2 [2-3], P<0.001; Bard, 3 [2-4] vs Bing Chat, 2 [2-3], P=0.002). All large language model chatbots performed well on the communication quality categories, with no statistically significant differences in understandability (P=0.24), empathy (P=0.032), or ethics (P=0.465).
In answering patients' frequently asked questions about anaesthesia, the chatbots performed well on communication metrics but were suboptimal on medical content metrics. Overall, ChatGPT4 and Bard were comparable to each other, and both outperformed Bing Chat.
Nguyen TP, Carvalho B, Sukhdeo H, Joudi K, Guo N, Chen M, Wolpaw JT, Kiefer JJ, Byrne M, Jamroz T, Mootz AA, Reale SC, Zou J, Sultan P