Accuracy and consistency of online large language model-based artificial intelligence chat platforms in answering patients' questions about heart failure.
Heart failure (HF) is a prevalent condition associated with significant morbidity. Patients may have questions they feel embarrassed to ask, or they may face delays awaiting responses from their healthcare providers, which can affect their health behavior. We aimed to investigate the potential of large language model (LLM)-based artificial intelligence (AI) chat platforms to complement the delivery of patient-centered care.
Using online patient forums and physician experience, we created 30 questions related to the diagnosis, management, and prognosis of HF. The questions were posed to two LLM-based AI chat platforms (OpenAI's ChatGPT-3.5 and Google's Bard). Each set of answers was evaluated independently by two HF experts, blinded to each other, for accuracy (adequacy of content) and consistency of content.
ChatGPT provided mostly appropriate answers (27/30, 90%) and showed a high degree of consistency (93%). Bard provided similar content in its answers and was therefore evaluated only for adequacy (23/30, 77%). The two HF experts' grades were concordant for 83% and 67% of the questions for ChatGPT and Bard, respectively.
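As an illustration of the tabulation behind such figures, here is a minimal Python sketch (using hypothetical grades, not the study's data) of how answer adequacy and inter-rater concordance can be computed from two blinded reviewers' per-question grades:

```python
# Minimal sketch with hypothetical grades (not the study's data): two blinded
# reviewers grade each of 30 answers as "appropriate" or "inappropriate".
def adequacy_rate(grades):
    """Proportion of answers graded appropriate by a reviewer."""
    return sum(g == "appropriate" for g in grades) / len(grades)

def concordance(grades_a, grades_b):
    """Proportion of questions on which the two reviewers agree."""
    return sum(a == b for a, b in zip(grades_a, grades_b)) / len(grades_a)

reviewer_1 = ["appropriate"] * 27 + ["inappropriate"] * 3   # made-up grades
reviewer_2 = ["appropriate"] * 25 + ["inappropriate"] * 5   # made-up grades

print(f"Adequacy (reviewer 1): {adequacy_rate(reviewer_1):.0%}")
print(f"Inter-rater concordance: {concordance(reviewer_1, reviewer_2):.0%}")
```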
LLM-based AI chat platforms demonstrate potential for improving HF education and empowering patients; however, they currently suffer from factual errors and difficulty with more contemporary recommendations. Such inaccurate information may have serious, even life-threatening, implications for patients and should be considered and addressed in future research.
Kozaily E, Geagea M, Akdogan ER, Atkins J, Elshazly MB, Guglin M, Tedford RJ, Wehbe RM
《-》
Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
The increasing utilization of generative artificial intelligence large language models (LLMs) across various medical and dental fields, including orthodontics, raises questions about their accuracy.
This study aimed to assess and compare the answers offered by four LLMs (Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing) in response to clinically relevant questions within the field of orthodontics.
Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses provided by the LLMs were assessed on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. After a 4-week interval from the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. Statistical comparisons were conducted on the scores using Friedman's and Wilcoxon's tests to identify the model providing the answers with the most comprehensiveness, scientific accuracy, clarity, and relevance.
Overall, no statistically significant differences were detected between the scores given by the two evaluators on either scoring occasion, so an average score was computed for every LLM. The highest-scoring answers were those of Microsoft Bing Chat (average score = 7.1), followed by ChatGPT-4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT-3.5 (average score = 3.8). While Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P = 0.017) and Google Bard (P = 0.029), and ChatGPT-4 outperformed ChatGPT-3.5 (P = 0.011), all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, and relevance.
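For context, a hedged Python sketch of the analysis the abstract describes (using fabricated scores, not the study's data): Friedman's test across the four models' scores on the same ten questions, followed by pairwise Wilcoxon signed-rank tests:

```python
# Fabricated 0-10 rubric scores for four models on the same 10 questions;
# the actual study scores are not reproduced here.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(42)
models = ["Bing Chat", "ChatGPT-4", "Google Bard", "ChatGPT-3.5"]
scores = {m: rng.integers(2, 11, size=10).astype(float) for m in models}

# Omnibus test for any difference among the four related samples.
stat, p = friedmanchisquare(*scores.values())
print(f"Friedman: statistic={stat:.2f}, p={p:.3f}")

# Pairwise follow-up comparisons (no multiplicity correction shown).
for i, m1 in enumerate(models):
    for m2 in models[i + 1:]:
        _, pw = wilcoxon(scores[m1], scores[m2])
        print(f"{m1} vs {m2}: p={pw:.3f}")
```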
The questions asked were indicative and did not cover the entire field of orthodontics.
Large language models (LLMs) show great potential in supporting evidence-based orthodontics. However, their current limitations pose a risk of incorrect healthcare decisions if they are utilized without careful consideration. Consequently, these tools cannot substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.
Makrygiannakis MA, Giannakopoulos K, Kaklamanos EG
《-》
The performance of artificial intelligence chatbot large language models to address skeletal biology and bone health queries.
Artificial intelligence (AI) chatbots utilizing large language models (LLMs) have recently garnered significant interest due to their ability to generate humanlike responses to user inquiries in an interactive dialog format. While these models are increasingly used by patients, scientific and medical providers, and trainees to obtain medical information and address biomedical questions, their performance may vary from field to field, and the opportunities and risks these chatbots pose to the widespread understanding of skeletal health and science are unknown. Here we assess the performance of 3 high-profile LLM chatbots (Chat Generative Pre-Trained Transformer [ChatGPT] 4.0, BingAI, and Bard) in addressing questions in 3 categories: basic and translational skeletal biology, clinical practitioner management of skeletal disorders, and patient queries, in order to evaluate the accuracy and quality of the responses. Thirty questions in each of these categories were posed, and responses were independently graded for their degree of accuracy by four reviewers. While each of the chatbots was often able to provide relevant information about skeletal disorders, the quality and relevance of these responses varied widely, and ChatGPT 4.0 had the highest overall median score in each of the categories. Each of the chatbots displayed distinct limitations, including inconsistent, incomplete, or irrelevant responses; inappropriate use of lay sources in a professional context; failure to take patient demographics or clinical context into account when providing recommendations; and an inability to consistently identify areas of uncertainty in the relevant literature. Careful consideration of both the opportunities and risks of current AI chatbots is needed to formulate best-practice guidelines for their use as a source of information about skeletal health and biology.
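As a hedged illustration (placeholder numbers only, not the study's data), a short Python sketch of the kind of summary the abstract implies: the median of four reviewers' accuracy grades for each chatbot within each question category:

```python
# Placeholder grades on an assumed 1-4 accuracy scale; none of these values
# come from the study itself.
import statistics
from collections import defaultdict

# Each record: (chatbot, category, question, reviewer, accuracy grade)
grades = [
    ("ChatGPT 4.0", "skeletal biology", "Q1", "R1", 4),
    ("ChatGPT 4.0", "skeletal biology", "Q1", "R2", 3),
    ("ChatGPT 4.0", "skeletal biology", "Q1", "R3", 4),
    ("ChatGPT 4.0", "skeletal biology", "Q1", "R4", 4),
    ("Bard",        "skeletal biology", "Q1", "R1", 2),
    ("Bard",        "skeletal biology", "Q1", "R2", 3),
    ("BingAI",      "patient queries",  "Q1", "R1", 3),
    ("BingAI",      "patient queries",  "Q1", "R2", 2),
]

by_group = defaultdict(list)
for chatbot, category, _q, _r, grade in grades:
    by_group[(chatbot, category)].append(grade)

for (chatbot, category), vals in sorted(by_group.items()):
    print(f"{chatbot} / {category}: median grade = {statistics.median(vals)}")
```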
Cung M, Sosa B, Yang HS, McDonald MM, Matthews BG, Vlug AG, Imel EA, Wein MN, Stein EM, Greenblatt MB
《-》
Performance of artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in the American Society for Metabolic and Bariatric Surgery textbook of bariatric surgery questions.
The American Society for Metabolic and Bariatric Surgery (ASMBS) textbook serves as a comprehensive resource for bariatric surgery, covering recent advancements and clinical questions. Testing artificial intelligence (AI) engines against this authoritative source ensures accurate and up-to-date information and provides insight into their potential implications for surgical education and training.
To determine the quality of responses and to compare the ability of different large language models (LLMs) to answer textbook questions relating to bariatric surgery.
Remote.
The prompts entered into the LLMs were multiple-choice questions from "The ASMBS Textbook of Bariatric Surgery, Second Edition." The prompts were queried into 3 LLMs: OpenAI's ChatGPT-4, Microsoft's Bing, and Google's Bard. The generated responses were assessed based on overall accuracy, the number of correct answers according to subject matter, and the number of correct answers based on question type. Statistical analysis was performed to determine the number of responses per LLM per category that were correct.
Two hundred questions were used to query the AI models. There was an overall significant difference in the accuracy of answers, with an accuracy of 83.0% for ChatGPT-4, followed by Bard (76.0%) and Bing (65.0%). Subgroup analysis revealed a significant difference between the models' performance across question categories, with ChatGPT-4 demonstrating the highest proportion of correct answers on questions related to treatment and surgical procedures (83.1%) and complications (91.7%). There was also a significant difference in performance across question types, with ChatGPT-4 showing superior performance on inclusionary questions. Bard and Bing were unable to answer certain questions, whereas ChatGPT-4 left no questions unanswered.
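A hedged sketch of one plausible analysis consistent with these results (the abstract does not name the specific test used): a chi-square test of independence on correct/incorrect counts per model, with counts reconstructed from the reported percentages of 200 questions:

```python
# Counts reconstructed from the reported accuracies (83.0%, 76.0%, 65.0% of 200
# questions); the choice of a chi-square test here is an assumption, not a
# detail stated in the abstract.
from scipy.stats import chi2_contingency

counts = {
    "ChatGPT-4": (166, 34),   # (correct, incorrect)
    "Bard":      (152, 48),
    "Bing":      (130, 70),
}

table = [list(pair) for pair in counts.values()]
chi2, p, dof, _expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

for model, (correct, incorrect) in counts.items():
    total = correct + incorrect
    print(f"{model}: {correct}/{total} correct ({correct / total:.1%})")
```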
LLMs, particularly ChatGPT-4, demonstrated promising accuracy when answering clinical questions related to bariatric surgery. Continued AI advancements and research are required to elucidate the potential applications of LLMs in training and education.
Lee Y, Tessier L, Brar K, Malone S, Jin D, McKechnie T, Jung JJ, Kroh M, Dang JT, ASMBS Artificial Intelligence and Digital Surgery Taskforce
《-》