-
Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations.
The formulation of clinical recommendations pertaining to bariatric surgery is essential in guiding healthcare professionals. However, the extensive and continuously evolving body of literature in bariatric surgery presents a considerable challenge to staying abreast of the latest developments and acquiring information efficiently. Artificial intelligence (AI) has the potential to streamline access to the salient points of clinical recommendations in bariatric surgery.
The study aims to appraise the quality and readability of AI-chat-generated answers to frequently asked clinical inquiries in the field of bariatric and metabolic surgery.
Remote.
Question prompts were created based on pre-existing clinical practice guidelines regarding bariatric and metabolic surgery and were queried into 3 AI large language models (LLMs): OpenAI ChatGPT-4, Microsoft Bing, and Google Bard. The responses from each LLM were entered into a spreadsheet for randomized, blinded duplicate review. Accredited bariatric surgeons in North America independently assessed the appropriateness of each recommendation using a 5-point Likert scale; scores of 4 and 5 were deemed appropriate, while scores of 1-3 indicated a lack of appropriateness. A Flesch Reading Ease (FRE) score was calculated to assess the readability of the responses generated by each LLM.
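For context, the FRE score is computed as 206.835 minus 1.015 times (words per sentence) minus 84.6 times (syllables per word). The minimal Python sketch below assumes a simple vowel-group heuristic for syllable counting; the study does not state which readability tool it used, so this is illustrative only.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)

    def count_syllables(word: str) -> int:
        # Crude heuristic: count groups of consecutive vowels, minimum of one per word.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

# Lower scores (as reported for the LLM responses) indicate harder, more technical text.
print(round(flesch_reading_ease("Bariatric surgery reduces gastric volume. Patients require lifelong follow-up."), 1))
```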
There was a significant difference between the 3 LLMs in their 5-point Likert scores, with mean values of 4.46 (SD .82), 3.89 (.80), and 3.11 (.72) for ChatGPT-4, Bard, and Bing (P < .001). There was a significant difference between the 3 LLMs in the proportion of appropriate answers, with ChatGPT-4 at 85.7%, Bard at 74.3%, and Bing at 25.7% (P < .001). The mean FRE scores for ChatGPT-4, Bard, and Bing were 21.68 (SD 2.78), 42.89 (4.03), and 14.64 (5.09), respectively, with higher scores representing easier readability.
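The abstract does not name the statistical tests behind these P values; the sketch below shows one plausible approach, assuming a Kruskal-Wallis test for the ordinal Likert ratings and a chi-square test for the proportions of appropriate answers, and assuming 35 questions per model so that 30/35, 26/35, and 9/35 reproduce the reported percentages. All numbers are illustrative.

```python
from scipy import stats

# Hypothetical 5-point Likert ratings for a handful of questions per model (illustrative only).
chatgpt4 = [5, 4, 5, 4, 5, 3, 4, 5, 4, 5]
bard     = [4, 4, 3, 4, 5, 3, 4, 4, 3, 4]
bing     = [3, 3, 2, 3, 4, 2, 3, 3, 3, 4]

# Ordinal ratings across three independent groups: Kruskal-Wallis H test.
h_stat, p_likert = stats.kruskal(chatgpt4, bard, bing)

# Appropriate (score 4-5) vs. not appropriate (1-3) counts, assuming 35 questions per model.
appropriate   = [30, 26, 9]          # ~85.7%, 74.3%, 25.7% of 35
inappropriate = [35 - a for a in appropriate]
chi2, p_prop, dof, _ = stats.chi2_contingency([appropriate, inappropriate])

print(f"Likert ratings:  H = {h_stat:.2f}, p = {p_likert:.4f}")
print(f"Appropriateness: chi2 = {chi2:.2f}, p = {p_prop:.4f}")
```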
LLM-based AI chat models can effectively generate appropriate responses to clinical questions related to bariatric surgery, though the performance of different models can vary greatly. Therefore, caution should be taken when interpreting clinical information provided by LLMs, and clinician oversight is necessary to ensure accuracy. Future investigation is warranted to explore how LLMs might enhance healthcare provision and clinical decision-making in bariatric surgery.
Lee Y, Shin T, Tessier L, Javidan A, Jung J, Hong D, Strong AT, McKechnie T, Malone S, Jin D, Kroh M, Dang JT, ASMBS Artificial Intelligence and Digital Surgery Task Force
... -
《-》
-
PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation.
The aim of this study was to evaluate and compare artificial intelligence (AI)-based large language models (LLMs) (ChatGPT-3.5, Bing, and Bard) with human-based formulations in generating relevant clinical queries, using comprehensive methodological evaluations.
To interact with the major LLMs (ChatGPT-3.5, Bing Chat, and Google Bard), scripts and prompts were designed to formulate PICOT (population, intervention, comparison, outcome, time) clinical questions and search strategies. The quality of the LLMs' responses was assessed using a descriptive approach and independent assessment by two researchers. To determine the number of hits, the search strings generated by the three LLMs, together with an additional string formulated by an expert, were run separately and without search restrictions in PubMed, Web of Science, Cochrane Library, and CINAHL Ultimate, and the results were imported. Hits from one of the scenarios were also exported for relevance evaluation; a single scenario was used to provide a focused analysis. Cronbach's alpha and the intraclass correlation coefficient (ICC) were also calculated.
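The exact prompts are not reproduced in the abstract; the snippet below is a hypothetical template of the kind of PICOT-plus-search-strategy prompt that might be pasted into ChatGPT-3.5, Bing Chat, or Google Bard. The clinical scenario and wording are illustrative assumptions, not the study's actual prompts.

```python
# Hypothetical PICOT prompt template; the scenario and wording are illustrative only.
PICOT_PROMPT = """You are assisting with evidence-based nursing practice.
Clinical scenario: {scenario}

1. Formulate a PICOT question, explicitly labelling:
   P (population), I (intervention), C (comparison), O (outcome), T (time).
2. Based on that PICOT question, generate a Boolean search strategy
   (keywords, synonyms, and MeSH terms joined with AND/OR) suitable for
   PubMed, Web of Science, Cochrane Library, and CINAHL Ultimate.
"""

scenario = ("Adult inpatients at risk of pressure injuries: does repositioning "
            "every 2 hours, compared with standard care, reduce pressure injury "
            "incidence during hospitalization?")

print(PICOT_PROMPT.format(scenario=scenario))  # paste the output into each chatbot
```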
In five different scenarios, ChatGPT-3.5 generated 11,859 hits, Bing 1,376,854, Bard 16,583, and an expert 5919 hits. We then used the first scenario to assess the relevance of the obtained results. The human expert search approach resulted in 65.22% (56/105) relevant articles. Bing was the most accurate AI-based LLM with 70.79% (63/89), followed by ChatGPT-3.5 with 21.05% (12/45), and Bard with 13.29% (42/316) relevant hits. Based on the assessment of two evaluators, ChatGPT-3.5 received the highest score (M = 48.50; SD = 0.71). Results showed a high level of agreement between the two evaluators. Although ChatGPT-3.5 showed a lower percentage of relevant hits compared to Bing, this reflects the nuanced evaluation criteria, where the subjective evaluation prioritized contextual accuracy and quality over mere relevance.
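As a companion to the agreement figures above, the sketch below shows one way two-rater reliability statistics of this kind could be computed, assuming the pingouin package and made-up scores; the actual rating data are not available in the abstract.

```python
import pandas as pd
import pingouin as pg

# Hypothetical scores from the two evaluators across 15 rated items (made-up numbers).
r1 = [48, 47, 49, 46, 48, 45, 44, 46, 45, 44, 40, 41, 39, 42, 40]
r2 = [49, 47, 48, 46, 49, 45, 45, 45, 45, 44, 41, 41, 39, 41, 41]

long = pd.DataFrame({
    "item":  list(range(len(r1))) * 2,
    "rater": ["R1"] * len(r1) + ["R2"] * len(r2),
    "score": r1 + r2,
})

# Intraclass correlation coefficient (pingouin reports several ICC variants).
icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])

# Cronbach's alpha with the two raters as columns of a wide table.
wide = long.pivot(index="item", columns="rater", values="score")
alpha, ci = pg.cronbach_alpha(data=wide)
print(f"Cronbach's alpha = {alpha:.3f}, 95% CI = {ci}")
```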
This study provides valuable insights into the ability of LLMs to formulate PICOT clinical questions and search strategies. AI-based LLMs, such as ChatGPT-3.5, demonstrate significant potential for augmenting clinical workflows, improving clinical query development, and supporting search strategies. However, the findings also highlight limitations that necessitate further refinement and continued human oversight.
AI could assist nurses in formulating PICOT clinical questions and search strategies. AI-based LLMs offer valuable support to healthcare professionals by improving the structure of clinical questions and enhancing search strategies, thereby significantly increasing the efficiency of information retrieval.
Gosak L, Štiglic G, Pruinelli L, Vrbnjak D
... -
《-》
-
Examining the Role of Large Language Models in Orthopedics: Systematic Review.
Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a significant socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent.
The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges.
PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of "large language model," "generative artificial intelligence," "ChatGPT," and "orthopaedics," were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials-Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment.
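The full search strings are not given in the abstract; the snippet below is a hypothetical illustration of how the two concept blocks could be combined and run against PubMed using Biopython's Entrez utilities. The query wording, date range handling, and retmax value are assumptions for illustration.

```python
from Bio import Entrez

Entrez.email = "your.name@example.org"  # NCBI requires a contact address

# Two concept blocks joined with AND, mirroring the review's category structure (illustrative terms).
llm_block = '("large language model*" OR "generative artificial intelligence" OR ChatGPT)'
ortho_block = '(orthopaedic* OR orthopedic*)'
query = f"{llm_block} AND {ortho_block} AND english[Language]"

handle = Entrez.esearch(db="pubmed", term=query, datetype="pdat",
                        mindate="2014/01/01", maxdate="2024/02/22", retmax=200)
record = Entrez.read(handle)
handle.close()

print(f"{record['Count']} records; first PMIDs: {record['IdList'][:5]}")
```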
A total of 68 studies were selected. The application of LLMs in orthopedics spanned clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in how the LLMs' performance was defined, measured, and evaluated across the different studies. For diagnostic tasks alone, accuracy ranged from 55% to 93%. For disease classification tasks, the accuracy of ChatGPT with GPT-4 ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, scores ranged from 45% to 73.6%, depending on the models and test selections.
LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.
Zhang C, Liu S, Zhou X, Zhou S, Tian Y, Wang S, Xu N, Li W
... -
《JOURNAL OF MEDICAL INTERNET RESEARCH》
-
Embracing Large Language Models for Adult Life Support Learning.
Background: It is recognised that large language models (LLMs) may aid medical education by supporting the understanding of explanations behind answers to multiple choice questions (MCQs). This study aimed to evaluate the efficacy of the LLM chatbots ChatGPT and Bard in answering an Intermediate Life Support pre-course MCQ test developed by the Resuscitation Council UK, focused on managing deteriorating patients and identifying causes of and treating cardiac arrest. We assessed the accuracy of responses and the quality of explanations to evaluate the utility of the chatbots.
Methods: The performance of the AI chatbots ChatGPT-3.5 and Bard was assessed on their ability to choose the correct answer and provide clear, comprehensive explanations for MCQs developed by the Resuscitation Council UK for its Intermediate Life Support Course. Ten MCQs were tested with a total score of 40, with one point scored for each accurate response to each statement a-d. In a separate scoring, questions were scored out of 1 if all sub-statements a-d were correct, giving a total score out of 10 for the test. The explanations provided by the AI chatbots were evaluated by three qualified physicians on a rating scale from 0-3 for each overall question, and median rater scores were calculated and compared. The Fleiss multi-rater kappa (κ) was used to determine score agreement among the three raters.
Results: When scoring each overall question to give a total score out of 10, Bard outperformed ChatGPT, although the difference was not significant (p=0.37). Furthermore, there was no statistically significant difference in the performance of ChatGPT compared to Bard when scoring each sub-question separately to give a total score out of 40 (p=0.26). The quality of explanations was similar for both LLMs. Importantly, despite answering certain questions incorrectly, both AI chatbots provided some useful correct information in their explanations of the answers to these questions. The Fleiss multi-rater kappa was 0.899 (p<0.001) for ChatGPT and 0.801 (p<0.001) for Bard.
Conclusions: The performances of Bard and ChatGPT were similar in answering the MCQs, with similar scores achieved. Notably, despite having access to data across the web, neither of the LLMs answered all questions accurately. This suggests that AI models still require further learning before they can be relied upon in medical education.
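As an aside on the agreement statistic reported above, Fleiss' kappa on the 0-3 explanation ratings could be computed as in the sketch below, assuming the statsmodels package and made-up ratings from three physicians across ten questions.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical explanation ratings (0-3) from three physicians for ten questions.
ratings = np.array([
    [3, 3, 3],
    [2, 2, 3],
    [3, 3, 3],
    [1, 1, 2],
    [2, 2, 2],
    [3, 3, 3],
    [0, 1, 0],
    [2, 2, 2],
    [3, 2, 3],
    [1, 1, 1],
])

# aggregate_raters converts a (subjects x raters) matrix into (subjects x categories) counts.
table, categories = aggregate_raters(ratings)
print("Fleiss' kappa:", round(fleiss_kappa(table, method="fleiss"), 3))
```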
Patel S, Patel R
《Cureus》
-
"Dr. AI Will See You Now": How Do ChatGPT-4 Treatment Recommendations Align With Orthopaedic Clinical Practice Guidelines?
Artificial intelligence (AI) is engineered to emulate tasks that have historically required human interaction and intellect, including learning, pattern recognition, decision-making, and problem-solving. Although AI models like ChatGPT-4 have demonstrated satisfactory performance on medical licensing exams, suggesting a potential for supporting medical diagnostics and decision-making, no study of which we are aware has evaluated the ability of these tools to make treatment recommendations when given clinical vignettes and representative medical imaging of common orthopaedic conditions. As AI continues to advance, a thorough understanding of its strengths and limitations is necessary to inform safe and helpful integration into medical practice.
(1) What is the concordance between ChatGPT-4-generated treatment recommendations for common orthopaedic conditions with both the American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines (CPGs) and an orthopaedic attending physician's treatment plan? (2) In what specific areas do the ChatGPT-4-generated treatment recommendations diverge from the AAOS CPGs?
Ten common orthopaedic conditions with associated AAOS CPGs were identified: carpal tunnel syndrome, distal radius fracture, glenohumeral joint osteoarthritis, rotator cuff injury, clavicle fracture, hip fracture, hip osteoarthritis, knee osteoarthritis, ACL injury, and acute Achilles rupture. For each condition, the medical records of 10 deidentified patients managed at our facility were used to construct clinical vignettes that each had an isolated, single diagnosis with adequate clarity. The vignettes also encompassed a range of diagnostic severity to evaluate more thoroughly adherence to the treatment guidelines outlined by the AAOS. These clinical vignettes were presented alongside representative radiographic imaging. The model was prompted to provide a single treatment plan recommendation. Each treatment plan was compared with established AAOS CPGs and to the treatment plan documented by the attending orthopaedic surgeon treating the specific patient. Vignettes where ChatGPT-4 recommendations diverged from CPGs were reviewed to identify patterns of error and summarized.
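The exact interface used to present the vignettes and radiographs is not described here; the sketch below is a hypothetical example using the OpenAI Python SDK with a vision-capable model, where the model name, prompt wording, and image URL are all assumptions for illustration rather than the study's actual setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

vignette = ("72-year-old woman with a displaced femoral neck fracture after a "
            "ground-level fall; independent ambulator, no major comorbidities.")
image_url = "https://example.org/hip_radiograph.png"  # placeholder radiograph

response = client.chat.completions.create(
    model="gpt-4o",  # stand-in for a GPT-4-class vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": f"Clinical vignette: {vignette}\n"
                     "Based on the vignette and the attached radiograph, provide a "
                     "single treatment plan recommendation."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```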
ChatGPT-4 provided treatment recommendations in accordance with the AAOS CPGs in 90% (90 of 100) of clinical vignettes. Concordance between ChatGPT-generated plans and the plan recommended by the treating orthopaedic attending physician was 78% (78 of 100). One hundred percent (30 of 30) of ChatGPT-4 recommendations for fracture vignettes and hip and knee arthritis vignettes matched CPG recommendations, whereas the model struggled most with recommendations for carpal tunnel syndrome (3 of 10 instances demonstrated discordance). ChatGPT-4 recommendations diverged from AAOS CPGs for three carpal tunnel syndrome vignettes; two each of the ACL injury, rotator cuff injury, and glenohumeral joint osteoarthritis vignettes; and one acute Achilles rupture vignette. In these situations, ChatGPT-4 most often struggled to correctly interpret injury severity and progression, to incorporate patient factors (such as lifestyle or comorbidities) into decision-making, and to recognize a contraindication to surgery.
ChatGPT-4 can generate accurate treatment plans aligned with CPGs but can also make mistakes when it is required to integrate multiple patient factors into decision-making and understand disease severity and progression. Physicians must critically assess the full clinical picture when using AI tools to support their decision-making.
ChatGPT-4 may be used as an on-demand diagnostic companion, but patient-centered decision-making should continue to remain in the hands of the physician.
Dagher T, Dwyer EP, Baker HP, Kalidoss S, Strelzow JA
... -
《-》