-
ChatGPT and Gemini Are Not Consistently Concordant With the 2020 American Academy of Orthopaedic Surgeons Clinical Practice Guidelines When Evaluating Rotator Cuff Injury.
To evaluate the accuracy of recommendations provided by ChatGPT and Gemini (previously known as "Bard"), 2 widely used, publicly available large language models, regarding the management of rotator cuff injuries.
The 2020 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) were the basis for determining recommended and non-recommended treatments in this study. ChatGPT and Gemini were queried on 16 rotator cuff treatments addressed by these guidelines. The responses were categorized as "concordant" or "discordant" with the AAOS CPGs. The Cohen κ coefficient was calculated to assess inter-rater reliability.
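As a minimal sketch only (the rater labels below are hypothetical, not the study's data), the Cohen κ coefficient for two raters' concordant/discordant classifications could be computed with scikit-learn:

```python
# Hypothetical illustration (not study data): two raters independently
# label each LLM response as concordant (1) or discordant (0) with the
# AAOS CPGs; Cohen's kappa quantifies agreement beyond chance.
from sklearn.metrics import cohen_kappa_score

rater_1 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]  # hypothetical labels
rater_2 = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1]  # hypothetical labels

print(f"Cohen's kappa: {cohen_kappa_score(rater_1, rater_2):.2f}")
```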
ChatGPT and Gemini showed concordance with the AAOS CPGs for 13 of the 16 treatments queried (81%) and 12 of the 16 treatments queried (75%), respectively. ChatGPT provided responses discordant with the AAOS CPGs for 3 treatments (19%), whereas Gemini provided discordant responses for 4 treatments (25%). Assessment of inter-rater reliability showed a Cohen κ coefficient of 0.98, signifying near-perfect agreement between the raters in classifying the responses of ChatGPT and Gemini as concordant or discordant with the AAOS CPGs.
ChatGPT and Gemini do not consistently provide responses that align with the AAOS CPGs.
This study provides evidence that cautions patients not to rely solely on artificial intelligence for recommendations about rotator cuff injuries.
Megafu M, Guerrero O, Yendluri A, Parsons BO, Galatz LM, Li X, Kelly JD 4th, Parisien RL
《-》
-
Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.
To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).
All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4), Gemini, Mistral-7B, and Claude-3 were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.
Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 16.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support the recommendations.
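As an illustration only, SciPy's Fisher exact test can be run on a pairwise 2 × 2 table of concordant versus non-concordant counts, here using the ChatGPT-4 and Mistral-7B counts reported above; the study's P = .12 comes from an overall comparison across all 4 LLMs, which this sketch does not reproduce:

```python
# Illustrative pairwise comparison only (not the study's overall test):
# rows = LLMs, columns = concordant vs non-concordant responses out of 48.
from scipy.stats import fisher_exact

table = [
    [38, 48 - 38],  # ChatGPT-4: concordant, non-concordant
    [28, 48 - 28],  # Mistral-7B: concordant, non-concordant
]
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, P = {p_value:.3f}")
```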
Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.
Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid the bias associated with narrow evaluation of a few models, as observed in the current literature.
Nwachukwu BU, Varady NH, Allen AA, Dines JS, Altchek DW, Williams RJ 3rd, Kunze KN
《-》
-
"Dr. AI Will See You Now": How Do ChatGPT-4 Treatment Recommendations Align With Orthopaedic Clinical Practice Guidelines?
Artificial intelligence (AI) is engineered to emulate tasks that have historically required human interaction and intellect, including learning, pattern recognition, decision-making, and problem-solving. Although AI models like ChatGPT-4 have demonstrated satisfactory performance on medical licensing exams, suggesting a potential for supporting medical diagnostics and decision-making, no study of which we are aware has evaluated the ability of these tools to make treatment recommendations when given clinical vignettes and representative medical imaging of common orthopaedic conditions. As AI continues to advance, a thorough understanding of its strengths and limitations is necessary to inform safe and helpful integration into medical practice.
(1) What is the concordance between ChatGPT-4-generated treatment recommendations for common orthopaedic conditions with both the American Academy of Orthopaedic Surgeons (AAOS) clinical practice guidelines (CPGs) and an orthopaedic attending physician's treatment plan? (2) In what specific areas do the ChatGPT-4-generated treatment recommendations diverge from the AAOS CPGs?
Ten common orthopaedic conditions with associated AAOS CPGs were identified: carpal tunnel syndrome, distal radius fracture, glenohumeral joint osteoarthritis, rotator cuff injury, clavicle fracture, hip fracture, hip osteoarthritis, knee osteoarthritis, ACL injury, and acute Achilles rupture. For each condition, the medical records of 10 deidentified patients managed at our facility were used to construct clinical vignettes, each with an isolated, single diagnosis of adequate clarity. The vignettes also encompassed a range of diagnostic severity to more thoroughly evaluate adherence to the treatment guidelines outlined by the AAOS. These clinical vignettes were presented alongside representative radiographic imaging. The model was prompted to provide a single treatment plan recommendation. Each treatment plan was compared with the established AAOS CPGs and with the treatment plan documented by the attending orthopaedic surgeon treating the specific patient. Vignettes in which ChatGPT-4 recommendations diverged from the CPGs were reviewed to identify and summarize patterns of error.
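The study submitted vignettes and imaging through the ChatGPT-4 interface; purely as a hedged, text-only sketch of how such a single-recommendation query might be issued programmatically (the model name, prompt wording, and vignette text below are assumptions, not the authors' protocol), one could use the OpenAI Python client:

```python
# Hypothetical sketch: query an LLM for a single treatment plan
# recommendation from a de-identified clinical vignette (text only;
# the study also supplied representative radiographic imaging).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

vignette = "A 67-year-old patient presents with ..."  # placeholder vignette text

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system", "content": "You are assisting with orthopaedic treatment planning."},
        {"role": "user", "content": vignette + " Provide a single treatment plan recommendation."},
    ],
)
print(response.choices[0].message.content)
```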
ChatGPT-4 provided treatment recommendations in accordance with the AAOS CPGs in 90% (90 of 100) of clinical vignettes. Concordance between ChatGPT-generated plans and the plan recommended by the treating orthopaedic attending physician was 78% (78 of 100). One hundred percent (30 of 30) of ChatGPT-4 recommendations for fracture vignettes and for hip and knee arthritis vignettes matched CPG recommendations, whereas the model struggled most with recommendations for carpal tunnel syndrome (3 of 10 instances demonstrated discordance). ChatGPT-4 recommendations diverged from the AAOS CPGs for three carpal tunnel syndrome vignettes; two vignettes each of ACL injury, rotator cuff injury, and glenohumeral joint osteoarthritis; and one acute Achilles rupture vignette. In these situations, ChatGPT-4 most often struggled to correctly interpret injury severity and progression, incorporate patient factors (such as lifestyle or comorbidities) into decision-making, and recognize a contraindication to surgery.
ChatGPT-4 can generate accurate treatment plans aligned with CPGs but can also make mistakes when it is required to integrate multiple patient factors into decision-making and understand disease severity and progression. Physicians must critically assess the full clinical picture when using AI tools to support their decision-making.
ChatGPT-4 may be used as an on-demand diagnostic companion, but patient-centered decision-making should continue to remain in the hands of the physician.
Dagher T, Dwyer EP, Baker HP, Kalidoss S, Strelzow JA
《-》
-
Do ChatGPT and Gemini Provide Appropriate Recommendations for Pediatric Orthopaedic Conditions?
Artificial intelligence (AI), and in particular large language models (LLMs) such as Chat Generative Pre-Trained Transformer (ChatGPT) and Gemini, have provided additional resources for patients to research the management of healthcare conditions, both for their own edification and to advocate in the care of their children. The accuracy of these models, however, and the sources from which they draw their conclusions have been largely unstudied in pediatric orthopaedics. This research aimed to assess the reliability of these machine learning tools in providing appropriate recommendations for the care of common pediatric orthopaedic conditions.
ChatGPT and Gemini were queried using plain language generated from the American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) listed on the Pediatric Orthopedic Society of North America (POSNA) web page. Two independent reviewers assessed the accuracy of the responses, and chi-square analyses were used to compare the 2 LLMs. Inter-rater reliability was calculated via Cohen's Kappa coefficient. If research studies were cited, attempts were made to assess their legitimacy by searching the PubMed and Google Scholar databases.
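A minimal sketch, with hypothetical counts, of the kind of chi-square comparison between the two LLMs described here (SciPy's chi2_contingency on a 2 × 2 table of responses agreeing versus not agreeing with the AAOS CPGs):

```python
# Hypothetical counts only (not study data): rows = chatbots,
# columns = responses agreeing vs not agreeing with the AAOS CPGs.
from scipy.stats import chi2_contingency

table = [
    [16, 8],  # ChatGPT: agree, not agree (hypothetical)
    [17, 7],  # Gemini:  agree, not agree (hypothetical)
]
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, P = {p_value:.3f}")
```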
ChatGPT and Gemini performed similarly, agreeing with the AAOS CPGs at rates of 67% and 69%, respectively. No significant differences were observed in performance between the 2 LLMs. ChatGPT did not reference specific studies in any response, whereas Gemini referenced a total of 16 research papers across 6 of 24 responses. Twelve of the 16 referenced studies contained errors: 7 could not be identified, and 5 contained discrepancies regarding publication year, journal, or attribution of authorship.
The LLMs investigated frequently aligned with the AAOS CPGs; however, the rate of neutral statements or disagreement with consensus recommendations was substantial, and the cited sources frequently contained errors. These findings suggest there remains room for growth and transparency in the development of the models that power these AI tools, and they may not yet represent the best source of up-to-date healthcare information for patients or providers.
Pirkle S, Yang J, Blumberg TJ
《-》
-
Pediatric Supracondylar Humerus and Diaphyseal Femur Fractures: A Comparative Analysis of Chat Generative Pretrained Transformer and Google Gemini Recommendations Versus American Academy of Orthopaedic Surgeons Clinical Practice Guidelines.
Artificial intelligence (AI) chatbots, including chat generative pretrained transformer (ChatGPT) and Google Gemini, have significantly increased access to medical information. However, in pediatric orthopaedics, no study has evaluated the accuracy of AI chatbots compared with evidence-based recommendations, including the American Academy of Orthopaedic Surgeons clinical practice guidelines (AAOS CPGs). The aims of this study were to compare responses by ChatGPT-4.0, ChatGPT-3.5, and Google Gemini with AAOS CPG recommendations on pediatric supracondylar humerus and diaphyseal femur fractures regarding accuracy, supplementary and incomplete response patterns, and readability.
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were prompted with questions created from 13 evidence-based recommendations (6 from the 2011 AAOS CPG on pediatric supracondylar humerus fractures; 7 from the 2020 AAOS CPG on pediatric diaphyseal femur fractures). Responses were anonymized and independently evaluated by 2 pediatric orthopaedic attending surgeons. Supplementary responses were additionally evaluated on whether no, some, or many modifications were necessary. Readability metrics (response length, Flesch-Kincaid reading level, Flesch Reading Ease, Gunning Fog Index) were compared. The Cohen κ coefficient was calculated to assess interrater reliability. χ2 analyses and single-factor analysis of variance were used to compare categorical and continuous variables, respectively. Statistical significance was set at P < 0.05.
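As a minimal sketch (assuming the third-party textstat package; the response text and score lists below are placeholders, not study data), the readability metrics named above could be computed for a chatbot response, and a metric compared across the three chatbots with a one-way ANOVA via SciPy:

```python
# Hypothetical illustration of the readability analysis.
import textstat
from scipy.stats import f_oneway

response_text = "Closed reduction and percutaneous pinning is generally recommended for ..."  # placeholder

print(textstat.flesch_kincaid_grade(response_text))  # Flesch-Kincaid grade level
print(textstat.flesch_reading_ease(response_text))   # Flesch Reading Ease
print(textstat.gunning_fog(response_text))           # Gunning Fog Index

# Placeholder per-response grade levels for each chatbot
gpt4_scores = [11.2, 12.5, 10.8, 11.9]
gpt35_scores = [12.0, 13.1, 11.6, 12.4]
gemini_scores = [9.8, 10.4, 9.1, 10.0]
f_stat, p_value = f_oneway(gpt4_scores, gpt35_scores, gemini_scores)
print(f"F = {f_stat:.2f}, P = {p_value:.3f}")
```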
ChatGPT-4.0, ChatGPT-3.5, and Google Gemini were accurate in 11/13, 9/13, and 11/13, supplementary in 13/13, 11/13, and 13/13, and incomplete in 3/13, 4/13, and 4/13 recommendations, respectively. Of 37 supplementary responses, 17 (45.9%), 19 (51.4%), and 1 (2.7%) required no, some, and many modifications, respectively. There were no significant differences in accuracy (P = 0.533), supplementary responses (P = 0.121), necessary modifications (P = 0.580), and incomplete responses (P = 0.881). Overall κ was moderate at 0.55. ChatGPT-3.5 provided shorter responses (P = 0.002), but Google Gemini was more readable in terms of Flesch-Kincaid Grade Level (P = 0.002), Flesch Reading Ease (P < 0.001), and Gunning Fog Index (P = 0.021).
While AI chatbots provided responses with reasonable accuracy, most supplemental information required modification and had complex readability. Improvements are necessary before AI chatbots can be reliably used for patient education.
Level IV.
Nian PP, Umesh A, Simpson SK, Tracey OC, Nichols E, Logterman S, Doyle SM, Heyer JH
《-》