Accuracy and Completeness of Large Language Models About Antibody-Drug Conjugates and Associated Ocular Adverse Effects.
The purpose of this study was to assess the accuracy and completeness of information generated by 3 large language models (LLMs) about antibody-drug conjugate (ADC)-associated ocular toxicities.
Twenty-two questions about ADCs, tisotumab vedotin, and mirvetuximab soravtansine were developed and input into ChatGPT 4.0, Bard, and LLaMA. Answers were rated by 4 ocular toxicity experts using standardized 6-point Likert scales for accuracy and completeness. ANOVA tests were conducted to compare the 3 subgroups, followed by pairwise t-tests. Interrater variability was assessed with Fleiss kappa tests.
The mean accuracy score was 4.62 (SD 0.89) for ChatGPT, 4.77 (SD 0.90) for Bard, and 4.41 (SD 1.09) for LLaMA. Both ChatGPT (P = 0.03) and Bard (P = 0.003) scored significantly better for accuracy when compared with LLaMA. The mean completeness score was 4.43 (SD 0.91) for ChatGPT, 4.57 (SD 0.93) for Bard, and 4.42 (SD 0.99) for LLaMA. There were no significant differences in completeness scores between groups. Fleiss kappa assessment for interrater variability was good (0.74) for accuracy and fair (0.31) for completeness.
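To illustrate the kind of analysis pipeline reported above, the following Python sketch applies a one-way ANOVA, pairwise paired t-tests, and Fleiss' kappa to hypothetical Likert ratings (22 questions rated by 4 experts per model). The rating values and the use of per-question mean ratings are assumptions for illustration, not the study data.

```python
# Illustrative sketch (hypothetical ratings, not study data): one-way ANOVA
# across three LLMs, pairwise t-tests, and Fleiss' kappa for inter-rater agreement.
import numpy as np
from scipy import stats
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Hypothetical 6-point Likert accuracy ratings: 22 questions x 4 raters per model.
chatgpt = rng.integers(3, 7, size=(22, 4))
bard    = rng.integers(3, 7, size=(22, 4))
llama   = rng.integers(2, 7, size=(22, 4))

# One-way ANOVA on per-question mean ratings for the three models.
f_stat, p_anova = stats.f_oneway(chatgpt.mean(axis=1),
                                 bard.mean(axis=1),
                                 llama.mean(axis=1))
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.3f}")

# Pairwise t-tests (paired, since the same questions are rated for each model).
for name, a, b in [("ChatGPT vs LLaMA", chatgpt, llama),
                   ("Bard vs LLaMA", bard, llama),
                   ("ChatGPT vs Bard", chatgpt, bard)]:
    t, p = stats.ttest_rel(a.mean(axis=1), b.mean(axis=1))
    print(f"{name}: t={t:.2f}, p={p:.3f}")

# Fleiss' kappa: rows = questions, columns = rating categories (counts of raters).
table, _ = aggregate_raters(chatgpt)
print(f"Fleiss kappa (ChatGPT ratings): {fleiss_kappa(table):.2f}")
```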
All 3 LLMs had relatively high accuracy and completeness ratings, showing that LLMs can provide sufficient answers on niche topics in ophthalmology. Our results indicate that ChatGPT and Bard may provide slightly more accurate answers than LLaMA. As further research and treatment plans are developed for ADC-associated ocular toxicities, these LLMs should be reassessed to confirm that their answers remain complete, accurate, and in line with current medical knowledge.
Marshall R, Xu H, Dalvin LA, Mishra K, Edalat C, Kirupaharan N, Francis JH, Berkenstock M
... -
《-》
Assessing the Accuracy of Information on Medication Abortion: A Comparative Analysis of ChatGPT and Google Bard AI.
Background and objective ChatGPT and Google Bard AI are widely used conversational chatbots, even in healthcare. While they have several strengths, they can generate seemingly correct but erroneous responses, warranting caution in medical contexts. In an era where access to abortion care is diminishing, patients may increasingly rely on online resources and AI-driven language models for information on medication abortions. In light of this, the present study aimed to compare the accuracy and comprehensiveness of responses generated by ChatGPT 3.5 and Google Bard AI to medical queries about medication abortions.
Methods Fourteen open-ended questions about medication abortion were formulated based on the Frequently Asked Questions (FAQs) from the National Abortion Federation (NAF) and the Reproductive Health Access Project (RHAP) websites. These questions were answered using ChatGPT version 3.5 and Google Bard AI on October 7, 2023. The accuracy of the responses was analyzed by cross-referencing the generated answers against the information provided by NAF and RHAP. Any discrepancies were further verified against the guidelines from the American Congress of Obstetricians and Gynecologists (ACOG). A rating scale used by Johnson et al. was employed for assessment, utilizing a 6-point Likert scale [ranging from 1 (completely incorrect) to 6 (correct)] to evaluate accuracy and a 3-point scale [ranging from 1 (incomplete) to 3 (comprehensive)] to assess completeness. Questions that did not yield answers were assigned a score of 0 and omitted from the correlation analysis. Data analysis and visualization were done using R Software version 4.3.1. Statistical significance was assessed using Spearman's rank correlation and the Mann-Whitney U test.
Results All questions were entered sequentially into both chatbots by the same author. On the initial attempt, ChatGPT successfully generated relevant responses for all questions, while Google Bard AI failed to provide answers for five questions. Repeating the same question in Google Bard AI yielded an answer for one; two were answered with different phrasing; and two remained unanswered despite rephrasing. ChatGPT showed a median accuracy score of 5 (mean: 5.26, SD: 0.73) and a median completeness score of 3 (mean: 2.57, SD: 0.51). It showed the highest accuracy score in six responses and the highest completeness score in eight responses. In contrast, Google Bard AI had a median accuracy score of 5 (mean: 4.5, SD: 2.03) and a median completeness score of 2 (mean: 2.14, SD: 1.03). It achieved the highest accuracy score in five responses and the highest completeness score in six responses. Spearman's correlation coefficient revealed no correlation between accuracy and completeness for ChatGPT (rs = -0.46771, p = 0.09171). However, Google Bard AI showed a marginally significant correlation (rs = 0.5738, p = 0.05108). The Mann-Whitney U test indicated no statistically significant differences between ChatGPT and Google Bard AI concerning accuracy (U = 82, p>0.05) or completeness (U = 78, p>0.05).
Conclusion While both chatbots showed similar levels of accuracy, minor errors were noted pertaining to finer aspects that demand specialized knowledge of abortion care. This could explain the lack of a significant correlation between accuracy and completeness. Ultimately, AI-driven language models have the potential to provide information on medication abortions, but there is a need for continual refinement and oversight.
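As a companion to the methods above, the following Python sketch uses hypothetical accuracy and completeness scores (not the study data) to compute Spearman's correlation between accuracy and completeness for one chatbot after dropping unanswered items scored 0, and a Mann-Whitney U test comparing the accuracy distributions of the two chatbots. The score arrays are illustrative assumptions.

```python
# Minimal sketch (hypothetical scores): Spearman correlation between accuracy and
# completeness per chatbot, and a Mann-Whitney U test comparing the two chatbots.
import numpy as np
from scipy import stats

# Hypothetical 6-point accuracy and 3-point completeness scores for 14 questions.
chatgpt_acc  = np.array([5, 6, 5, 4, 6, 5, 5, 6, 4, 5, 6, 5, 6, 5])
chatgpt_comp = np.array([3, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 3, 2])
bard_acc     = np.array([5, 6, 4, 0, 6, 5, 2, 6, 0, 5, 6, 4, 5, 5])  # 0 = no answer
bard_comp    = np.array([2, 3, 2, 0, 3, 2, 1, 3, 0, 2, 3, 2, 2, 2])

# Spearman's rho for one chatbot (unanswered items scored 0 are dropped first).
answered = bard_acc > 0
rho, p = stats.spearmanr(bard_acc[answered], bard_comp[answered])
print(f"Bard accuracy vs completeness: rho={rho:.2f}, p={p:.3f}")

# Mann-Whitney U test comparing accuracy distributions of the two chatbots.
u, p_u = stats.mannwhitneyu(chatgpt_acc, bard_acc, alternative="two-sided")
print(f"Accuracy, ChatGPT vs Bard: U={u:.0f}, p={p_u:.3f}")
```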
Mediboina A, Badam RK, Chodavarapu S
《Cureus》
Evidence-based potential of generative artificial intelligence large language models in orthodontics: a comparative study of ChatGPT, Google Bard, and Microsoft Bing.
The increasing utilization of generative artificial intelligence large language models (LLMs) across various medical and dental fields, and specifically in orthodontics, raises questions about their accuracy.
This study aimed to assess and compare the answers offered by four LLMs: Google's Bard, OpenAI's ChatGPT-3.5 and ChatGPT-4, and Microsoft's Bing, in response to clinically relevant questions in the field of orthodontics.
Ten open-type clinical orthodontics-related questions were posed to the LLMs. The responses provided by the LLMs were assessed on a scale ranging from 0 (minimum) to 10 (maximum) points, benchmarked against robust scientific evidence, including consensus statements and systematic reviews, using a predefined rubric. After a 4-week interval from the initial evaluation, the answers were reevaluated to gauge intra-evaluator reliability. Statistical comparisons were conducted on the scores using Friedman's and Wilcoxon's tests to identify the model providing the answers with the most comprehensiveness, scientific accuracy, clarity, and relevance.
Overall, no statistically significant differences were detected between the scores given by the two evaluators on either scoring occasion, so an average score was computed for every LLM. The highest-scoring answers were those of Microsoft Bing Chat (average score = 7.1), followed by ChatGPT-4 (average score = 4.7), Google Bard (average score = 4.6), and finally ChatGPT-3.5 (average score = 3.8). While Microsoft Bing Chat statistically outperformed ChatGPT-3.5 (P-value = 0.017) and Google Bard (P-value = 0.029), and ChatGPT-4 outperformed ChatGPT-3.5 (P-value = 0.011), all models occasionally produced answers lacking comprehensiveness, scientific accuracy, clarity, and relevance.
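The comparison described above can be sketched in Python as follows; the rubric scores are made up for illustration and do not reproduce the study data. The sketch runs a Friedman test across the four models and pairwise Wilcoxon signed-rank tests on selected pairs.

```python
# Sketch with made-up 0-10 rubric scores: Friedman test across four models,
# followed by pairwise Wilcoxon signed-rank tests (paired by question).
import numpy as np
from scipy import stats

# Hypothetical rubric scores for 10 questions per model.
bing  = np.array([8, 7, 6, 8, 7, 7, 6, 8, 7, 7])
gpt4  = np.array([5, 4, 5, 6, 4, 5, 4, 5, 5, 4])
bard  = np.array([5, 4, 4, 5, 5, 4, 5, 4, 5, 5])
gpt35 = np.array([4, 3, 4, 4, 3, 4, 4, 3, 4, 5])

# Friedman test: non-parametric repeated-measures comparison of the four models.
chi2, p = stats.friedmanchisquare(bing, gpt4, bard, gpt35)
print(f"Friedman: chi2={chi2:.2f}, p={p:.3f}")

# Pairwise Wilcoxon signed-rank tests for selected model pairs.
for name, a, b in [("Bing vs ChatGPT-3.5", bing, gpt35),
                   ("Bing vs Bard", bing, bard),
                   ("ChatGPT-4 vs ChatGPT-3.5", gpt4, gpt35)]:
    w, p_w = stats.wilcoxon(a, b)
    print(f"{name}: W={w:.1f}, p={p_w:.3f}")
```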
The questions asked were indicative and did not cover the entire field of orthodontics.
Large language models (LLMs) show great potential in supporting evidence-based orthodontics. However, their current limitations pose a potential risk of incorrect healthcare decisions if they are utilized without careful consideration. Consequently, these tools cannot serve as a substitute for the orthodontist's essential critical thinking and comprehensive subject knowledge. For effective integration into practice, further research, clinical validation, and enhancements to the models are essential. Clinicians must be mindful of the limitations of LLMs, as their imprudent utilization could have adverse effects on patient care.
Makrygiannakis MA, Giannakopoulos K, Kaklamanos EG
《-》
Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use.
We aimed to assess the accuracy and completeness of large language models (LLMs), namely ChatGPT-3.5, ChatGPT-4, BARD, and Bing, when answering methotrexate (MTX)-related questions for treating rheumatoid arthritis. We employed 23 questions from an earlier study related to MTX concerns. These questions were entered into the LLMs, and the responses generated by each model were evaluated by two reviewers using Likert scales to assess accuracy and completeness. The GPT models achieved a 100% correct answer rate, while BARD and Bing each scored 73.91%. In terms of accuracy of the outputs (completely correct responses), GPT-4 achieved a score of 100%, GPT-3.5 secured 86.96%, and BARD and Bing each scored 60.87%. BARD produced 17.39% incorrect responses and 8.7% non-responses, while Bing recorded 13.04% incorrect and 13.04% non-responses. The ChatGPT models produced significantly more accurate responses than Bing for the "mechanism of action" category, and the GPT-4 model showed significantly higher accuracy than BARD in the "side effects" category. There were no statistically significant differences among the models for the "lifestyle" category. GPT-4 achieved a comprehensive output rate of 100%, followed by GPT-3.5 at 86.96%, BARD at 60.86%, and Bing at 0%. In the "mechanism of action" category, both ChatGPT models and BARD produced significantly more comprehensive outputs than Bing. For the "side effects" and "lifestyle" categories, the ChatGPT models showed significantly higher completeness than Bing. The GPT models, particularly GPT-4, demonstrated superior performance in providing accurate and comprehensive patient information about MTX use. However, the study also identified inaccuracies and shortcomings in the generated responses.
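As a rough illustration (not the authors' analysis), the Python sketch below tabulates the share of completely correct Likert-6 responses per model and compares two models with Fisher's exact test. The ratings are hypothetical, and the choice of Fisher's test is an assumption, since the abstract does not name the statistical test used.

```python
# Minimal sketch (hypothetical data): share of completely correct (Likert = 6)
# responses per model, then a Fisher's exact test comparing two models within
# one question category. Fisher's test is an assumed choice for illustration.
import numpy as np
from scipy import stats

# Hypothetical 6-point accuracy ratings for 23 MTX questions per model.
rng = np.random.default_rng(1)
gpt4 = np.full(23, 6)                 # assume all responses rated 6
bing = rng.integers(3, 7, size=23)    # assume mixed ratings

def completely_correct_rate(scores):
    """Share of responses rated 6 (completely correct) on the Likert scale."""
    return np.mean(scores == 6)

print(f"GPT-4: {completely_correct_rate(gpt4):.0%}, "
      f"Bing: {completely_correct_rate(bing):.0%}")

# 2x2 table: completely correct vs not, for the two models.
table = [[int((gpt4 == 6).sum()), int((gpt4 != 6).sum())],
         [int((bing == 6).sum()), int((bing != 6).sum())]]
odds, p = stats.fisher_exact(table)
print(f"Fisher's exact test: p={p:.3f}")
```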
Coskun BN, Yagiz B, Ocakoglu G, Dalkilic E, Pehlivan Y
... -
《-》