-
Assessing the Responses of Large Language Models (ChatGPT-4, Gemini, and Microsoft Copilot) to Frequently Asked Questions in Breast Imaging: A Study on Readability and Accuracy.
Background Large language models (LLMs), such as ChatGPT-4, Gemini, and Microsoft Copilot, have been instrumental in various domains, including healthcare, where they enhance health literacy and aid in patient decision-making. Given the complexities involved in breast imaging procedures, accurate and comprehensible information is vital for patient engagement and compliance. This study aims to evaluate the readability and accuracy of the information provided by three prominent LLMs, ChatGPT-4, Gemini, and Microsoft Copilot, in response to frequently asked questions in breast imaging, assessing their potential to improve patient understanding and facilitate healthcare communication. Methodology We collected the most common questions on breast imaging from clinical practice and posed them to LLMs. We then evaluated the responses in terms of readability and accuracy. Responses from LLMs were analyzed for readability using the Flesch Reading Ease and Flesch-Kincaid Grade Level tests and for accuracy through a radiologist-developed Likert-type scale. Results The study found significant variations among LLMs. Gemini and Microsoft Copilot scored higher on readability scales (p < 0.001), indicating their responses were easier to understand. In contrast, ChatGPT-4 demonstrated greater accuracy in its responses (p < 0.001). Conclusions While LLMs such as ChatGPT-4 show promise in providing accurate responses, readability issues may limit their utility in patient education. Conversely, Gemini and Microsoft Copilot, despite being less accurate, are more accessible to a broader patient audience. Ongoing adjustments and evaluations of these models are essential to ensure they meet the diverse needs of patients, emphasizing the need for continuous improvement and oversight in the deployment of artificial intelligence technologies in healthcare.
Tepe M
,Emekli E
《Cureus》
-
Assessing the Readability of Patient Education Materials on Cardiac Catheterization From Artificial Intelligence Chatbots: An Observational Cross-Sectional Study.
Artificial intelligence (AI) is a burgeoning new field that has increased in popularity over the past couple of years, coinciding with the public release of large language model (LLM)-driven chatbots. These chatbots, such as ChatGPT, can be engaged directly in conversation, allowing users to ask them questions or issue other commands. Since LLMs are trained on large amounts of text data, they can also answer questions reliably and factually, an ability that has allowed them to serve as a source for medical inquiries. This study seeks to assess the readability of patient education materials on cardiac catheterization across four of the most common chatbots: ChatGPT, Microsoft Copilot, Google Gemini, and Meta AI.
A set of 10 questions regarding cardiac catheterization was developed using website-based patient education materials on the topic. We then asked these questions in consecutive order to four of the most common chatbots: ChatGPT, Microsoft Copilot, Google Gemini, and Meta AI. The Flesch Reading Ease Score (FRES) was used to assess the readability score. Readability grade levels were assessed using six tools: Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), Coleman-Liau Index (CLI), Simple Measure of Gobbledygook (SMOG) Index, Automated Readability Index (ARI), and FORCAST Grade Level.
The mean FRES across all four chatbots was 40.2, while overall mean grade levels for the four chatbots were 11.2, 13.7, 13.7, 13.3, 11.2, and 11.6 across the FKGL, GFI, CLI, SMOG, ARI, and FORCAST indices, respectively. Mean reading grade levels across the six tools were 14.8 for ChatGPT, 12.3 for Microsoft Copilot, 13.1 for Google Gemini, and 9.6 for Meta AI. Further, FRES values for the four chatbots were 31, 35.8, 36.4, and 57.7, respectively.
This study shows that AI chatbots are capable of providing answers to medical questions regarding cardiac catheterization. However, the responses across the four chatbots had overall mean reading grade levels at the 11th-13th-grade level, depending on the tool used. This means that the materials were at the high school and even college reading level, which far exceeds the recommended sixth-grade level for patient education materials. Further, there is significant variability in the readability levels provided by different chatbots as, across all six grade-level assessments, Meta AI had the lowest scores and ChatGPT generally had the highest.
Behers BJ
,Vargas IA
,Behers BM
,Rosario MA
,Wojtas CN
,Deevers AC
,Hamad KM
... -
《Cureus》
-
Can artificial intelligence models serve as patient information consultants in orthodontics?
To evaluate the accuracy, reliability, quality, and readability of responses generated by ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot in relation to orthodontic clear aligners.
Frequently asked questions by patients/laypersons about clear aligners on websites were identified using the Google search tool and these questions were posed to ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot AI models. Responses were assessed using a five-point Likert scale for accuracy, the modified DISCERN scale for reliability, the Global Quality Scale (GQS) for quality, and the Flesch Reading Ease Score (FRES) for readability.
ChatGPT-4 responses had the highest mean Likert score (4.5 ± 0.61), followed by Copilot (4.35 ± 0.81), ChatGPT-3.5 (4.15 ± 0.75) and Gemini (4.1 ± 0.72). The difference between the Likert scores of the chatbot models was not statistically significant (p > 0.05). Copilot had a significantly higher modified DISCERN and GQS score compared to both Gemini, ChatGPT-4 and ChatGPT-3.5 (p < 0.05). Gemini's modified DISCERN and GQS score was statistically higher than ChatGPT-3.5 (p < 0.05). Gemini also had a significantly higher FRES compared to both ChatGPT-4, Copilot and ChatGPT-3.5 (p < 0.05). The mean FRES was 38.39 ± 11.56 for ChatGPT-3.5, 43.88 ± 10.13 for ChatGPT-4 and 41.72 ± 10.74 for Copilot, indicating that the responses were difficult to read according to the reading level. The mean FRES for Gemini is 54.12 ± 10.27, indicating that Gemini's responses are more readable than other chatbots.
All chatbot models provided generally accurate, moderate reliable and moderate to good quality answers to questions about the clear aligners. Furthermore, the readability of the responses was difficult. ChatGPT, Gemini and Copilot have significant potential as patient information tools in orthodontics, however, to be fully effective they need to be supplemented with more evidence-based information and improved readability.
Dursun D
,Bilici Geçer R
《BMC Medical Informatics and Decision Making》
-
Evaluating text and visual diagnostic capabilities of large language models on questions related to the Breast Imaging Reporting and Data System Atlas 5(th) edition.
Güneş YC
,Cesur T
,Çamur E
,Günbey Karabekmez L
... -
《Diagnostic and Interventional Radiology》
-
Can AI Answer My Questions? Utilizing Artificial Intelligence in the Perioperative Assessment for Abdominoplasty Patients.
Abdominoplasty is a common operation, used for a range of cosmetic and functional issues, often in the context of divarication of recti, significant weight loss, and after pregnancy. Despite this, patient-surgeon communication gaps can hinder informed decision-making. The integration of large language models (LLMs) in healthcare offers potential for enhancing patient information. This study evaluated the feasibility of using LLMs for answering perioperative queries.
This study assessed the efficacy of four leading LLMs-OpenAI's ChatGPT-3.5, Anthropic's Claude, Google's Gemini, and Bing's CoPilot-using fifteen unique prompts. All outputs were evaluated using the Flesch-Kincaid, Flesch Reading Ease score, and Coleman-Liau index for readability assessment. The DISCERN score and a Likert scale were utilized to evaluate quality. Scores were assigned by two plastic surgical residents and then reviewed and discussed until a consensus was reached by five plastic surgeon specialists.
ChatGPT-3.5 required the highest level for comprehension, followed by Gemini, Claude, then CoPilot. Claude provided the most appropriate and actionable advice. In terms of patient-friendliness, CoPilot outperformed the rest, enhancing engagement and information comprehensiveness. ChatGPT-3.5 and Gemini offered adequate, though unremarkable, advice, employing more professional language. CoPilot uniquely included visual aids and was the only model to use hyperlinks, although they were not very helpful and acceptable, and it faced limitations in responding to certain queries.
ChatGPT-3.5, Gemini, Claude, and Bing's CoPilot showcased differences in readability and reliability. LLMs offer unique advantages for patient care but require careful selection. Future research should integrate LLM strengths and address weaknesses for optimal patient education.
This journal requires that authors assign a level of evidence to each article. For a full description of these Evidence-Based Medicine ratings, please refer to the Table of Contents or the online Instructions to Authors www.springer.com/00266 .
Lim B
,Seth I
,Cuomo R
,Kenney PS
,Ross RJ
,Sofiadellis F
,Pentangelo P
,Ceccaroni A
,Alfano C
,Rozen WM
... -
《-》