-
Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.
Large language models (LLMs) have a potential role in providing adequate patient information.
To compare the quality of LLM responses with that of established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.
Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for the LLMs, again after 8 months. Outcomes included (medical) accuracy, completeness, personalization, readability and, for the LLMs, reproducibility. Comparative analyses were performed within LLMs and within PIRs using Friedman's ANOVA, and between the best-performing LLMs and the gold-standard (GS) PIRs using the Wilcoxon signed-rank test.
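As a minimal illustration of these comparisons (not the authors' actual analysis), the within-group and pairwise tests could be run in Python with SciPy as follows; the per-question scores below are invented placeholders, not study data.

# Illustrative only: hypothetical per-question scores for three LLMs and a gold-standard PIR.
from scipy.stats import friedmanchisquare, wilcoxon

chatgpt35 = [4, 5, 3, 4, 5, 4, 3, 5, 4, 4]
chatgpt40 = [4, 4, 3, 4, 4, 4, 3, 4, 4, 4]
gemini    = [3, 4, 3, 3, 4, 4, 3, 4, 3, 4]
gs_pir    = [5, 4, 4, 5, 4, 5, 4, 4, 5, 5]

# Within-group comparison across related (same-question) samples.
stat, p = friedmanchisquare(chatgpt35, chatgpt40, gemini)
print(f"Friedman ANOVA: chi2={stat:.2f}, p={p:.3f}")

# Pairwise comparison between one LLM and the gold-standard PIR.
stat, p = wilcoxon(chatgpt35, gs_pir)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.3f}")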
Among the LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009), while Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). The PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLMs outperformed the GS-PIR on completeness and personalization, yet they were less accurate and less readable. Over time, response reproducibility decreased for all LLMs, with variability across outcomes.
Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.
Kamminga NCW
,Kievits JEC
,Plaisier PW
,Burgers JS
,van der Veldt AM
,van den Brand JAGJ
,Mulder M
,Wakkee M
,Lugtenberg M
,Nijsten T
... -
《-》
-
Evaluating the Effectiveness of Large Language Models in Providing Patient Education for Chinese Patients With Ocular Myasthenia Gravis: Mixed Methods Study.
Ocular myasthenia gravis (OMG) is a neuromuscular disorder primarily affecting the extraocular muscles, leading to ptosis and diplopia. Effective patient education is crucial for disease management; however, in China, limited health care resources often restrict patients' access to personalized medical guidance. Large language models (LLMs) have emerged as potential tools to bridge this gap by providing instant, AI-driven health information. However, their accuracy and readability in educating patients with OMG remain uncertain.
The purpose of this study was to systematically evaluate the effectiveness of multiple LLMs in the education of Chinese patients with OMG. Specifically, the validity of these models in answering OMG-related patient questions was assessed in terms of accuracy, completeness, readability, usefulness, and safety, and patients' ratings of their usability and readability were analyzed.
The study was conducted in two phases. In the first phase, 130 multiple-choice ophthalmology examination questions were input into 5 different LLMs, and their performance was compared with that of undergraduates, master's students, and ophthalmology residents. In addition, 23 common OMG-related patient questions were posed to 4 LLMs, and their responses were evaluated by ophthalmologists across 5 domains. In the second phase, 20 patients with OMG interacted with the 2 LLMs from the first phase, each asking 3 questions. Patients rated the responses for satisfaction and readability, while ophthalmologists evaluated the responses again across the same 5 domains.
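As a rough sketch of the kind of aggregation this evaluation implies (not the authors' analysis code), per-domain expert ratings could be averaged per model as follows; the model names come from this study, but the scores and rating structure are invented placeholders.

# Illustrative aggregation of three ophthalmologists' ratings (1-5) across the five domains.
from statistics import mean

ratings = {
    "ChatGPT o1-preview": {"accuracy": [5, 4, 5], "completeness": [4, 5, 4],
                           "readability": [4, 4, 3], "usefulness": [5, 4, 5], "safety": [5, 5, 4]},
    "Ernie 3.5":          {"accuracy": [4, 4, 3], "completeness": [3, 4, 4],
                           "readability": [4, 5, 4], "usefulness": [4, 3, 4], "safety": [4, 4, 5]},
}

for model, domains in ratings.items():
    summary = {domain: round(mean(scores), 2) for domain, scores in domains.items()}
    print(model, summary)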
ChatGPT o1-preview achieved the highest accuracy rate of 73% on the 130 ophthalmology examination questions, outperforming the other LLMs as well as the undergraduate and master's student groups. On the 23 common OMG-related patient questions, ChatGPT o1-preview scored highest in correctness (4.44), completeness (4.44), helpfulness (4.47), and safety (4.6). Gemini (Google DeepMind) provided the easiest-to-understand responses in the readability assessment, while GPT-4o produced the most complex responses, suitable for readers with higher education levels. In the second phase with 20 patients with OMG, ChatGPT o1-preview received higher satisfaction scores than Ernie 3.5 (Baidu; 4.40 vs 3.89, P=.002), although Ernie 3.5's responses were slightly more readable (4.31 vs 4.03, P=.01).
LLMs such as ChatGPT o1-preview may have the potential to enhance patient education. Addressing challenges such as misinformation risk, readability issues, and ethical considerations is crucial for their effective and safe integration into clinical practice.
Wei B
,Yao L
,Hu X
,Hu Y
,Rao J
,Ji Y
,Dong Z
,Duan Y
,Wu X
... -
《JOURNAL OF MEDICAL INTERNET RESEARCH》
-
Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study.
Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients.
To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.
LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, with disagreements resolved by a third reviewer. Each question was run three times with each of the three LLMs. Readability was assessed with the Gunning Fog index and the Flesch-Kincaid grade level.
Overall, all three LLM chatbots achieved high average accuracy scores on subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). On objective questions, ChatGPT-4.0 achieved an accuracy rate of 80.8%, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed best in diagnosis, whereas Google Gemini excelled in clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of all three LLM chatbots were significantly higher than the recommended eighth-grade level, far exceeding the reading level of the general population.
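A minimal sketch of how such readability scoring can be reproduced in Python, for example with the textstat package (an assumed tooling choice, not necessarily what the authors used); the sample response text is a placeholder.

# Illustrative readability check against the commonly recommended eighth-grade level.
import textstat  # pip install textstat

response = ("Hepatitis B is a viral infection that attacks the liver and can cause "
            "both acute and chronic disease. Vaccination is the most effective way "
            "to prevent infection.")

fog = textstat.gunning_fog(response)
fk = textstat.flesch_kincaid_grade(response)
print(f"Gunning Fog: {fog:.1f}, Flesch-Kincaid grade: {fk:.1f}")
print("Above eighth-grade level" if max(fog, fk) > 8 else "At or below eighth-grade level")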
Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
Li Y
,Huang CK
,Hu Y
,Zhou XD
,He C
,Zhong JW
... -
《-》
-
Evaluating large language models as patient education tools for inflammatory bowel disease: A comparative study.
Inflammatory bowel disease (IBD) is a global health burden that affects millions of individuals worldwide, necessitating extensive patient education. Large language models (LLMs) hold promise for addressing patient information needs. However, LLM use to deliver accurate and comprehensible IBD-related medical information has yet to be thoroughly investigated.
To assess the utility of three LLMs (ChatGPT-4.0, Claude-3-Opus, and Gemini-1.5-Pro) as a reference point for patients with IBD.
In this comparative study, two gastroenterology experts generated 15 IBD-related questions reflecting common patient concerns. These questions were used to evaluate the performance of the three LLMs. The answers provided by each model were independently assessed by three medical experts in IBD using a Likert scale focusing on accuracy, comprehensibility, and correlation. In parallel, three patients were invited to evaluate the comprehensibility of the models' answers. Finally, a readability assessment was performed.
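The abstract does not name the readability index used; purely as an illustration, a simple index such as the Automated Readability Index (ARI) can be computed directly from character, word, and sentence counts, as in the sketch below (the sample answer is a placeholder).

# Illustrative readability scoring with the Automated Readability Index (ARI).
import re

def ari(text):
    """ARI = 4.71*(characters/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

answer = ("Inflammatory bowel disease is a chronic condition in which the immune "
          "system attacks the lining of the digestive tract, causing inflammation.")
print(f"ARI (approximate US grade level): {ari(answer):.1f}")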
Overall, each of the LLMs achieved satisfactory levels of accuracy, comprehensibility, and completeness when answering IBD-related questions, although their performance varied. All of the investigated models demonstrated strengths in providing basic disease information, such as the definition of IBD, its common symptoms, and diagnostic methods. Nevertheless, for more complex medical advice, such as medication side effects, dietary adjustments, and complication risks, the quality of the answers was inconsistent across the LLMs. Notably, Claude-3-Opus generated answers with better readability than the other two models.
LLMs have potential as educational tools for patients with IBD; however, discrepancies exist between the models. Further optimization and the development of specialized models are necessary to ensure the accuracy and safety of the information provided.
Zhang Y
,Wan XH
,Kong QZ
,Liu H
,Liu J
,Guo J
,Yang XY
,Zuo XL
,Li YQ
... -
《-》
-
Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.
This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses generated by four large language models (LLMs), ChatGPT-3.5, GPT-4.0, Google Gemini, and Claude 3, in the clinical context of uveitis, using a rigorous grading methodology.
Twenty-seven clinical uveitis questions were presented individually to the four LLMs: ChatGPT (GPT-3.5 and GPT-4.0), Google Gemini, and Claude 3. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined by majority consensus. Comprehensiveness was evaluated on a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to discern significant differences among the LLMs, with a significance threshold of p < 0.05.
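A minimal sketch of the majority-consensus step described above, assuming each response carries three specialists' categorical ratings; the helper function and example ratings are illustrative only.

# Illustrative majority vote over three specialists' accuracy ratings per response.
from collections import Counter

def consensus(ratings):
    """Return the rating given by at least two of the three specialists."""
    label, votes = Counter(ratings).most_common(1)[0]
    return label if votes >= 2 else "No majority"  # three-way splits would need adjudication

print(consensus(["Excellent", "Excellent", "Marginal"]))  # -> Excellent
print(consensus(["Deficient", "Marginal", "Deficient"]))  # -> Deficient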
Claude 3 and ChatGPT 4 demonstrated significantly higher accuracy compared to Gemini (p < 0.001). Claude 3 also showed the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT 4 (88.9%). ChatGPT 3.5, Claude 3, and ChatGPT 4 had no responses rated as 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT 4 exhibited greater comprehensiveness compared to Gemini (p = 0.008), and Claude 3 showed higher comprehensiveness compared to Gemini (p = 0.042). Gemini showed significantly better readability compared to ChatGPT 3.5, Claude 3, and ChatGPT 4 (p < 0.001). Gemini also had fewer words, letter characters, and sentences compared to ChatGPT 3.5 and Claude 3.
Our study highlights the strong performance of Claude 3 and ChatGPT 4 in providing accurate and thorough information on uveitis, surpassing Gemini. Both models show promise as tools for improving patients' understanding of, and involvement in, their uveitis care.
Zhao FF
,He HJ
,Liang JJ
,Cen J
,Wang Y
,Lin H
,Chen F
,Li TP
,Yang JF
,Chen L
,Cen LP
... -
《-》