-
Is the information provided by large language models valid in educating patients about adolescent idiopathic scoliosis? An evaluation of content, clarity, and empathy: The perspective of the European Spine Study Group.
Large language models (LLMs) have the potential to bridge knowledge gaps in patient education and enrich patient-surgeon interactions. This study evaluated three chatbots on their ability to deliver empathetic and precise information and management advice related to adolescent idiopathic scoliosis (AIS). Specifically, we assessed the accuracy, clarity, and relevance of the information provided, aiming to determine the effectiveness of LLMs in addressing common patient queries and enhancing their understanding of AIS.
We sourced 20 webpages for the top frequently asked questions (FAQs) about AIS and formulated 10 critical questions based on them. Three advanced LLMs (ChatGPT 3.5, ChatGPT 4.0, and Google Bard) were selected to answer these questions, with responses limited to 200 words. The LLMs' responses were evaluated by a blinded group of experienced deformity surgeons (members of the European Spine Study Group) from seven European spine centers. A pre-established 4-level rating system from excellent to unsatisfactory was used, with further ratings for clarity, comprehensiveness, and empathy on a 5-point Likert scale. If a response was not rated 'excellent', raters were asked to report the reasons for their decision for each question. Lastly, raters answered six questions about their general opinion of AI in healthcare.
Across all LLMs, 26% of responses were rated 'excellent', with ChatGPT-4.0 leading (39%), followed by Bard (17%). ChatGPT-4.0 was rated superior to Bard and ChatGPT 3.5 (p = 0.003). Discrepancies among raters were significant (p < 0.0001), calling inter-rater reliability into question. No substantial differences were noted in the distribution of answers by question (p = 0.43). The answers on the diagnosis (Q2) and causes (Q4) of AIS were rated highest. The greatest dissatisfaction was seen in the answers regarding definitions (Q1) and long-term results (Q7). Exhaustiveness, clarity, empathy, and length of the answers were rated positively (> 3.0 out of 5.0) and did not differ among LLMs. However, GPT-3.5 struggled with language suitability and empathy, while Bard's responses were overly detailed and less empathetic. Overall, raters found that 9% of answers were off-topic and 22% contained clear mistakes.
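For readers who want to reproduce this kind of comparison, the sketch below shows one plausible way (not necessarily the authors' exact analysis) to test whether the distribution of the four rating levels differs across the three chatbots, using a chi-squared test of independence in Python; all counts are invented for illustration.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical rating counts per chatbot (rows) across the four levels
# excellent / satisfactory-minimal / satisfactory-moderate / unsatisfactory (columns).
counts = np.array([
    [27, 25, 12, 6],   # ChatGPT-4.0
    [18, 28, 16, 8],   # ChatGPT-3.5
    [12, 30, 18, 10],  # Google Bard
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")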
Our study offers crucial insights into the strengths and weaknesses of current LLMs in AIS patient and parent education, highlighting the promise of advancements like ChatGPT-4.0 and Gemini alongside the need for continuous improvement in empathy, contextual understanding, and language appropriateness.
Lang S
,Vitale J
,Galbusera F
,Fekete T
,Boissiere L
,Charles YP
,Yucekul A
,Yilgor C
,Núñez-Pereira S
,Haddad S
,Gomez-Rice A
,Mehta J
,Pizones J
,Pellisé F
,Obeid I
,Alanay A
,Kleinstück F
,Loibl M
,ESSG European Spine Study Group
... -
《-》
-
Analyzing Large Language Models' Responses to Common Lumbar Spine Fusion Surgery Questions: A Comparison Between ChatGPT and Bard.
Lang SP
,Yoseph ET
,Gonzalez-Suarez AD
,Kim R
,Fatemi P
,Wagner K
,Maldaner N
,Stienen MN
,Zygourakis CC
... -
《-》
-
Are large language models valid tools for patient information on lumbar disc herniation? The spine surgeons' perspective.
Generative AI is revolutionizing patient education in healthcare, particularly through chatbots that offer personalized, clear medical information. Reliability and accuracy are vital in AI-driven patient education.
How effective are large language models (LLMs), such as ChatGPT and Google Bard, in delivering accurate and understandable patient education on lumbar disc herniation?
Ten frequently asked questions about lumbar disc herniation were selected from an initial pool of 133 and submitted to three LLMs. Six experienced spine surgeons rated the responses on a scale from "excellent" to "unsatisfactory" and evaluated the answers for exhaustiveness, clarity, empathy, and length. Statistical analysis involved Fleiss' kappa, chi-square, and Friedman tests.
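For orientation, the sketch below illustrates on invented data how two of the named tests could be run in Python: Fleiss' kappa for agreement among the six raters (via statsmodels) and a Friedman test for differences across the ten answers (via scipy); the chi-square step would follow the same contingency-table pattern shown in the earlier example. This is an assumption-laden illustration, not the study's actual analysis code.

import numpy as np
from scipy.stats import friedmanchisquare
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)
# Invented ordinal ratings: 30 chatbot answers (rows) rated by 6 surgeons (columns),
# 1 = excellent ... 4 = unsatisfactory.
ratings = rng.integers(1, 5, size=(30, 6))

# Fleiss' kappa: chance-corrected agreement among the six raters.
table, _ = aggregate_raters(ratings)          # answers x rating-category counts
print("Fleiss kappa:", round(fleiss_kappa(table), 3))

# Friedman test: do the ten questions receive systematically different ratings?
# One invented rating per rater per question; each question is one related sample.
per_question = rng.integers(1, 5, size=(6, 10))   # 6 raters x 10 questions
stat, p = friedmanchisquare(*per_question.T)
print(f"Friedman chi2 = {stat:.2f}, p = {p:.3f}")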
Of the responses, 27.2% were rated excellent, 43.9% satisfactory requiring minimal clarification, 18.3% satisfactory requiring moderate clarification, and 10.6% unsatisfactory. There were no significant differences in overall ratings among the LLMs (p = 0.90); however, inter-rater reliability was not achieved, and large differences among raters were detected in the distribution of answer ratings. Overall, ratings varied among the 10 answers (p = 0.043). The average ratings for exhaustiveness, clarity, empathy, and length were all above 3.5 out of 5.
LLMs show potential in patient education for lumbar spine surgery, with generally positive feedback from evaluators. The new EU AI Act, enforcing strict regulation on AI systems, highlights the need for rigorous oversight in medical contexts. In the current study, the variability in evaluations and occasional inaccuracies underline the need for continuous improvement. Future research should involve more advanced models to enhance patient-physician communication.
Lang S
,Vitale J
,Fekete TF
,Haschtmann D
,Reitmeir R
,Ropelato M
,Puhakka J
,Galbusera F
,Loibl M
... -
《-》
-
Benchmarking large language models' performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard.
Large language models (LLMs) are garnering wide interest due to their human-like and contextually relevant responses. However, LLMs' accuracy across specific medical domains has yet to be thoroughly evaluated. Myopia is a frequent topic about which patients and parents commonly seek information online. Our study evaluated the performance of three LLMs, namely ChatGPT-3.5, ChatGPT-4.0, and Google Bard, in delivering accurate responses to common myopia-related queries.
We curated thirty-one commonly asked myopia care-related questions, which were categorised into six domains: pathogenesis, risk factors, clinical presentation, diagnosis, treatment and prevention, and prognosis. Each question was posed to the LLMs, and their responses were independently graded by three consultant-level paediatric ophthalmologists on a three-point accuracy scale (poor, borderline, good). A majority consensus approach was used to determine the final rating for each response. 'Good' rated responses were further evaluated for comprehensiveness on a five-point scale. Conversely, 'poor' rated responses were further prompted for self-correction and then re-evaluated for accuracy.
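To make the consensus step concrete, here is a minimal sketch (with invented labels, not the study's data) of how a majority vote over three graders' three-point ratings could be computed:

from collections import Counter

# Invented grades from three ophthalmologists for three responses.
grades_per_response = [
    ["good", "good", "borderline"],
    ["poor", "borderline", "poor"],
    ["good", "borderline", "poor"],
]

def majority_consensus(grades):
    label, count = Counter(grades).most_common(1)[0]
    # With three graders and three labels a 1/1/1 split is possible;
    # flag it instead of inventing a tie-breaking rule.
    return label if count >= 2 else "no consensus"

final = [majority_consensus(g) for g in grades_per_response]
print(final)   # ['good', 'poor', 'no consensus']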
ChatGPT-4.0 demonstrated superior accuracy, with 80.6% of responses rated as 'good', compared to 61.3% in ChatGPT-3.5 and 54.8% in Google Bard (Pearson's chi-squared test, all p ≤ 0.009). All three LLM-Chatbots showed high mean comprehensiveness scores (Google Bard: 4.35; ChatGPT-4.0: 4.23; ChatGPT-3.5: 4.11, out of a maximum score of 5). All LLM-Chatbots also demonstrated substantial self-correction capabilities: 66.7% (2 in 3) of ChatGPT-4.0's, 40% (2 in 5) of ChatGPT-3.5's, and 60% (3 in 5) of Google Bard's responses improved after self-correction. The LLM-Chatbots performed consistently across domains, except for 'treatment and prevention'. However, ChatGPT-4.0 still performed superiorly in this domain, receiving 70% 'good' ratings, compared to 40% in ChatGPT-3.5 and 45% in Google Bard (Pearson's chi-squared test, all p ≤ 0.001).
Our findings underscore the potential of LLMs, particularly ChatGPT-4.0, for delivering accurate and comprehensive responses to myopia-related queries. Continuous strategies and evaluations to improve LLMs' accuracy remain crucial.
Dr Yih-Chung Tham was supported by the National Medical Research Council of Singapore (NMRC/MOH/HCSAINV21nov-0001).
Lim ZW
,Pushpanathan K
,Yew SME
,Lai Y
,Sun CH
,Lam JSH
,Chen DZ
,Goh JHL
,Tan MCJ
,Sheng B
,Cheng CY
,Koh VTC
,Tham YC
... -
《EBioMedicine》
-
Large Language Models and Empathy: Systematic Review.
Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being's emotions within oneself. In health care, empathy is fundamental to the interaction between health care professionals and patients. It is a quality considered unique to humans that large language models (LLMs) are believed to lack.
We aimed to review the literature on the capacity of LLMs in demonstrating empathy.
We conducted a literature search on MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv between December 2022 and February 2024. We included English-language full-length publications that evaluated empathy in LLMs' outputs. We excluded papers evaluating other topics related to emotional intelligence that were not specifically empathy. The results of the included studies, including the LLMs used, their performance in empathy tasks, and the models' limitations, were summarized along with study metadata.
A total of 12 studies published in 2023 met the inclusion criteria. ChatGPT-3.5 (OpenAI) was evaluated in all studies, with 6 studies comparing it with other LLMs such as GPT-4, LLaMA (Meta), and fine-tuned chatbots. Seven studies focused on empathy within a medical context. The studies reported LLMs to exhibit elements of empathy, including emotion recognition and emotional support in diverse contexts. Evaluation metrics included automatic measures such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU), as well as subjective human evaluation. Some studies compared performance on empathy with humans, while others compared different models with one another. In some cases, LLMs were observed to outperform humans in empathy-related tasks. For example, ChatGPT-3.5 was evaluated for its responses to patients' questions from social media, where ChatGPT's responses were preferred over those of humans in 78.6% of cases. Other studies used scores assigned subjectively by readers. One study reported a mean empathy score of 1.84-1.9 (scale 0-2) for their fine-tuned LLM, while a different study evaluating ChatGPT-based chatbots reported a mean human rating of 3.43 out of 4 for empathetic responses. Other evaluations were based on the Levels of Emotional Awareness Scale, which was reported to be higher for ChatGPT-3.5 than for humans. Another study evaluated ChatGPT and GPT-4 on soft-skills questions in the United States Medical Licensing Examination, where GPT-4 answered 90% of questions correctly. Limitations were noted, including repetitive use of empathic phrases, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and overall subjective evaluation metrics influenced by the evaluator's background.
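For readers unfamiliar with the automatic metrics mentioned above, the fragment below is a deliberately simplified, hand-rolled illustration of a ROUGE-1-style recall score (unigram overlap between a model response and a reference answer); the reviewed studies used full metric implementations, so this is only a sketch of the underlying idea.

from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    # Fraction of reference unigrams that also appear in the candidate (clipped counts).
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)

print(rouge1_recall(
    "i understand this is worrying and i am here to help",
    "i understand that this is worrying for you",
))   # 0.625 on these toy sentences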
LLMs exhibit elements of cognitive empathy, recognizing emotions and providing emotionally supportive responses in various contexts. Since social skills are an integral part of intelligence, these advancements bring LLMs closer to human-like interactions and expand their potential use in applications requiring emotional intelligence. However, there remains room for improvement in both the performance of these models and the evaluation strategies used for assessing soft skills.
Sorin V
,Brin D
,Barash Y
,Konen E
,Charney A
,Nadkarni G
,Klang E
... -
《JOURNAL OF MEDICAL INTERNET RESEARCH》