-
Digesting Digital Health: A Study of Appropriateness and Readability of ChatGPT-Generated Gastroenterological Information.
The advent of artificial intelligence-powered large language models capable of generating interactive responses to intricate queries marks a groundbreaking development in how patients access medical information. Our aim was to evaluate the appropriateness and readability of gastroenterological information generated by Chat Generative Pretrained Transformer (ChatGPT).
We analyzed responses generated by ChatGPT to 16 dialog-based queries assessing symptoms and treatments for gastrointestinal conditions and 13 definition-based queries on prevalent topics in gastroenterology. Three board-certified gastroenterologists evaluated output appropriateness with a 5-point Likert-scale proxy measurement of currency, relevance, accuracy, comprehensiveness, clarity, and urgency/next steps. Outputs with a score of 4 or 5 in all 6 categories were designated as "appropriate." Output readability was assessed with the Flesch Reading Ease score, the Flesch-Kincaid Reading Level, and the Simple Measure of Gobbledygook (SMOG) score.
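For reference, the standard published formulas behind the three readability metrics named above (the abstract itself does not restate them) are:
\[ \mathrm{FRE} = 206.835 - 1.015\,\frac{\text{words}}{\text{sentences}} - 84.6\,\frac{\text{syllables}}{\text{words}} \]
\[ \mathrm{FKGL} = 0.39\,\frac{\text{words}}{\text{sentences}} + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59 \]
\[ \mathrm{SMOG} = 1.0430\,\sqrt{\text{polysyllables} \times \frac{30}{\text{sentences}}} + 3.1291 \]
Higher FRE values indicate easier text, whereas FKGL and SMOG approximate the US school grade level required to understand it.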
ChatGPT responses to 44% of the 16 dialog-based and 69% of the 13 definition-based questions were deemed appropriate, and the proportion of appropriate responses within the 2 groups of questions was not significantly different (P = 0.17). Notably, none of ChatGPT's responses to questions related to gastrointestinal emergencies were designated appropriate. The mean readability scores showed that outputs were written at a college-level reading proficiency.
ChatGPT can produce generally appropriate responses to gastroenterological medical queries, but the responses were constrained in appropriateness and readability, which limits the current utility of this large language model. Substantial development is essential before these models can be unequivocally endorsed as reliable sources of medical information.
Toiv A, Saleh Z, Ishak A, Alsheik E, Venkat D, Nandi N, Zuchelli TE, ...
《Clinical and Translational Gastroenterology》
-
Assessing the Quality, Readability, and Acceptability of AI-Generated Information in Plastic and Aesthetic Surgery.
Within plastic surgery, the internet is the most commonly used first point of information for patients before consulting a surgeon. Free-to-use artificial intelligence (AI) tools like ChatGPT (Generative Pre-trained Transformer) are attractive sources of patient information because of their ability to instantaneously answer almost any query. Although relatively new, ChatGPT is now one of the most popular AI conversational software tools. The aim of this study was to evaluate the quality and readability of information given by ChatGPT-4 on key areas in plastic and reconstructive surgery.
The ten plastic and aesthetic surgery topics with the highest worldwide search volume over the past 15 years were identified. These were rephrased into question format to create nine individual questions, which were then input into ChatGPT-4. Response quality was assessed using the DISCERN tool. The readability and grade reading level of the responses were calculated using the Flesch-Kincaid Reading Ease (FKRE) Index and the Coleman-Liau (CL) Index. Twelve physicians working in a plastic and reconstructive surgery unit were asked to rate the clarity and accuracy of the answers on a scale of 1-10 and to state ('yes' or 'no') whether they would share the generated response with a patient.
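For reference, the Coleman-Liau Index used here is conventionally computed from letter and sentence counts rather than syllables (the formula is not given in the abstract):
\[ \mathrm{CLI} = 0.0588\,L - 0.296\,S - 15.8 \]
where L is the average number of letters per 100 words and S is the average number of sentences per 100 words; the result approximates a US grade level.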
All answers were scored as poor or very poor according to the DISCERN tool. The mean DISCERN score for all questions was 34. The responses also scored low in readability and understandability: the mean FKRE index was 33.6, and the mean CL index was 15.6. Clinicians working in plastic and reconstructive surgery rated the answers well for clarity and accuracy; the mean clarity score was 7.38, and the mean accuracy score was 7.4.
This study found that according to validated quality assessment tools, ChatGPT-4 produced low-quality information when asked about popular queries relating to plastic and aesthetic surgery. Furthermore, the information produced was pitched at a high reading level. However, the responses were still rated well in clarity and accuracy, according to clinicians working in plastic surgery. Although improvements need to be made, this study suggests that language models such as ChatGPT could be a useful starting point when developing written health information. With the expansion of AI, improvements in content quality are anticipated.
Keating M, Bollard SM, Potter S
《Cureus》
-
Evaluating ChatGPT to test its robustness as an interactive information database of radiation oncology and to assess its responses to common queries from radiotherapy patients: A single institution investigation.
Commercial vendors have created artificial intelligence (AI) tools for use in all aspects of life and medicine, including radiation oncology. AI innovations will likely disrupt workflows in the field of radiation oncology. However, limited data exist on the quality of radiation oncology information provided by AI-based chatbots. This study aims to assess the accuracy of ChatGPT, an AI-based chatbot, in answering patients' questions during their first visit to the radiation oncology outpatient department and to test ChatGPT's knowledge of radiation oncology.
Ten standard questions commonly asked by patients during outpatient department visits were compiled. A blinded expert opinion was obtained for these ten questions, and the same questions were posed to ChatGPT version 3.5 (ChatGPT 3.5). The expert and ChatGPT answers were independently evaluated for accuracy by three scientific reviewers, and the similarity between the ChatGPT and expert answers was quantified with a response score for each answer. Word counts and Flesch-Kincaid readability scores and grade levels were calculated for the expert and ChatGPT responses, and the two sets of answers were compared on a Likert scale. As a second component of the study, the technical knowledge of ChatGPT was tested: ten multiple-choice questions were framed in increasing order of difficulty (basic, intermediate, and advanced), and ChatGPT's responses were evaluated. Statistical testing was done using SPSS version 27.
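As an illustration of how such word-count and Flesch-Kincaid measurements are typically produced, a minimal Python sketch follows. It assumes the open-source textstat package and uses placeholder answer strings; it is not the authors' analysis pipeline.

import textstat  # pip install textstat

# Placeholder answers; the study's actual expert and ChatGPT responses are not reproduced here.
answers = {
    "expert": "Radiotherapy is usually delivered in short daily sessions over several weeks.",
    "chatgpt": "Radiation therapy is typically given as brief daily treatments across a number of weeks.",
}

for source, text in answers.items():
    words = textstat.lexicon_count(text)         # word count
    fre = textstat.flesch_reading_ease(text)     # higher score = easier to read
    grade = textstat.flesch_kincaid_grade(text)  # approximate US grade level
    print(f"{source}: words={words}, FRE={fre:.1f}, FK grade={grade:.1f}")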
After expert review, the accuracy of expert opinion was 100% and that of ChatGPT was 80% (8/10) for regular questions encountered in outpatient department visits. A noticeable difference was observed in the word count and readability of answers between expert opinion and ChatGPT. Of the ten multiple-choice questions assessing the radiation oncology knowledge base, ChatGPT answered 90% (9 out of 10) correctly. One answer to a basic-level question was incorrect, whereas all answers to intermediate- and advanced-level questions were correct.
ChatGPT provides reasonably accurate information about routine questions encountered in a patient's first outpatient department visit and also demonstrated sound knowledge of the subject. The results of our study can inform the future development of educational tools in radiation oncology and may have implications for other medical fields. This is the first study to provide insight into two potentially useful capabilities of ChatGPT: first, its responses to common queries from patients at outpatient department visits, and second, its radiation oncology knowledge base.
Pandey VK, Munshi A, Mohanti BK, Bansal K, Rastogi K, ...
《-》
-
Harnessing artificial intelligence in bariatric surgery: comparative analysis of ChatGPT-4, Bing, and Bard in generating clinician-level bariatric surgery recommendations.
The formulation of clinical recommendations pertaining to bariatric surgery is essential in guiding healthcare professionals. However, the extensive and continuously evolving body of literature in bariatric surgery presents a considerable challenge for staying abreast of the latest developments and acquiring information efficiently. Artificial intelligence (AI) has the potential to streamline access to the salient points of clinical recommendations in bariatric surgery.
This study aims to appraise the quality and readability of AI chat-generated answers to frequently asked clinical questions in the field of bariatric and metabolic surgery.
Remote.
Question prompts were created based on pre-existing clinical practice guidelines on bariatric and metabolic surgery and were queried into 3 AI large language models (LLMs): OpenAI ChatGPT-4, Microsoft Bing, and Google Bard. The responses from each LLM were entered into a spreadsheet for randomized, blinded duplicate review. Accredited bariatric surgeons in North America independently assessed the appropriateness of each recommendation using a 5-point Likert scale; scores of 4 and 5 were deemed appropriate, while scores of 1-3 indicated a lack of appropriateness. A Flesch Reading Ease (FRE) score was calculated to assess the readability of the responses generated by each LLM.
There was a significant difference between the 3 LLMs in their 5-point Likert scores, with mean values of 4.46 (SD .82), 3.89 (.80), and 3.11 (.72) for ChatGPT-4, Bard, and Bing, respectively (P < .001). There was also a significant difference in the proportion of appropriate answers, with ChatGPT-4 at 85.7%, Bard at 74.3%, and Bing at 25.7% (P < .001). The mean FRE scores for ChatGPT-4, Bard, and Bing were 21.68 (SD 2.78), 42.89 (4.03), and 14.64 (5.09), respectively, with higher scores representing easier readability.
LLM-based AI chat models can effectively generate appropriate responses to clinical questions related to bariatric surgery, though the performance of different models can vary greatly. Therefore, caution should be taken when interpreting clinical information provided by LLMs, and clinician oversight is necessary to ensure accuracy. Future investigation is warranted to explore how LLMs might enhance healthcare provision and clinical decision-making in bariatric surgery.
Lee Y, Shin T, Tessier L, Javidan A, Jung J, Hong D, Strong AT, McKechnie T, Malone S, Jin D, Kroh M, Dang JT, ASMBS Artificial Intelligence and Digital Surgery Task Force, ...
《-》
-
ChatGPT as a patient education tool in colorectal cancer-An in-depth assessment of efficacy, quality and readability.
Artificial intelligence (AI) chatbots such as Chat Generative Pretrained Transformer-4 (ChatGPT-4) have made significant strides in generating human-like responses. Trained on an extensive corpus of medical literature, ChatGPT-4 has the potential to augment patient education materials, and such chatbots may be beneficial to populations facing a diagnosis of colorectal cancer (CRC). However, the accuracy and quality of patient education materials are crucial for informed decision-making. Given workforce demands that impact holistic care, AI chatbots can bridge gaps in CRC information, reaching wider demographics and crossing language barriers; rigorous evaluation is nevertheless essential to ensure accuracy, quality and readability. Therefore, this study aims to evaluate the efficacy, quality and readability of answers generated by ChatGPT-4 on CRC, using patient-style question prompts.
To evaluate ChatGPT-4, eight CRC-related questions were derived from peer-reviewed literature and Google Trends. Eight colorectal surgeons evaluated the AI responses for accuracy, safety, appropriateness, actionability and effectiveness. Quality was assessed using validated tools: the Patient Education Materials Assessment Tool (PEMAT-AI), a modified DISCERN (DISCERN-AI) and the Global Quality Score (GQS). Several readability measures were also calculated, including the Flesch Reading Ease (FRE) score and the Gunning Fog Index (GFI).
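For reference, the Gunning Fog Index reported here is conventionally computed as
\[ \mathrm{GFI} = 0.4\left(\frac{\text{words}}{\text{sentences}} + 100\,\frac{\text{complex words}}{\text{words}}\right) \]
where complex words are those with three or more syllables; a GFI around 12 corresponds roughly to a US high-school senior reading level.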
The responses were generally accurate (median 4.00), safe (4.25), appropriate (4.00), actionable (4.00) and effective (4.00). Quality assessments rated PEMAT-AI as 'very good' (71.43), DISCERN-AI as 'fair' (12.00) and GQS as 'high' (4.00). Readability scores indicated difficulty (FRE 47.00, GFI 12.40), suggesting a higher educational level was required.
This study concludes that ChatGPT-4 is capable of providing safe but nonspecific medical information, suggesting its potential as a patient education aid. However, enhancements in readability through contextual prompting and fine-tuning techniques are required before considering implementation into clinical practice.
Siu AHY, Gibson DP, Chiu C, Kwok A, Irwin M, Christie A, Koh CE, Keshava A, Reece M, Suen M, Rickard MJFX, ...
《-》