-
Is Artificial Intelligence ageist?
Aranda Rubio Y, Baztán Cortés JJ, Canillas Del Rey F
《European Geriatric Medicine》
-
Assessment of readability, reliability, and quality of ChatGPT®, BARD®, Gemini®, Copilot®, Perplexity® responses on palliative care.
No study has comprehensively evaluated the readability and quality of the "palliative care" information provided by the artificial intelligence (AI) chatbots ChatGPT®, Bard®, Gemini®, Copilot®, and Perplexity®. This is an observational, cross-sectional original research study. The AI chatbots ChatGPT®, Bard®, Gemini®, Copilot®, and Perplexity® were asked to answer the 100 questions most frequently asked by patients about palliative care. Responses from each of the 5 AI chatbots were analyzed separately. This study did not involve any human participants. The readability assessments of the responses differed significantly among the 5 AI chatbots (P < .05). When the different readability indexes were evaluated holistically, the chatbots ranked, from easiest to most difficult to read, as Bard®, Copilot®, Perplexity®, ChatGPT®, Gemini® (P < .05). The median readability indexes of each chatbot's responses were compared with the "recommended" 6th grade reading level, and statistically significant differences were observed for all formulas (P < .001). The answers of all 5 AI chatbots were at an educational level well above the 6th grade. The modified DISCERN and Journal of American Medical Association scores were highest for Perplexity® (P < .001). Gemini® responses had the highest Global Quality Scale score (P < .001). Patient education materials are recommended to have a 6th grade readability level.
The current answers of all 5 AI chatbots evaluated on palliative care (Bard®, Copilot®, Perplexity®, ChatGPT®, Gemini®) were well above the recommended readability level. Text content quality assessment scores were also low. Both the quality and the readability of the texts should be brought within the recommended limits.
Hancı V, Ergün B, Gül Ş, Uzun Ö, Erdemir İ, Hancı FB, ...
《-》
-
Can artificial intelligence models serve as patient information consultants in orthodontics?
To evaluate the accuracy, reliability, quality, and readability of responses generated by ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot in relation to orthodontic clear aligners.
Questions frequently asked by patients/laypersons about clear aligners were identified on websites using the Google search tool and posed to the ChatGPT-3.5, ChatGPT-4, Gemini, and Copilot AI models. Responses were assessed using a five-point Likert scale for accuracy, the modified DISCERN scale for reliability, the Global Quality Scale (GQS) for quality, and the Flesch Reading Ease Score (FRES) for readability.
ChatGPT-4 responses had the highest mean Likert score (4.5 ± 0.61), followed by Copilot (4.35 ± 0.81), ChatGPT-3.5 (4.15 ± 0.75) and Gemini (4.1 ± 0.72). The difference between the Likert scores of the chatbot models was not statistically significant (p > 0.05). Copilot had significantly higher modified DISCERN and GQS scores than Gemini, ChatGPT-4 and ChatGPT-3.5 (p < 0.05). Gemini's modified DISCERN and GQS scores were significantly higher than those of ChatGPT-3.5 (p < 0.05). Gemini also had a significantly higher FRES than ChatGPT-4, Copilot and ChatGPT-3.5 (p < 0.05). The mean FRES was 38.39 ± 11.56 for ChatGPT-3.5, 43.88 ± 10.13 for ChatGPT-4 and 41.72 ± 10.74 for Copilot, indicating that these responses were difficult to read. The mean FRES for Gemini was 54.12 ± 10.27, indicating that Gemini's responses were more readable than those of the other chatbots.
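To make the FRES figures above easier to interpret, the following is a minimal Python sketch of the standard Flesch Reading Ease formula and its conventional difficulty bands. It is not the tool used by the study authors. The function names are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration, so scores will differ somewhat from dedicated readability software.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption, not a dictionary lookup):
    count vowel groups and discount a silent trailing 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    n = len(groups)
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)

def flesch_reading_ease(text):
    """FRES = 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

def fres_band(score):
    """Map a score to the conventional Flesch difficulty bands."""
    if score >= 90:
        return "very easy"
    if score >= 80:
        return "easy"
    if score >= 70:
        return "fairly easy"
    if score >= 60:
        return "standard"
    if score >= 50:
        return "fairly difficult"
    if score >= 30:
        return "difficult"
    return "very confusing"

# Mean FRES values reported in the abstract above
reported = {"ChatGPT-3.5": 38.39, "ChatGPT-4": 43.88, "Copilot": 41.72, "Gemini": 54.12}
for model, score in reported.items():
    print(model, score, fres_band(score))
```

Under these bands, the three lower means fall in the "difficult" range and Gemini's mean falls in the "fairly difficult" range, which is consistent with the interpretation given in the abstract.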
All chatbot models provided generally accurate, moderately reliable and moderate-to-good quality answers to questions about clear aligners. However, the responses were difficult to read. ChatGPT, Gemini and Copilot have significant potential as patient information tools in orthodontics; to be fully effective, however, they need to be supplemented with more evidence-based information and improved readability.
Dursun D, Bilici Geçer R
《BMC Medical Informatics and Decision Making》
-
The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease.
Large language model (LLM)-linked chatbots may be an efficient source of clinical recommendations for healthcare providers and patients. This study evaluated the performance of LLM-linked chatbots in providing recommendations for the surgical management of gastroesophageal reflux disease (GERD).
Nine patient cases were created based on key questions addressed by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) guidelines for the surgical treatment of GERD. ChatGPT-3.5, ChatGPT-4, Copilot, Google Bard, and Perplexity AI were queried on November 16th, 2023, for recommendations regarding the surgical management of GERD. Accurate chatbot performance was defined as the number of responses aligning with SAGES guideline recommendations. Outcomes were reported with counts and percentages.
Surgeons were given accurate recommendations for the surgical management of GERD in an adult patient for 5/7 (71.4%) KQs by ChatGPT-4, 3/7 (42.9%) KQs by Copilot, 6/7 (85.7%) KQs by Google Bard, and 3/7 (42.9%) KQs by Perplexity according to the SAGES guidelines. Patients were given accurate recommendations for 3/5 (60.0%) KQs by ChatGPT-4, 2/5 (40.0%) KQs by Copilot, 4/5 (80.0%) KQs by Google Bard, and 1/5 (20.0%) KQs by Perplexity, respectively. In a pediatric patient, surgeons were given accurate recommendations for 2/3 (66.7%) KQs by ChatGPT-4, 3/3 (100.0%) KQs by Copilot, 3/3 (100.0%) KQs by Google Bard, and 2/3 (66.7%) KQs by Perplexity. Patients were given appropriate guidance for 2/2 (100.0%) KQs by ChatGPT-4, 2/2 (100.0%) KQs by Copilot, 1/2 (50.0%) KQs by Google Bard, and 1/2 (50.0%) KQs by Perplexity.
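The percentages above follow directly from the reported counts. As a minimal sketch, the snippet below recomputes them for the adult-patient recommendations given to surgeons; the dictionary and function names are chosen here for illustration only.

```python
# (aligned responses, total key questions) as reported for surgeons, adult patient
adult_surgeon_counts = {
    "ChatGPT-4": (5, 7),
    "Copilot": (3, 7),
    "Google Bard": (6, 7),
    "Perplexity": (3, 7),
}

def percent_aligned(aligned, total):
    """Share of key questions whose response matched the guideline, to one decimal."""
    return round(100 * aligned / total, 1)

percentages = {
    model: percent_aligned(aligned, total)
    for model, (aligned, total) in adult_surgeon_counts.items()
}

for model, pct in percentages.items():
    print(f"{model}: {pct}%")
```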
Gastrointestinal surgeons, gastroenterologists, and patients should recognize both the promise and pitfalls of LLMs when utilized for advice on surgical management of GERD. Additional training of LLMs using evidence-based health information is needed.
Huo B, Calabrese E, Sylla P, Kumar S, Ignacio RC, Oviedo R, Hassan I, Slater BJ, Kaiser A, Walsh DS, Vosburg W, ...
《-》
-
Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study.
Artificial intelligence (AI) chatbots are emerging educational tools for students in healthcare science. However, assessing their accuracy is essential prior to adoption in educational settings. This study aimed to assess the accuracy of three AI chatbots (ChatGPT-4, Microsoft Copilot and Google Gemini) in predicting the correct answers on the Italian standardized entrance examination for healthcare science degrees (CINECA test). Secondarily, we assessed the narrative coherence of the AI chatbots' responses (i.e., text output) based on three qualitative metrics: the logical rationale behind the chosen answer, the presence of information internal to the question, and the presence of information external to the question.
An observational cross-sectional study was conducted in September 2023. The accuracy of the three chatbots was evaluated on the CINECA test, in which questions used a multiple-choice structure with a single best answer. The outcome was binary (correct or incorrect). A chi-squared test and a post hoc analysis with Bonferroni correction assessed differences in accuracy among the chatbots. A p-value of < 0.05 was considered statistically significant. A sensitivity analysis was performed, excluding answers that were not applicable (e.g., images). Narrative coherence was analyzed by absolute and relative frequencies of correct answers and errors.
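The post hoc step described above can be illustrated with a short Python sketch. It runs pairwise 2x2 chi-squared tests (Pearson, no continuity correction) and applies a Bonferroni correction by multiplying each p-value by the number of comparisons. This is a sketch under stated assumptions, not the authors' analysis code: the chatbot labels and the correct/incorrect counts are hypothetical placeholders, since the abstract does not report per-chatbot counts.

```python
import math
from itertools import combinations

def chi2_2x2(a_correct, a_wrong, b_correct, b_wrong):
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2 table."""
    observed = [[a_correct, a_wrong], [b_correct, b_wrong]]
    row_totals = [sum(row) for row in observed]
    col_totals = [a_correct + b_correct, a_wrong + b_wrong]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For df = 1, the chi-squared survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# HYPOTHETICAL counts (correct, incorrect); not taken from the study.
results = {
    "chatbot_a": (90, 10),
    "chatbot_b": (88, 12),
    "chatbot_c": (60, 40),
}

pairs = list(combinations(results, 2))
adjusted = {}
for first, second in pairs:
    stat, p = chi2_2x2(*results[first], *results[second])
    # Bonferroni: scale by the number of pairwise comparisons, cap at 1.
    adjusted[(first, second)] = min(1.0, p * len(pairs))

for pair, p_adj in adjusted.items():
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(pair, f"adjusted p = {p_adj:.4g}", verdict)
```

With these placeholder counts, only the comparisons involving the third chatbot remain significant after correction, which mirrors the kind of pairwise pattern such an analysis is meant to reveal.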
Overall, of the 820 CINECA multiple-choice questions inputted into all chatbots, 20 questions were not imported in ChatGPT-4 (n = 808) and Google Gemini (n = 808) due to technical limitations. We found statistically significant differences in the ChatGPT-4 vs Google Gemini and Microsoft Copilot vs Google Gemini comparisons (p-value < 0.001). The narrative coherence of AI chatbots revealed "Logical reasoning" as the prevalent correct answer (n = 622, 81.5%) and "Logical error" as the prevalent incorrect answer (n = 40, 88.9%).
Our main findings reveal that: (A) AI chatbots performed well; (B) ChatGPT-4 and Microsoft Copilot performed better than Google Gemini; and (C) their narrative coherence is primarily logical. Although AI chatbots showed promising accuracy in predicting the correct answer in the Italian standardized university entrance examination, we encourage candidates to incorporate this new technology cautiously, as a supplement to their learning rather than as a primary resource.
Not required.
Rossettini G, Rodeghiero L, Corradi F, Cook C, Pillastrini P, Turolla A, Castellini G, Chiappinotto S, Gianola S, Palese A, ...
《-》