Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study.
ChatGPT has shown impressive results on national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it at an expert level. However, little research has examined its performance on the national licensing examinations of low-income countries. In Peru, where almost one in three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education.
We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT.
We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Various prompts were used, and accuracy was evaluated. The performance of ChatGPT was compared to that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, quality of questions, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT's accuracy.
GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was a fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in the crude and adjusted model for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%).
Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between GPT-3.5 and GPT-4. Incorrect answers were associated with question difficulty, which may resemble human performance. Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy.
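The agreement statistic reported above (κ=0.38, conventionally "fair" on the Landis-Koch scale) can be reproduced for any two answer sets with a short Cohen's kappa routine. A minimal Python sketch; the answer vectors below are illustrative, not the study's data:

```python
def cohens_kappa(a, b):
    """Unweighted Cohen's kappa between two equal-length label lists."""
    assert len(a) == len(b)
    n = len(a)
    labels = sorted(set(a) | set(b))
    # Observed agreement: fraction of items where both raters match
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each rater's marginal label frequencies
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# Hypothetical per-question results: 1 = correct, 0 = incorrect
gpt35 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
gpt4  = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]
print(round(cohens_kappa(gpt35, gpt4), 2))  # → 0.41
```

Kappa discounts the agreement expected by chance, which matters here because both models answer most questions correctly, so raw percent agreement alone would overstate concordance.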
Flores-Cohaila JA
,García-Vicente A
,Vizcarra-Jiménez SF
,De la Cruz-Galán JP
,Gutiérrez-Arratia JD
,Quiroga Torres BG
,Taype-Rondan A
... -
《-》
Performance of Artificial Intelligence Chatbots on Glaucoma Questions Adapted From Patient Brochures.
Introduction With the potential for artificial intelligence (AI) chatbots to serve as a primary source of glaucoma information for patients, it is essential to characterize the information that chatbots provide so that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from AI chatbots, including ChatGPT-4, Bard, and Bing, by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and to glaucoma-related American Academy of Ophthalmology (AAO) patient materials. Methods Section headers from AAO glaucoma-related patient education brochures were adapted into question form, and each question was posed five times to each AI chatbot (ChatGPT-4, Bard, and Bing). Three independent glaucoma-trained ophthalmologists scored, on a 1-5 scale, the accuracy of two sets of responses from each chatbot and of the AAO brochure information, as well as the comprehensiveness of the chatbot responses relative to the AAO brochure information. Readability (assessed with the Flesch-Kincaid Grade Level (FKGL), corresponding to United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections. Results Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard, p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard at 3.32, 2.16, and 2.79, respectively.
AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard at 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character counts. Conclusion AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy and comprehensiveness scores below those of AAO brochures and elevated readability levels, AI chatbots require improvements to become a more useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations so that they can ask patients about their existing knowledge and questions and then provide clarifying, comprehensive information.
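The FKGL scores reported above (e.g., 8.11 for AAO materials, 13.01 for ChatGPT) come from a fixed formula over average sentence length and syllables per word. A minimal Python sketch, assuming a naive vowel-group syllable counter (dedicated readability tools use more careful syllabification, so exact scores will differ slightly):

```python
import re

def count_syllables(word):
    """Approximate syllables as runs of consecutive vowels (naive heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    """Flesch-Kincaid Grade Level: 0.39*(words/sentence) + 11.8*(syllables/word) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) \
         + 11.8 * (syllables / len(words)) - 15.59

sample = "Glaucoma damages the optic nerve. Early treatment can slow vision loss."
print(round(fkgl(sample), 2))  # → 8.01
```

A score near 8 corresponds to an eighth-grade reading level, which is why the chatbot scores of 11.7-13.0 are flagged above as less accessible than the AAO brochures.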
Yalla GR
,Hyman N
,Hock LE
,Zhang Q
,Shukla AG
,Kolomeyer NN
... -
《Cureus》
Analysing the Applicability of ChatGPT, Bard, and Bing to Generate Reasoning-Based Multiple-Choice Questions in Medical Physiology.
Background Artificial intelligence (AI) is evolving in the medical education system. ChatGPT, Google Bard, and Microsoft Bing are AI-based models that can solve problems in medical education. However, the applicability of AI to creating reasoning-based multiple-choice questions (MCQs) in the field of medical physiology is yet to be explored. Objective We aimed to assess and compare the applicability of ChatGPT, Bard, and Bing in generating reasoning-based MCQs for MBBS (Bachelor of Medicine, Bachelor of Surgery) undergraduate students on the subject of physiology. Methods The National Medical Commission of India has developed an 11-module physiology curriculum with various competencies. Two physiologists independently chose a competency from each module. A third physiologist prompted all three AIs to generate five MCQs for each chosen competency. The two physiologists who provided the competencies rated the MCQs generated by the AIs on a scale of 0-3 for validity, difficulty, and the reasoning ability required to answer them. We analyzed the average of the two scores using the Kruskal-Wallis test to compare the distributions across the total and module-wise responses, followed by a post-hoc test for pairwise comparisons. We used Cohen's kappa (Κ) to assess the agreement in scores between the two raters. Data are expressed as medians with interquartile ranges, and statistical significance was set at a p-value <0.05. Results ChatGPT and Bard generated 110 MCQs for the chosen competencies, whereas Bing provided only 100 because it failed to generate MCQs for two competencies. The validity of the MCQs was rated as 3 (3-3) for ChatGPT, 3 (1.5-3) for Bard, and 3 (1.5-3) for Bing, showing a significant difference (p<0.001) among the models. The difficulty of the MCQs was rated as 1 (0-1) for ChatGPT, 1 (1-2) for Bard, and 1 (1-2) for Bing, with a significant difference (p=0.006).
The reasoning ability required to answer the MCQs was rated as 1 (1-2) for ChatGPT, 1 (1-2) for Bard, and 1 (1-2) for Bing, with no significant difference (p=0.235). Κ was ≥0.8 for all three parameters across all three AI models. Conclusion AI still needs to evolve to generate reasoning-based MCQs in medical physiology; ChatGPT, Bard, and Bing all showed limitations. Bing generated the least valid MCQs, while ChatGPT generated the least difficult MCQs, and both differences were statistically significant.
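The module-wise comparisons above rest on the Kruskal-Wallis H test, which ranks all ratings jointly and compares the groups' mean ranks. A pure-Python sketch with tie correction, which matters here because 0-3 ordinal ratings are heavily tied; the rating vectors are illustrative, not the study's data:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic with tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Midranks: tied values all receive the average of their 1-based rank range
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # run occupies ranks i+1..j
        i = j
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (n^3 - n)
    ties = {}
    for v in pooled:
        ties[v] = ties.get(v, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in ties.values()) / (n ** 3 - n)
    return h / correction

# Hypothetical validity ratings (0-3) for eight MCQs per chatbot
chatgpt = [3, 3, 3, 2, 3, 3, 3, 3]
bard    = [3, 1, 2, 3, 1, 3, 2, 1]
bing    = [3, 2, 1, 3, 2, 1, 1, 2]
print(round(kruskal_wallis_h(chatgpt, bard, bing), 2))  # → 6.87
```

With k=3 groups, H is compared against a chi-square distribution with 2 degrees of freedom (critical value 5.99 at p=0.05), so an H of 6.87 would indicate a significant difference, after which pairwise post-hoc tests identify which chatbots differ.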
Agarwal M
,Sharma P
,Goswami A
《Cureus》