-
Large Language Models Take on Cardiothoracic Surgery: A Comparative Analysis of the Performance of Four Models on American Board of Thoracic Surgery Exam Questions in 2023.
Objectives Large language models (LLMs), for example, ChatGPT, have performed exceptionally well in various fields. Of note, their success in answering postgraduate medical examination questions has been previously reported, indicating their possible utility in surgical education and training. This study evaluated the performance of four different LLMs on the American Board of Thoracic Surgery's (ABTS) Self-Education and Self-Assessment in Thoracic Surgery (SESATS) XIII question bank to investigate the potential applications of these LLMs in the education and training of future surgeons. Methods The dataset in this study comprised 400 best-of-four questions from the SESATS XIII exam. This included 220 adult cardiac surgery questions, 140 general thoracic surgery questions, 20 congenital cardiac surgery questions, and 20 cardiothoracic critical care questions. The GPT-3.5 (OpenAI, San Francisco, CA) and GPT-4 (OpenAI) models were evaluated, as well as Med-PaLM 2 (Google Inc., Mountain View, CA) and Claude 2 (Anthropic Inc., San Francisco, CA), and their respective performances were compared. The subspecialties included were adult cardiac, general thoracic, congenital cardiac, and critical care. Questions requiring visual information, such as clinical images or radiology, were excluded. Results GPT-4 demonstrated a significant improvement over GPT-3.5 overall (87.0% vs. 51.8% of questions answered correctly, p < 0.0001). GPT-4 also exhibited consistently improved performance across all subspecialties, with accuracy rates ranging from 70.0% to 90.0%, compared to 35.0% to 60.0% for GPT-3.5. When using the GPT-4 model, ChatGPT performed significantly better on the adult cardiac and general thoracic subspecialties (p < 0.0001). Conclusions Large language models, such as ChatGPT with the GPT-4 model, demonstrate impressive skill in understanding complex cardiothoracic surgical clinical information, achieving an overall accuracy rate of nearly 90.0% on the SESATS question bank. Our study shows significant improvement between successive GPT iterations. As LLM technology continues to evolve, its potential use in surgical education, training, and continuous medical education is anticipated to enhance patient outcomes and safety in the future.
Khalpey Z
,Kumar U
,King N
,Abraham A
,Khalpey AH
... -
《Cureus》
-
Stratified Evaluation of GPT's Question Answering in Surgery Reveals Artificial Intelligence (AI) Knowledge Gaps.
Murphy Lonergan R
,Curry J
,Dhas K
,Simmons BI
... -
《Cureus》
-
ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models.
This study aimed to assess the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, in understanding complex surgical clinical information and its potential implications for surgical education and training.
The dataset comprised 280 questions from the Korean general surgery board exams conducted between 2020 and 2022. Both GPT-3.5 and GPT-4 models were evaluated, and their performances were compared using McNemar test.
GPT-3.5 achieved an overall accuracy of 46.8%, while GPT-4 demonstrated a significant improvement with an overall accuracy of 76.4%, indicating a notable difference in performance between the models (P < 0.001). GPT-4 also exhibited consistent performance across all subspecialties, with accuracy rates ranging from 63.6% to 83.3%.
ChatGPT, particularly GPT-4, demonstrates a remarkable ability to understand complex surgical clinical information, achieving an accuracy rate of 76.4% on the Korean general surgery board exam. However, it is important to recognize the limitations of large language models and ensure that they are used in conjunction with human expertise and judgment.
Oh N
,Choi GS
,Lee WY
《-》
-
Comparing the Performance of Popular Large Language Models on the National Board of Medical Examiners Sample Questions.
Large language models (LLMs) have transformed various domains in medicine, aiding in complex tasks and clinical decision-making, with OpenAI's GPT-4, GPT-3.5, Google's Bard, and Anthropic's Claude among the most widely used. While GPT-4 has demonstrated superior performance in some studies, comprehensive comparisons among these models remain limited. Recognizing the significance of the National Board of Medical Examiners (NBME) exams in assessing the clinical knowledge of medical students, this study aims to compare the accuracy of popular LLMs on NBME clinical subject exam sample questions.
The questions used in this study were multiple-choice questions obtained from the official NBME website and are publicly available. Questions from the NBME subject exams in medicine, pediatrics, obstetrics and gynecology, clinical neurology, ambulatory care, family medicine, psychiatry, and surgery were used to query each LLM. The responses from GPT-4, GPT-3.5, Claude, and Bard were collected in October 2023. The response by each LLM was compared to the answer provided by the NBME and checked for accuracy. Statistical analysis was performed using one-way analysis of variance (ANOVA).
A total of 163 questions were queried by each LLM. GPT-4 scored 163/163 (100%), GPT-3.5 scored 134/163 (82.2%), Bard scored 123/163 (75.5%), and Claude scored 138/163 (84.7%). The total performance of GPT-4 was statistically superior to that of GPT-3.5, Claude, and Bard by 17.8%, 15.3%, and 24.5%, respectively. The total performance of GPT-3.5, Claude, and Bard was not significantly different. GPT-4 significantly outperformed Bard in specific subjects, including medicine, pediatrics, family medicine, and ambulatory care, and GPT-3.5 in ambulatory care and family medicine. Across all LLMs, the surgery exam had the highest average score (18.25/20), while the family medicine exam had the lowest average score (3.75/5). Conclusion: GPT-4's superior performance on NBME clinical subject exam sample questions underscores its potential in medical education and practice. While LLMs exhibit promise, discernment in their application is crucial, considering occasional inaccuracies. As technological advancements continue, regular reassessments and refinements are imperative to maintain their reliability and relevance in medicine.
Abbas A
,Rehman MS
,Rehman SS
《Cureus》
-
Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard.
Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India.
This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions.
In this paper, we evaluated the performance of the 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions related to the NEET-2023 exam. The questions of the NEET were provided to these artificial intelligence models, and the responses were recorded and compared against the correct answers from the official answer key. Consensus was used to evaluate the performance of all 3 models.
It was evident that GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. On the other hand, GPT-3.5 managed to meet the qualifying criteria, but with a substantially lower score (145/700, 20.7%). However, Bard (115/700, 16.4%) failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. Conversely, GPT-3.5 attained an accuracy rate of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that the matching responses between GPT-4 and Bard, as well as GPT-4 and GPT-3.5, had higher incidences of being correct, at 0.56 and 0.57, respectively, compared to the matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59.
The study's findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may result in confusion as the compared models (as duos or a trio) tend to agree on only a little over half of the correct responses. Using GPT-4 as one of the compared models will result in higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs' performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments.
Farhat F
,Chaudhry BM
,Nadeem M
,Sohail SS
,Madsen DØ
... -
《-》