-
Glaucoma Detection and Feature Identification via GPT-4V Fundus Image Analysis.
The aim is to assess the diagnostic accuracy of GPT-4V (OpenAI) and its capability to identify glaucoma-related features compared with expert evaluations.
Evaluation of multimodal large language models for reviewing fundus images in glaucoma.
A total of 300 fundus images from 3 public datasets (ACRIMA, ORIGA, and RIM-One v3) that included 139 glaucomatous and 161 nonglaucomatous cases were analyzed.
Preprocessing ensured each image was centered on the optic disc. GPT-4's vision-preview model (GPT-4V) assessed each image for various glaucoma-related criteria: image quality, image gradability, cup-to-disc ratio, peripapillary atrophy, disc hemorrhages, rim thinning (by quadrant and clock hour), glaucoma status, and estimated probability of glaucoma. Each image was analyzed twice by GPT-4V to evaluate consistency in its predictions. Two expert graders independently evaluated the same images using identical criteria. Comparisons between GPT-4V's assessments, expert evaluations, and dataset labels were made to determine accuracy, sensitivity, specificity, and Cohen kappa.
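For illustration only, the following is a minimal sketch of how such per-image grading could be requested programmatically with the OpenAI Python SDK; the prompt wording, JSON field names, and response parsing are assumptions for illustration and do not reproduce the study's actual protocol.

```python
import base64
import json
from openai import OpenAI  # assumes openai>=1.0 and an API key in the environment

client = OpenAI()

GRADING_PROMPT = (
    "You are grading a fundus photograph centered on the optic disc. "
    "Return JSON with fields: image_quality, gradable, cup_to_disc_ratio, "
    "peripapillary_atrophy, disc_hemorrhage, rim_thinning_quadrants, "
    "glaucoma_present, glaucoma_probability."
)

def grade_fundus_image(path: str) -> dict:
    """Send one disc-centered fundus image to the vision model and parse its JSON reply."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # preview vision model of the study period; may since have been retired
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": GRADING_PROMPT},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=500,
    )
    # In practice the reply may need cleanup (e.g., stripping markdown fences) before parsing.
    return json.loads(resp.choices[0].message.content)
```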
The main parameters measured were the accuracy, sensitivity, specificity, and Cohen kappa of GPT-4V in detecting glaucoma compared with expert evaluations.
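A minimal sketch of how these agreement metrics can be computed from binary glaucoma labels with scikit-learn is shown below; the labels are hypothetical placeholders, not the study's data.

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score, confusion_matrix

# Hypothetical binary labels: 1 = glaucoma, 0 = no glaucoma
reference_labels = [1, 0, 1, 1, 0, 0, 1, 0]   # dataset labels or an expert grader's calls
gpt4v_labels     = [1, 0, 0, 1, 0, 1, 1, 0]   # GPT-4V's calls for the same images

tn, fp, fn, tp = confusion_matrix(reference_labels, gpt4v_labels).ravel()
accuracy    = accuracy_score(reference_labels, gpt4v_labels)
sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
kappa       = cohen_kappa_score(reference_labels, gpt4v_labels)  # chance-corrected agreement

print(f"acc={accuracy:.2f} sens={sensitivity:.2f} spec={specificity:.2f} kappa={kappa:.2f}")
```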
GPT-4V successfully provided glaucoma assessments for all 300 fundus images across the datasets, although approximately 35% required multiple prompt submissions. GPT-4V's overall accuracy in glaucoma detection across the ACRIMA, ORIGA, and RIM-One v3 datasets (0.68, 0.70, and 0.81, respectively) was slightly lower than that of the expert graders (0.78, 0.80, and 0.88 for expert grader 1 and 0.72, 0.78, and 0.87 for expert grader 2, respectively). In glaucoma detection, GPT-4V showed variable agreement by dataset and expert grader, with Cohen kappa values ranging from 0.08 to 0.72. In terms of feature detection, GPT-4V demonstrated high consistency (repeatability) in image gradability, with an agreement accuracy of ≥89%, and substantial agreement in rim thinning and cup-to-disc ratio assessments, although kappa values were generally lower than those for expert-to-expert agreement.
GPT-4V shows promise as a tool in glaucoma screening and detection through fundus image analysis, demonstrating generally high agreement with expert evaluations of key diagnostic features, although agreement did vary substantially across datasets.
Jalili J, Jiravarnsirikul A, Bowd C, Chuter B, Belghith A, Goldbaum MH, Baxter SL, Weinreb RN, Zangwill LM, Christopher M
... -
《-》
-
Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study.
Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) exams and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored.
This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), specifically focusing on GPT-4V's newly introduced image-understanding feature. By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings.
This cross-sectional study tested GPT-4V, GPT-4, and GPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 clinical knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE; n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V's accuracy was compared with that of 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. Explanation quality was evaluated by asking human raters to choose between an explanation by GPT-4V (without a hint) and an explanation by an expert, or to declare a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V's explanation ability, we modified a patient case report to resemble a typical "curbside consultation" between physicians.
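The abstract does not state how free-text model replies were scored against the answer key; the sketch below shows one plausible approach (function and field names are hypothetical), extracting the chosen option letter and tallying accuracy per exam subset.

```python
import re
from collections import defaultdict

def extract_choice(reply: str) -> str | None:
    """Pull the first standalone option letter (A-E) out of a free-text answer."""
    m = re.search(r"\b([A-E])\b", reply.strip())
    return m.group(1) if m else None

def accuracy_by_subset(records):
    """records: iterable of dicts with 'subset', 'model_reply', and 'correct_option' keys."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["subset"]] += 1
        if extract_choice(r["model_reply"]) == r["correct_option"]:
            hits[r["subset"]] += 1
    return {subset: hits[subset] / totals[subset] for subset in totals}
```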
For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% in Step 1, Step 2 clinical knowledge, Step 3 of USMLE, and DRQCE, respectively. It outperformed GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers often had poor explanation quality: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) had inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V accuracy improved with hints, maintaining stable performance across difficulty levels, while medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately.
GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, which could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings.
Yang Z, Yao Z, Tasmin M, Vashisht P, Jang WS, Ouyang F, Wang B, McManus D, Berlowitz D, Yu H
... -
《JOURNAL OF MEDICAL INTERNET RESEARCH》
-
Evaluating ChatGPT-4's Diagnostic Accuracy: Impact of Visual Data Integration.
In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear.
This study aims to assess the impact of adding image data on ChatGPT-4's diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data.
We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic or pediatric or that lacked image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnosis within their differential diagnosis lists. Two independent physicians evaluated the models' accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis.
The integration of image data did not significantly enhance ChatGPT-4V's diagnostic accuracy: final diagnoses were included in its top 10 differential diagnosis lists in 85.1% of cases (n=309), comparable to the 87.9% (n=319) rate for the text-only version (P=.33). Notably, ChatGPT-4V's performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). Additionally, ChatGPT-4V's self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of the cases.
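As an illustrative recomputation of the top-diagnosis comparison, the χ2 test below uses the counts reported above and treats the two arms as independent proportions; it is a sketch with SciPy, not the authors' analysis code.

```python
from scipy.stats import chi2_contingency

n_cases = 363
top_dx_correct = {"ChatGPT-4V": 161, "ChatGPT-4 (text only)": 203}

# 2x2 contingency table: rows = model, columns = (top diagnosis correct, incorrect)
table = [[hits, n_cases - hits] for hits in top_dx_correct.values()]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")  # p comes out near the reported .002
```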
Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine.
Hirosawa T, Harada Y, Tokumasu K, Ito T, Suzuki T, Shimizu T
... -
《JMIR Medical Informatics》
-
Toward Foundation Models in Radiology? Quantitative Assessment of GPT-4V's Multimodal and Multianatomic Region Capabilities.
Strotzer QD, Nieberle F, Kupke LS, Napodano G, Muertz AK, Meiler S, Einspieler I, Rennert J, Strotzer M, Wiesinger I, Wendl C, Stroszczynski C, Hamer OW, Schicho A
... -
《-》
-
Diagnostic Performance of the Offline Medios Artificial Intelligence for Glaucoma Detection in a Rural Tele-Ophthalmology Setting.
This study assesses the diagnostic efficacy of offline Medios Artificial Intelligence (AI) glaucoma software in a primary eye care setting, using nonmydriatic fundus images from Remidio's Fundus-on-Phone (FOP NM-10). Artificial intelligence results were compared with tele-ophthalmologists' diagnoses and with a glaucoma specialist's assessment for those participants referred to a tertiary eye care hospital.
Prospective cross-sectional study of 303 participants from 6 satellite vision centers of a tertiary eye hospital.
At the vision center, participants underwent comprehensive eye evaluations, including clinical history, visual acuity measurement, slit lamp examination, intraocular pressure measurement, and fundus photography using the FOP NM-10 camera. Medios AI-Glaucoma software analyzed 42-degree disc-centric fundus images, categorizing them as normal, glaucoma, or suspect. Tele-ophthalmologists, glaucoma fellows with a minimum of 3 years of ophthalmology training and 1 year of glaucoma fellowship training, masked to the AI results, remotely diagnosed participants based on the history and disc appearance. All participants labeled as disc suspects or glaucoma by the AI or the tele-ophthalmologists underwent further comprehensive glaucoma evaluation at the base hospital, including clinical examination, Humphrey visual field analysis, and OCT. The AI and tele-ophthalmologist diagnoses were then compared with a glaucoma specialist's diagnosis.
Sensitivity and specificity of Medios AI.
Out of 303 participants, 299 with at least one eye of sufficient image quality were included in the study; the remaining 4 participants did not have sufficient image quality in either eye. Medios AI identified 39 participants (13%) with referable glaucoma. The AI exhibited a sensitivity of 0.91 (95% confidence interval [CI]: 0.71-0.99) and a specificity of 0.93 (95% CI: 0.89-0.96) in detecting referable glaucoma (definite perimetric glaucoma) when compared with the tele-ophthalmologists. Agreement between the AI and the glaucoma specialist was 80.3%, surpassing the 55.3% agreement between the tele-ophthalmologists and the glaucoma specialist among the participants referred to the base hospital. Both the AI and the tele-ophthalmologists relied on fundus photographs for their diagnoses, whereas the glaucoma specialist's assessments at the base hospital were aided by additional tools such as Humphrey visual field analysis and OCT. Furthermore, the AI had fewer false positive referrals (2 out of 10) than the tele-ophthalmologists (9 out of 10).
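A minimal sketch of reporting sensitivity and specificity with exact (Clopper-Pearson) 95% CIs using statsmodels is shown below; the counts are hypothetical placeholders, and the interval method the study actually used is not stated in the abstract.

```python
from statsmodels.stats.proportion import proportion_confint

def rate_with_ci(successes: int, total: int):
    """Point estimate plus exact (Clopper-Pearson) 95% confidence interval."""
    lo, hi = proportion_confint(successes, total, alpha=0.05, method="beta")
    return successes / total, lo, hi

# Hypothetical counts against the reference diagnosis (not the study's actual numbers)
tp, fn = 20, 2     # referable glaucoma correctly / incorrectly flagged
tn, fp = 250, 18   # non-referable cases correctly / incorrectly passed

sens, s_lo, s_hi = rate_with_ci(tp, tp + fn)
spec, c_lo, c_hi = rate_with_ci(tn, tn + fp)
print(f"sensitivity {sens:.2f} (95% CI {s_lo:.2f}-{s_hi:.2f})")
print(f"specificity {spec:.2f} (95% CI {c_lo:.2f}-{c_hi:.2f})")
```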
The offline Medios AI exhibited promising sensitivity and specificity in detecting referable glaucoma from remote vision centers in southern India when compared with tele-ophthalmologists. It also demonstrated better agreement with the glaucoma specialist's diagnosis among the referred participants with referable glaucoma.
Upadhyaya S, Rao DP, Kavitha S, Ballae Ganeshrao S, Negiloni K, Bhandary S, Savoy FM, Venkatesh R
... -
《-》