Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3.
Abstract:
This study aimed to evaluate the accuracy, comprehensiveness, and readability of responses to clinical uveitis questions generated by four large language models (LLMs): ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Claude 3. Twenty-seven clinical uveitis questions were presented individually to each model. Three experienced uveitis specialists independently assessed the responses for accuracy using a three-point scale across three rounds with a 48-hour wash-out interval. The final accuracy rating for each LLM response ('Excellent', 'Marginal', or 'Deficient') was determined through a majority consensus approach. Comprehensiveness was evaluated using a three-point scale for responses rated 'Excellent' in the final accuracy assessment. Readability was determined using the Flesch-Kincaid Grade Level formula. Statistical analyses were conducted to identify significant differences among the LLMs, with a significance threshold of p < 0.05. Claude 3 and ChatGPT-4.0 demonstrated significantly higher accuracy than Gemini (p < 0.001). Claude 3 had the highest proportion of 'Excellent' ratings (96.3%), followed by ChatGPT-4.0 (88.9%). ChatGPT-3.5, Claude 3, and ChatGPT-4.0 had no responses rated 'Deficient', unlike Gemini (14.8%) (p = 0.014). ChatGPT-4.0 exhibited greater comprehensiveness than Gemini (p = 0.008), as did Claude 3 (p = 0.042). Gemini showed significantly better readability than ChatGPT-3.5, Claude 3, and ChatGPT-4.0 (p < 0.001). Gemini's responses also contained fewer words, letter characters, and sentences than those of ChatGPT-3.5 and Claude 3. Our study highlights the strong performance of Claude 3 and ChatGPT-4.0 in providing precise and thorough information regarding uveitis, surpassing Gemini. ChatGPT-4.0 and Claude 3 emerge as pivotal tools for improving patient understanding of, and involvement in, their uveitis care.
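The abstract names two procedures that are easy to illustrate: a majority consensus over graders' ratings and the Flesch-Kincaid Grade Level formula. The Python sketch below is illustrative only and is not the authors' code. The function names are hypothetical, the syllable counter is a rough vowel-group heuristic (published readability tools use more careful syllable rules), and the abstract does not say how the three rounds were combined or how a three-way disagreement was handled, so the sketch simply returns `None` when no rating has a strict majority.

```python
import re
from collections import Counter


def majority_rating(ratings):
    """Return the label chosen by more than half of the graders, else None.

    Assumption: `ratings` holds one label per grader for a single response.
    """
    label, count = Counter(ratings).most_common(1)[0]
    return label if count > len(ratings) / 2 else None


def count_syllables(word):
    """Approximate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    # Discount a trailing silent 'e' ("care"), but not "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def flesch_kincaid_grade(text):
    """Flesch-Kincaid Grade Level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


if __name__ == "__main__":
    print(majority_rating(["Excellent", "Excellent", "Marginal"]))
    print(flesch_kincaid_grade("Uveitis is inflammation of the uveal tract."))
```

A lower grade level means text that is easier to read, which is the sense in which the abstract reports Gemini's readability as better.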
DOI:
10.1038/s41433-024-03545-9