Artificial Intelligence Excels in Healthcare: GPT-3.5 and GPT-4 Demonstrate Exceptional Clinical Reasoning

2024-01-29

In a recent study published in the journal npj Digital Medicine, researchers explored the ability of large language models (LLMs) to simulate diagnostic clinical reasoning. LLMs are AI systems trained on large amounts of text data, and they have shown strong performance on tasks such as writing clinical notes and passing medical licensing exams. However, the key to deciding whether they belong in clinical care lies in understanding their clinical diagnostic reasoning abilities.

The study focused primarily on open-ended clinical questions and highlighted the potential of newer large language models such as GPT-4 to work through complex patient cases. Prompt engineering played a crucial role here, because LLM performance varied with the type of prompt and question. The investigation targeted the GPT-3.5 and GPT-4 models, evaluating their diagnostic reasoning on open-ended clinical questions. The researchers hypothesized that GPT models guided by diagnostic reasoning prompts could outperform traditional chain-of-thought (CoT) prompting.

The researchers used a modified version of the MedQA United States Medical Licensing Examination (USMLE) dataset and the New England Journal of Medicine (NEJM) case series as data sources. They examined whether large language models can simulate clinical reasoning skills when specialized prompts combine clinical expertise with advanced prompting techniques. Through prompt engineering, they generated diagnostic reasoning prompts covering cognitive processes such as forming differential diagnoses, analytical reasoning, Bayesian reasoning, and intuitive reasoning, and the questions were stripped of their multiple-choice options and converted into free-text format (a rough sketch of this prompting setup appears below). Only Step 2 and Step 3 questions from the USMLE dataset that assess a patient's diagnosis were selected for evaluation. To assess accuracy, GPT-3.5 was first evaluated on the MedQA training set; the training set contained 95 questions and the test set 518 questions, the latter reserved for the final evaluation. The researchers also evaluated GPT-4 on 310 cases published in the NEJM, excluding 10 cases that lacked a definitive final diagnosis or exceeded GPT-4's maximum context length.

For the NEJM cases, the researchers compared traditional CoT prompting with the clinical diagnostic reasoning CoT prompt that performed best on the MedQA dataset. Each prompt included two example questions whose rationales were written with the target reasoning technique, i.e., few-shot learning. The evaluation used free-text questions from both the USMLE and the NEJM case report series, allowing a rigorous comparison between prompting strategies. The physician authors, attending physicians and an internal medicine resident, evaluated the language models' responses, with each question assessed by two blinded reviewers; a third researcher resolved any discrepancies, and software was used to verify answers where necessary.

The findings indicated that GPT-4's prompts successfully simulated clinical reasoning without compromising diagnostic accuracy. This matters for the trustworthiness of LLMs in patient care: making the models' reasoning inspectable helps overcome their black-box limitations and brings them closer to safe and effective medical use.
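As a rough illustration of the prompting setup described above, the sketch below contrasts a classical chain-of-thought prompt with a differential-diagnosis-style diagnostic reasoning prompt, each carrying two few-shot examples, and submits a free-text case to a chat model. The prompt wording, the `ask` helper, the model name, and the example case are illustrative assumptions, not the study's actual materials.

```python
# Minimal sketch, assuming the OpenAI Python SDK (>=1.0) and an OPENAI_API_KEY
# in the environment. Prompt text and examples are placeholders, not the study's prompts.
from openai import OpenAI

client = OpenAI()

CLASSIC_COT = (
    "Answer the clinical question. Think step by step before giving a final diagnosis.\n\n"
    "Example 1: <question> <step-by-step rationale> Final diagnosis: <answer>\n"
    "Example 2: <question> <step-by-step rationale> Final diagnosis: <answer>\n"
)

DIFFERENTIAL_DX_COT = (
    "Answer like a clinician: first form a differential diagnosis, then weigh the findings\n"
    "for and against each candidate before committing to a final diagnosis.\n\n"
    "Example 1: <question> <differential + reasoning> Final diagnosis: <answer>\n"
    "Example 2: <question> <differential + reasoning> Final diagnosis: <answer>\n"
)

def ask(model: str, strategy_prompt: str, free_text_case: str) -> str:
    """Send one free-text clinical vignette under a given prompting strategy."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output makes the strategies easier to score
        messages=[
            {"role": "system", "content": strategy_prompt},
            {"role": "user", "content": free_text_case},
        ],
    )
    return response.choices[0].message.content

# Hypothetical usage: run the same case under both strategies and compare the rationales.
case = "A 54-year-old presents with fever, a new murmur, and splinter hemorrhages..."
print(ask("gpt-4", CLASSIC_COT, case))
print(ask("gpt-4", DIFFERENTIAL_DX_COT, case))
```

In the study itself the free-text answers were graded by human reviewers; the zero temperature here is only to keep the side-by-side comparison repeatable.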
Under standard CoT prompting, GPT-3.5 accurately answered 46% of the evaluation questions; with non-chain-of-thought prompting it answered 31% accurately. Among the clinical diagnostic reasoning prompts, GPT-3.5 performed best with intuitive reasoning (48% versus 46% for classical CoT). However, compared with classical chain of thought, it performed significantly worse with analytical reasoning prompts (40%) and differential diagnosis prompts (38%), while the difference for Bayesian reasoning (42%) did not reach significance. GPT-4 was clearly more accurate than GPT-3.5, reaching 76%, 77%, 78%, 78%, and 72% with classical chain-of-thought, intuitive reasoning, differential diagnosis, analytical reasoning, and Bayesian reasoning prompts, respectively. In the MedQA evaluation of GPT-4, inter-rater agreement between the blinded reviewers was 99%. GPT-4 also performed respectably on the NEJM dataset, scoring 38% with traditional CoT prompting and 34% with the differential diagnosis reasoning prompt.
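The 99% figure refers to how often the two blinded reviewers agreed on whether an answer was correct. Read as simple percent agreement, with disagreements passed to a third reviewer, the calculation looks like the sketch below; the function, variable names, and toy data are illustrative, not the study's scoring code.

```python
# Minimal sketch: percent agreement between two blinded reviewers' correct/incorrect
# labels, flagging disagreements for a third adjudicator. Data are hypothetical.
from typing import List, Tuple

def percent_agreement(rater_a: List[bool], rater_b: List[bool]) -> Tuple[float, List[int]]:
    """Return the fraction of items scored identically and the indices of disagreements."""
    assert len(rater_a) == len(rater_b), "both reviewers must score the same items"
    disagreements = [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]
    agreement = 1 - len(disagreements) / len(rater_a)
    return agreement, disagreements

# Toy usage: the reviewers disagree on one of ten graded answers (90% agreement).
a = [True, True, False, True, True, False, True, True, True, False]
b = [True, True, False, True, True, True, True, True, True, False]
score, to_adjudicate = percent_agreement(a, b)
print(f"agreement = {score:.0%}; items for the third reviewer: {to_adjudicate}")
```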