CancerLLM: Large Language Model in the Field of Cancer

2024-09-10

In medical natural language processing (NLP), large language models (LLMs) such as ClinicalCamel 70B and Llama3-OpenBioLLM 70B have demonstrated outstanding performance. However, the cancer domain still lacks specialized, efficient models, and these massive models impose heavy computational demands on medical systems. To address this gap, a research team from the University of Minnesota, Yale University, and other institutions has launched CancerLLM, a 7-billion-parameter language model built on the Mistral architecture and designed specifically for cancer, aiming to support cancer care in a smaller, more efficient package.

CancerLLM marks an important step toward intelligent cancer diagnosis and treatment. Through pre-training and fine-tuning, the model not only integrates professional knowledge in the cancer domain but also draws practical clinical experience from over 2.6 million clinical records and 500,000 pathology reports, covering 17 cancer types. On key tasks such as cancer phenotype extraction and diagnosis generation, CancerLLM performs remarkably well, improving F1 score by 7.61% over existing models, and it remains notably robust when handling counterfactual scenarios and spelling errors.

The CancerLLM workflow is carefully designed, from the injection of cancer-specific knowledge through instruction tuning, to improve both practicality and accuracy. Trained on detailed records from 31,465 patients, CancerLLM can accurately identify tumor size, type, and stage; generate accurate diagnostic reports; and even propose personalized treatment plans. Across evaluation metrics such as Exact Match, BLEU-2, and ROUGE-L, it outperforms comparable models, striking a strong balance between resource consumption and generation quality.
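For context, the three generation metrics named above can be sketched in plain Python. These are simplified, illustrative implementations (whitespace tokenization, no smoothing), not the paper's actual evaluation code:

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def exact_match(pred, ref):
    """1.0 iff the strings match after trimming and lowercasing."""
    return float(pred.strip().lower() == ref.strip().lower())

def bleu2(pred, ref):
    """Geometric mean of modified unigram/bigram precision, with brevity penalty."""
    p_tok, r_tok = pred.split(), ref.split()
    precisions = []
    for n in (1, 2):
        p_counts, r_counts = Counter(ngrams(p_tok, n)), Counter(ngrams(r_tok, n))
        overlap = sum((p_counts & r_counts).values())  # clipped n-gram matches
        precisions.append(overlap / max(sum(p_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    bp = 1.0 if len(p_tok) > len(r_tok) else math.exp(1 - len(r_tok) / max(len(p_tok), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 2)

def rouge_l(pred, ref):
    """F-measure over the longest common subsequence of tokens."""
    p, r = pred.split(), ref.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i in range(1, len(p) + 1):
        for j in range(1, len(r) + 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p[i - 1] == r[j - 1] \
                else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(p), lcs / len(r)
    return 2 * prec * rec / (prec + rec)
```

For example, comparing the generated diagnosis "invasive ductal carcinoma of the breast" against the reference "invasive ductal carcinoma of the left breast" yields Exact Match 0, ROUGE-L about 0.92, and BLEU-2 about 0.76, showing how the softer metrics credit near-misses that Exact Match rejects.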
Notably, CancerLLM excels at cancer diagnosis generation. Despite the solid capabilities of baseline models such as BioMistral 7B, CancerLLM surpasses them thanks to its deep domain knowledge and fine-tuning strategy, and it holds its own even against much larger models such as Llama3-OpenBioLLM-70B and ClinicalCamel-70B, once again underscoring the value of domain knowledge in improving model performance.

CancerLLM is also strong at cancer phenotype extraction. Although ClinicalCamel-70B leads in F1 score on this task, its sheer size limits widespread deployment; the compact, efficient CancerLLM maintains excellent performance even in resource-constrained environments and remains competitive with far larger models.

The exploration does not stop here, however. The research team notes room for improvement on more complex diagnosis generation tasks, such as ICD-based diagnostic coding. High-quality data annotation, careful data preprocessing, and targeted optimization for spelling errors and contextual misunderstandings will be key to strengthening CancerLLM's diagnostic capabilities.

In short, CancerLLM injects new vitality into intelligent cancer diagnosis and treatment. It is both an efficient medical LLM and an important force driving precision medicine forward. As the technology matures and applications deepen, there is good reason to believe CancerLLM will play an ever larger role in future cancer care, benefiting more patients.
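To make the phenotype-extraction task concrete, here is a hypothetical rule-based baseline that pulls tumor-size and stage mentions from a clinical sentence. CancerLLM itself is a generative LLM, so this regex sketch only illustrates the task's target outputs; the patterns and the sample note below are assumptions, not material from the paper:

```python
import re

# Hypothetical patterns for two phenotype fields (assumptions for illustration):
SIZE_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)
STAGE_RE = re.compile(r"\bstage\s+(0|I{1,3}V?|IV)[ABC]?\b", re.IGNORECASE)

def extract_phenotypes(note: str) -> dict:
    """Return tumor-size and stage mentions found in a clinical note."""
    sizes = [f"{m.group(1)} {m.group(2).lower()}" for m in SIZE_RE.finditer(note)]
    stages = [m.group(0) for m in STAGE_RE.finditer(note)]
    return {"tumor_size": sizes, "stage": stages}

note = "Pathology shows a 2.3 cm invasive ductal carcinoma, stage IIA."
print(extract_phenotypes(note))
# → {'tumor_size': ['2.3 cm'], 'stage': ['stage IIA']}
```

Rules like these break down exactly where the article says CancerLLM shines: misspellings, unusual phrasing, and counterfactual context, which is why a fine-tuned generative model is the more robust approach.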