Large Language Models Encode Clinical Knowledge

Med-PaLM paper from Google/DeepMind that evaluates the performance of PaLM and its instruction-tuned variant, Flan-PaLM, on the MultiMedQA benchmark. Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset, notably surpassing the prior state-of-the-art on MedQA (USMLE-style questions) by over 17%. However, human evaluation of its long-form answers reveals important gaps.

To address these gaps, the authors introduce instruction prompt tuning, producing Med-PaLM. While Med-PaLM shows promise, its answers remain inferior to those of clinicians. The study highlights improvements in comprehension, knowledge recall, and medical reasoning with both model scale and instruction prompt tuning, showcasing LLMs' potential in medicine. It also underscores the need for robust evaluation frameworks and continued method development before LLMs can be deployed safely and helpfully in clinical applications.
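A minimal sketch of the instruction prompt tuning idea: a small set of soft prompt vectors is learned while the underlying LM stays frozen, and the hard instruction prompt (clinician-written exemplars in the paper) is supplied as ordinary text. The model name, prompt length, and training setup below are illustrative assumptions, since PaLM itself is not publicly available.

```python
# Sketch of instruction prompt tuning with a frozen decoder-only LM
# via Hugging Face Transformers; all hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"        # stand-in for PaLM/Flan-PaLM
NUM_PROMPT_TOKENS = 20     # small number of learned soft prompt vectors

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.requires_grad_(False)            # the LM weights stay frozen

embed = model.get_input_embeddings()
soft_prompt = nn.Parameter(
    embed.weight[:NUM_PROMPT_TOKENS].clone()   # init from real token embeddings
)

def forward_with_soft_prompt(text: str) -> torch.Tensor:
    """Prepend the learned soft prompt to the embedded hard prompt, run the frozen LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    token_embeds = embed(ids)                                 # (1, seq, dim)
    prompt_embeds = soft_prompt.unsqueeze(0)                  # (1, P, dim)
    inputs = torch.cat([prompt_embeds, token_embeds], dim=1)  # soft prompt first
    return model(inputs_embeds=inputs).logits

# Only the soft prompt receives gradients during tuning; the instruction text and
# exemplar answers are passed in as part of `text`.
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)
```

The design point is parameter efficiency: adapting the model to the clinical domain requires training only the prompt vectors, not the full LM.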

https://arxiv.org/pdf/2212.13138.pdf
