Abstracting and Indexing

  • PubMed NLM
  • Google Scholar
  • Semantic Scholar
  • Scilit
  • CrossRef
  • WorldCat
  • ResearchGate
  • Academic Keys
  • DRJI
  • Microsoft Academic
  • Academia.edu
  • OpenAIRE
  • Scribd
  • Baidu Scholar

Surgical Domain-Specific LLM Concordance with NASS Guidelines for Adult Neoplastic Vertebral Fractures: A Comparison of Prompt Engineering Approaches

Author(s): Brandon L. Staple, Elijah M. Staple, Cynthia Wallace, Bevan D. Staple

Abstract

In spine care, large language models offer promise for interpreting the complex clinical guidelines of the North American Spine Society (NASS). Clinical practice guidelines (CPGs) are the cornerstone of evidence-based medicine, yet their interpretation and consistent application remain challenging due to complexity, evolving evidence bases, and contextual variability. Standard large language models (sLLMs) perform unreliably when interpreting CPGs, particularly under zero-shot (ZS) prompting, because frequent hallucinations limit their utility in evidence-based medical decision-making. Domain-specific large language models (dLLMs) incorporating Retrieval-Augmented Generation (RAG) offer a promising solution by integrating external medical knowledge. When combined with Knowledge-Infused (KI) prompting, which concatenates relevant recommendation knowledge with the specific guideline question as a contextual prompt, these systems can anchor model responses and reduce hallucinations. This study compares hallucination rates between KI and ZS prompt engineering using Verif.ai, a medically focused, RAG-embedded dLLM. The goal was to generate low-hallucination recommendations aligned with NASS guidelines for diagnosing and treating adults with neoplastic vertebral fractures. Twenty-two guideline questions were reformulated using both prompting strategies and statistically evaluated. KI prompting achieved higher overall concordance (95%) than ZS prompting (73%). Performance differences were most notable in the Definition and Natural History (100% vs. 50%), Interventional Treatment (88% vs. 50%), and Surgical Treatment (100% vs. 75%) categories. KI prompting excelled where guidelines were clear (80% vs. 40%) and remained superior in scenarios with evidence limitations (100% vs. 82%). KI's advantage over ZS is attributed to several factors: incorporating specific clinical evidence and terminology provides contextual anchoring, aligns responses with specialized medical language, and thereby reduces the model's tendency to generate inaccurate or "hallucinated" information. KI also narrows the hypothesis space by constraining the range of responses the model can generate, which improves its ability to communicate evidentiary limitations accurately, particularly in complex or ambiguous clinical scenarios. Integrating Verif.ai's RAG capabilities with KI prompting therefore improves guideline concordance significantly over ZS prompting by minimizing errors in language model-assisted clinical decision-making, a factor pivotal for spine care.
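To make the contrast between the two strategies concrete, the sketch below shows in schematic form how a zero-shot prompt differs from a knowledge-infused prompt that prepends retrieved recommendation text to the question. It is an illustration only: the function names, the sample question, and the placeholder excerpt are assumptions for demonstration, not the actual Verif.ai pipeline, its retriever, or the NASS guideline corpus.

```python
# Illustrative sketch of ZS vs. KI prompt construction. All names here
# (build_zero_shot_prompt, build_knowledge_infused_prompt, the sample
# question and excerpt) are hypothetical placeholders, not the study's code.

from typing import List


def build_zero_shot_prompt(question: str) -> str:
    """ZS prompting: the model receives only the bare guideline question."""
    return f"Question: {question}\nAnswer:"


def build_knowledge_infused_prompt(question: str, passages: List[str]) -> str:
    """KI prompting: retrieved recommendation text is concatenated with the
    question, anchoring the answer to the supplied evidence and narrowing
    the space of plausible responses."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the guideline excerpts below. If the excerpts do "
        "not support a recommendation, state that the evidence is insufficient.\n\n"
        f"Guideline excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = (
        "Is vertebral augmentation recommended for adults with painful "
        "neoplastic vertebral fractures?"
    )
    # In a RAG system these passages would come from a retriever over the
    # guideline corpus; they are hard-coded here purely for illustration.
    passages = ["Excerpt: [relevant NASS recommendation text]"]
    print(build_zero_shot_prompt(question))
    print()
    print(build_knowledge_infused_prompt(question, passages))
```

The explicit instruction to report insufficient evidence reflects the behavior the abstract highlights: constraining the model to the retrieved context makes it likelier to acknowledge evidentiary limitations rather than hallucinate a recommendation.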

Journal Statistics

Impact Factor: 3.123

Acceptance Rate: 75.30%

Time to first decision: 10.4 days

Time from article receipt to acceptance: 2-3 weeks

