Leveraging Large Language Models for Efficient Oncology Information Extraction: A Technical Tutorial

Introduction to the LLM-AIx Pipeline

In the field of medical oncology, a significant amount of critical patient information is locked away in unstructured text formats, such as clinical letters and procedure reports. This format presents a substantial barrier to quantitative analysis, making it difficult to derive actionable insights efficiently. Traditional methods of manual review or structured information retrieval are not only labor-intensive but also costly. However, the advent of Large Language Models (LLMs) has opened up new avenues in Natural Language Processing (NLP), offering powerful capabilities for structured Information Extraction (IE) from free-form medical text.

This document serves as a technical tutorial for the LLM-AIx pipeline, a novel workflow designed to extract predefined clinical entities from unstructured oncology text. A key feature of LLM-AIx is its utilization of privacy-preserving LLMs, ensuring that sensitive patient data is handled securely. This protocol directly addresses a significant hurdle in clinical research and patient care by enabling efficient and accurate information extraction, which in turn supports better decision-making and facilitates large-scale data analysis. Crucially, the pipeline is designed to run on local hospital infrastructure, thereby eliminating the need to transfer patient data externally, a vital consideration for data privacy and security.

We demonstrate the practical utility of the LLM-AIx pipeline through a specific use case: extracting TNM stage information from 100 pathology reports sourced from The Cancer Genome Atlas (TCGA). The pipeline is engineered for accessibility, requiring no programming skills and offering a user-friendly interface that allows for rapid, structured data extraction from clinical free text.

Integrating Open-Source Large Language Models

A significant advantage of the LLM-AIx pipeline is its inherent flexibility in integrating the latest open-source LLMs. This facilitates not only the deployment of powerful AI capabilities but also robust benchmarking of different models: users can evaluate various LLMs to determine their accuracy in extracting the specific entities and information relevant to their own use cases. Currently, the pipeline supports all models available in the GPT-Generated Unified Format (GGUF). This includes a wide array of popular and capable models such as Llama-2 (7B and 70B parameters), Llama-3.1 (8B and 70B parameters), Llama-2 “Sauerkraut” (70B), Phi-3, Mistral (7B), and many others. This broad compatibility ensures that users can leverage the most suitable model for their specific tasks and computational resources.

Furthermore, the pipeline is designed to produce outputs in a Comma-Separated Values (CSV) format. This standardized output format is crucial for enabling seamless integration with existing data analysis tools and established workflows. By providing data in a widely compatible format, LLM-AIx facilitates quantitative analysis without imposing the need for specialized computational skills, thereby democratizing access to advanced data processing capabilities.
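Because the output is plain CSV, it can be opened directly in Excel, R, or any statistics package. As a minimal sketch of downstream use in Python, the snippet below loads the export with pandas; the file name and column names (e.g., t_stage, n_stage) are hypothetical placeholders that depend on the grammar defined for the extraction run.

```python
# Minimal sketch: loading LLM-AIx output into pandas for downstream analysis.
# File and column names are hypothetical; they depend on the feature names
# defined in the grammar for the extraction run.
import pandas as pd

df = pd.read_csv("llm_aix_output.csv")

# Quick overview of the extracted entities
print(df.head())
print(df["t_stage"].value_counts(dropna=False))

# Example: cross-tabulate extracted T-stage against N-stage
print(pd.crosstab(df["t_stage"], df["n_stage"]))
```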

Pipeline Validation and Performance

To rigorously validate the efficacy of the LLM-AIx protocol, comprehensive experiments were conducted across diverse datasets, encompassing multiple languages and clinical settings. Each use case was meticulously designed to test and showcase the protocol’s proficiency in accurately and efficiently transforming unstructured text into structured data, while simultaneously addressing specific clinical questions. The performance evaluation focused on key metrics including accuracy, sensitivity, specificity, F1-score, and precision. Data integrity and patient privacy were maintained throughout these validation procedures. All research activities adhered strictly to the Declaration of Helsinki, and ethics approval was obtained from the ethics committee of Technical University Dresden (reference number BO-EK-400092023).
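To make these metrics concrete, the sketch below compares LLM predictions against a human-annotated ground truth for a single binary entity; the file names, the join key, and the column names are hypothetical placeholders, and sensitivity and specificity are derived from the confusion matrix.

```python
# Sketch: computing accuracy, sensitivity, specificity, precision, and F1-score
# for one binary entity (e.g., "resection margin tumor-free": yes/no).
# File names, join key, and column names are hypothetical placeholders.
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

pred = pd.read_csv("llm_aix_output.csv")    # LLM predictions
truth = pd.read_csv("ground_truth.csv")     # human annotations
merged = truth.merge(pred, on="report_id", suffixes=("_true", "_pred"))

y_true = merged["margin_tumor_free_true"].map({"yes": 1, "no": 0})
y_pred = merged["margin_tumor_free_pred"].map({"yes": 1, "no": 0})

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))   # TP / (TP + FN)
print("Specificity:", tn / (tn + fp))                 # TN / (TN + FP)
print("Precision:  ", precision_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
```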

Fictitious Test Cases for Validation and Reuse

To facilitate broader adoption and validation, the LLM-AIx pipeline incorporates fictitious test cases. These cases are designed to be representative of real-world scenarios, allowing users to test and reuse the pipeline for various information extraction tasks. Detailed descriptions of these procedures and their results are available, enabling users to understand the pipeline’s capabilities and limitations. The availability of these test cases, alongside the open-source nature of the software, promotes transparency and encourages community-driven improvements.

High Accuracy in Extracting Tumor Stage from Pathology Reports

A significant demonstration of the LLM-AIx pipeline’s capability lies in its performance in extracting crucial tumor staging information from pathology reports. In an experiment utilizing 100 pathology reports from The Cancer Genome Atlas (TCGA), the pipeline, powered by the Llama 3.1 70B parameter model, achieved an overall accuracy of 87% across all extracted variables. Specifically, the extraction of the T-stage demonstrated an accuracy of 89% (F1 Score 0.57, Precision 52%, Recall 68%). The N-stage was extracted with 92% accuracy (F1 Score 0.86, Precision 85%, Recall 87%). The M-stage extraction yielded an accuracy of 82% (F1 Score 0.69, Precision 68%, Recall 93%).

Beyond the TNM staging, the pipeline also excelled in extracting other critical details. The number of lymph nodes examined was correctly extracted in 87% of cases, and the number of lymph nodes positive for cancer cells was accurately identified in 90% of cases. The determination of whether the resection margin was tumor-free was achieved with 86% accuracy (F1 Score 0.92, Precision 87%, Recall 99%, FPR 93%, FNR 1%). Furthermore, the extraction of information regarding the presence or absence of lymphatic invasion reached an accuracy of 86% (F1 Score 0.82, Precision 70%, Recall 100%, FPR 21%, FNR 0%).

A key aspect of this experiment was the use of quantized models. These models integrate seamlessly with our llama.cpp-based pipeline and significantly reduce computational resource requirements. Notably, the 4-bit quantized Llama 3.1 70B model delivered accuracy comparable to the full-precision version. This reduction in memory usage, from 139 GB to 43 GB for the TCGA dataset example, makes the pipeline significantly more feasible for environments with limited computational resources, such as local hospital settings.
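The scale of this saving follows directly from the number of bits stored per weight. As a rough back-of-the-envelope check (assuming ~70 billion parameters, 16 bits per weight at full precision, and roughly 4-5 bits per weight for a 4-bit GGUF variant once quantization scales and metadata are included), the estimate below reproduces the order of magnitude reported above; exact file sizes depend on the specific quantization variant.

```python
# Back-of-the-envelope estimate of the model memory footprint.
# Assumptions (not exact pipeline figures): ~70e9 parameters, 16 bits/weight
# at full precision, ~4.8 bits/weight for a 4-bit GGUF quantization
# once scales and metadata are included.
params = 70e9

full_precision_gb = params * 16 / 8 / 1e9   # ~140 GB
quant_4bit_gb = params * 4.8 / 8 / 1e9      # ~42 GB

print(f"Full precision:  ~{full_precision_gb:.0f} GB")
print(f"4-bit quantized: ~{quant_4bit_gb:.0f} GB")
```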

Identifying Main Error Drivers

Through rigorous analysis, the primary drivers of errors within the information extraction process were identified. These predominantly included conflicting data within the source reports and failures in Optical Character Recognition (OCR) when processing scanned documents. Understanding these error sources is crucial for ongoing pipeline refinement and for setting realistic expectations regarding extraction accuracy. Strategies to mitigate these issues, such as improving OCR pre-processing steps and developing more robust methods for handling ambiguous or conflicting information, are key areas for future development.

Overview of the LLM-AIx Protocol Workflow

The LLM-AIx protocol follows a structured workflow designed for clarity and efficiency. This workflow, visualized in Figure 1, encompasses several key stages:

  • Data Preprocessing: This initial stage involves preparing the input text data. It supports various file formats, including TXT, PDF, CSV, and Excel. For image-based documents like PDFs, Optical Character Recognition (OCR) may be employed. The system can automatically convert and compile diverse data formats into a uniform CSV format, ensuring consistency for subsequent analysis. Documents may also be split into smaller chunks if their length exceeds the context window of the chosen LLM.
  • LLM Settings Configuration: Users define critical parameters for the LLM inference. This includes selecting the desired LLM from a GGUF-compatible list, specifying hyperparameters such as temperature and the number of tokens to predict, and crafting a detailed prompt. The prompt engineering is crucial, often involving a two-part structure: providing background context to the model and clearly instructing it on the specific extraction task. For complex tasks, few-shot examples can be incorporated into the prompt to leverage the LLM’s in-context learning capabilities.
  • Grammar Specification: To ensure a consistent and structured output, users can define a JSON schema, referred to as the "Grammar," which dictates the desired output format. The pipeline includes a "Grammar Builder" tool that allows users to define feature names and their possible values (string, boolean, categories, number) without manual JSON editing, thereby reducing errors. This grammar configuration can be saved and loaded for reuse; an illustrative prompt and schema are sketched after this list.
  • LLM-Based Information Extraction: Once all settings are configured and the preprocessed data is uploaded, the LLM processing begins. The selected model is loaded onto the local GPU, and the information extraction commences. Progress is indicated via a progress bar, and upon completion, a ZIP file containing the output CSV with LLM predictions and meta-information is generated.
  • Evaluation: The pipeline offers robust evaluation capabilities. Users can compare the LLM’s output against a human-annotated ground truth dataset. This comparison automatically generates performance metrics, including accuracy, sensitivity, specificity, and F1-scores, along with confusion matrices. These metrics provide a quantitative assessment of the pipeline’s performance for the specific task.
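To make the LLM settings and grammar stages concrete, the sketch below shows a hypothetical two-part prompt and a JSON schema for a TNM extraction task, written as a Python dictionary and serialized with json.dumps. The prompt wording, feature names, and allowed categories are illustrative assumptions rather than the pipeline’s fixed vocabulary; in practice, the Grammar Builder generates an equivalent schema without manual JSON editing.

```python
# Sketch of a configuration an LLM-AIx run might use for TNM extraction.
# Prompt wording, feature names, and categories are illustrative assumptions;
# the Grammar Builder produces an equivalent schema through the GUI.
import json

prompt = (
    # Part 1: background context for the model
    "You are assisting with structured data extraction from pathology reports. "
    "The report describes a surgically resected tumor specimen.\n"
    # Part 2: the concrete extraction instruction
    "Extract the pathological TNM stage and the number of examined and "
    "positive lymph nodes. Answer strictly in the requested JSON format."
)

grammar = {
    "type": "object",
    "properties": {
        "t_stage": {"type": "string", "enum": ["T1", "T2", "T3", "T4", "TX"]},
        "n_stage": {"type": "string", "enum": ["N0", "N1", "N2", "N3", "NX"]},
        "m_stage": {"type": "string", "enum": ["M0", "M1", "MX"]},
        "lymph_nodes_examined": {"type": "integer"},
        "lymph_nodes_positive": {"type": "integer"},
        "margin_tumor_free": {"type": "boolean"},
    },
    "required": ["t_stage", "n_stage", "m_stage"],
}

print(prompt)
print(json.dumps(grammar, indent=2))
```

For complex tasks, one or two few-shot examples (a short report excerpt followed by the expected JSON answer) can be appended to the prompt to leverage in-context learning, as noted above.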

Setting Up the Pipeline: Equipment and Software

The LLM-AIx pipeline is designed for accessibility, with options for both Docker-based and manual setup, ensuring it can be implemented across various technical environments.

Option 1: Docker Pipeline Setup

For a streamlined setup, the pipeline can be deployed using Docker. This approach encapsulates all dependencies within a container, simplifying installation and ensuring consistency across different systems. The process involves:

  1. Editing docker-compose.yml: Users need to modify the docker-compose.yml file to specify the correct path to their downloaded LLM model files.
  2. Running the Docker Image: After configuring the docker-compose.yml file, the user can initiate the pipeline by running the Docker image as described in the project’s README.md file. This command typically involves using docker-compose up.

This method is recommended for users who prefer a quick and isolated setup, minimizing potential conflicts with existing system software.

Option 2: Manual Pipeline Setup

For users who prefer more control or have specific environmental configurations, a manual setup is also supported. This involves:

  1. Creating a Virtual Environment: It is recommended to create a dedicated Python virtual environment to manage dependencies.
  2. Installing Python Packages: All necessary Python packages, as detailed in the README.md file on the GitHub repository, must be installed within the virtual environment.
  3. Installing Additional Software: Depending on the specific functionalities required, additional software packages such as Tesseract (for OCR) and llama.cpp (for LLM inference) may need to be installed. The README.md file provides comprehensive instructions for these installations.

The pipeline and its Python packages require Python 3.12 or later.

Model Download

Regardless of the setup method chosen (Docker or manual), the first crucial step is downloading the desired LLM models in GGUF format. These models can be obtained from various sources, including Hugging Face. Popular choices include:

  • Meta-Llama-3.1-8B-Instruct-GGUF
  • Meta-Llama-3-8B-GGUF
  • Llama-2-7B-GGUF
  • Llama-2-70B-GGUF
  • microsoft/Phi-3-mini-4k-instruct-gguf

These models should be downloaded and stored in a designated local directory accessible by the pipeline.
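As one convenient way to obtain such a file, the huggingface_hub package can download a single GGUF file from a model repository; the repository ID below is taken from the list above, but the file name is a placeholder assumption and must match a file actually listed on the model card.

```python
# Sketch: downloading a single GGUF model file from Hugging Face.
# The file name is a placeholder; check the model card for the exact name.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="microsoft/Phi-3-mini-4k-instruct-gguf",
    filename="Phi-3-mini-4k-instruct-q4.gguf",  # placeholder file name
    local_dir="models",
)
print("Model stored at:", model_path)
```

Alternatively, GGUF files can be downloaded manually through a browser and placed in the same local directory.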

Software Interface and Usability

The LLM-AIx pipeline is designed with a strong emphasis on user-friendliness, particularly for those without extensive programming backgrounds. The core functionality is accessible through a graphical user interface (GUI) that simplifies the complex process of information extraction.

For users who prefer a rapid setup, the pipeline is available as a Docker image. This image includes all necessary dependencies, simplifying the deployment process. The only external requirements are the LLM model files in GGUF format, which need to be downloaded separately.

For users opting for a manual setup, the process involves installing the required Python packages and additional software like Tesseract and llama.cpp, as detailed in the GitHub repository.
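To illustrate what the Tesseract dependency is used for, the minimal sketch below extracts text from a scanned report page via the pytesseract bindings; the file name is a placeholder, and the pipeline’s own preprocessing performs this step automatically for image-based PDFs.

```python
# Minimal OCR sketch using the pytesseract bindings to Tesseract.
# The input file name is a placeholder; LLM-AIx performs this step
# internally for image-based PDFs during data preprocessing.
from PIL import Image
import pytesseract

text = pytesseract.image_to_string(Image.open("scanned_report_page.png"), lang="eng")
print(text[:500])  # first 500 characters of the recognized text
```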

