Automatic Personalized Impression Generation for PET Reports Using Large Language Models (LLMs)

Task:  Automatic Impression Generation for PET Reports Using Large Language Models (LLMs)

Description: Twelve open-source language models were fine-tuned on a corpus of 37,370 retrospective PET reports collected at the University of Wisconsin, Madison, WI, USA, between 2010 and 2022. All models were trained with the teacher-forcing algorithm, using the report findings and patient information as input and the original clinical impressions as reference. An extra input token encoded the reading physician’s identity, allowing the models to learn physician-specific reporting styles.
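A minimal sketch of this setup is shown below, assuming a Hugging Face PEGASUS checkpoint and hypothetical physician-identifier tokens; the checkpoint name, token names, and template text are illustrative, not the study’s exact configuration.

```python
# Sketch: teacher-forcing fine-tuning of an encoder-decoder model where an
# extra special token in the input encodes the reading physician's identity.
# "google/pegasus-large" and "[PHYSICIAN_3]" are placeholder assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/pegasus-large"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Register one identifier token per reading physician so each maps to its own embedding.
physician_tokens = [f"[PHYSICIAN_{i}]" for i in range(1, 4)]  # hypothetical IDs
tokenizer.add_special_tokens({"additional_special_tokens": physician_tokens})
model.resize_token_embeddings(len(tokenizer))

# One illustrative training example: findings + indication as input,
# the original clinical impression as the teacher-forcing target.
source = (
    "PET/CT skull base to thigh.\n"
    "[PHYSICIAN_3]\n"
    "Findings: ...\n"
    "Indication: ..."
)
target = "Impression: ..."

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids

# Teacher forcing: the decoder conditions on the gold impression tokens and is
# trained with cross-entropy against the next gold token.
outputs = model(**inputs, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in a real training loop
```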

Models Comparative Evaluation: To compare the performance of the different LLMs, 30 automatic evaluation metrics were computed and benchmarked against physician preferences. The best-performing LLM’s impressions were then evaluated for their clinical utility, against the original clinical impressions, by three nuclear medicine physicians across 6 quality dimensions (3-point scales) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians.
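The sketch below illustrates, with placeholder numbers rather than study data, how a metric can be benchmarked against physician preferences via a Spearman rank correlation.

```python
# Sketch: correlating an automatic metric with physician preference scores.
# The score lists are placeholder values, not data from the study.
from scipy.stats import spearmanr

metric_scores = [0.42, 0.31, 0.55, 0.48, 0.37]   # e.g., one metric per impression
physician_prefs = [3, 2, 5, 4, 2]                # physician preference ratings

rho, p_value = spearmanr(metric_scores, physician_prefs)
print(f"Spearman's rho = {rho:.3f} (P = {p_value:.3f})")
```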

The 12 fine-tuned LLMs evaluated in the study are listed below:

The categories of all 30 evaluation metrics employed for the LLM performance comparison are summarized below:

| Category | Definition | Corresponding Evaluation Metrics |
| --- | --- | --- |
| Lexical overlap-based metrics | These metrics measure the overlap between the generated text and the reference in terms of textual units, such as n-grams or word sequences | ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L, ROUGE-LSUM, BLEU, CHRF, METEOR, CIDEr |
| Embedding-based metrics | These metrics measure the semantic similarity between the generated and reference texts using pretrained embeddings | ROUGE-WE-1, ROUGE-WE-2, ROUGE-WE-3, BERTScore, MoverScore |
| Graph-based metrics | These metrics construct graphs from entities and their relations extracted from the sentences, and evaluate the summary based on these graphs | RadGraph |
| Text generation-based metrics | These metrics assess the quality of generated text by framing evaluation as a text generation task using sequence-to-sequence language models | BARTScore, BARTScore + PET, PEGASUSScore + PET, T5Score + PET, PRISM |
| Supervised regression-based metrics | These metrics require human annotations to train a parametrized regression model to predict human judgments for the given text | S3-pyr, S3-resp |
| Question answering-based metrics | These metrics formulate the evaluation process as a question-answering task by guiding the model with various questions | UniEval |
| Reference-free metrics | These metrics do not require the reference text to assess the quality of the generated text; instead, they compare the generated text against the source document | SummaQA, BLANC, SUPERT, Stats-compression, Stats-coverage, Stats-density, Stats-novel trigram |
Categories of the evaluation metrics employed for the LLM performance comparison. Note that 17 distinct evaluation methods are used to assess model performance; since each method may encompass multiple variants, there are 30 metrics in total (obtained from the comparative study’s reference publication).
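As a concrete illustration of two of the metric families in the table, the sketch below computes ROUGE and BERTScore for a single generated/reference pair using common open-source packages; it is a stand-in, not the study’s 30-metric benchmark code.

```python
# Sketch: one lexical overlap-based metric (ROUGE) and one embedding-based
# metric (BERTScore) on a placeholder impression pair.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

generated = "No evidence of FDG-avid disease."
reference = "No hypermetabolic disease is identified."

# Lexical overlap-based: ROUGE-1 and ROUGE-L
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
print(scorer.score(reference, generated))

# Embedding-based: BERTScore (precision, recall, F1)
P, R, F1 = bert_score([generated], [reference], lang="en")
print(f"BERTScore F1 = {F1.item():.3f}")
```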

The implementation methods employed by the comparative evaluation study are shared on GitHub:

Both encoder-decoder and decoder-only language models were evaluated. Given their different architectures, the input templates were customized as illustrated in the figure below. For encoder-decoder models, the first line of each input describes the category of the PET scan, while the second line encodes the reading physician’s identity using an identifier token.

Formatting of reports for input to encoder-decoder and decoder-only models. For encoder-decoder models, the first two lines describe the examination category and encode the reading physician’s identity. “Findings” contains the clinical findings from the PET report, and “Indication” includes the patient’s background information. For decoder-only models, instruction tuning was employed, with the prompt adapted from Alpaca (obtained from the comparative study’s publication).
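A rough sketch of the two input formats is given below, with placeholder field values; the Alpaca-style instruction wording is an assumption based on the figure description, not a verbatim copy of the study’s prompt.

```python
# Sketch: building the two input templates described above.
def encoder_decoder_input(category, physician_token, findings, indication):
    # First line: examination category; second line: physician identifier token.
    return (
        f"{category}\n"
        f"{physician_token}\n"
        f"Findings: {findings}\n"
        f"Indication: {indication}"
    )

def decoder_only_prompt(category, physician_token, findings, indication):
    # Alpaca-style instruction tuning: instruction + input + empty response slot.
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        "### Instruction:\n"
        f"Generate an impression for this {category} report in the style of "
        f"{physician_token}.\n\n"
        "### Input:\n"
        f"Findings: {findings}\nIndication: {indication}\n\n"
        "### Response:\n"
    )

print(encoder_decoder_input("PET/CT skull base to thigh", "[PHYSICIAN_3]", "...", "..."))
```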

Results: On average, the fine-tuned PEGASUS model outperformed all other LLMs on the quantitative evaluation metrics. When physicians assessed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. On average, physicians rated these personalized impressions as comparable in overall utility to impressions dictated by other physicians (4.03, P = 0.41). In summary, the comparative evaluation study demonstrated that personalized impressions generated by PEGASUS were clinically useful in most cases, highlighting the model’s potential to expedite PET reporting by automatically drafting impressions.

Performance of the 12 large language models (LLMs) evaluated by the 30 metrics referenced above. The X-axis displays the metrics arranged in descending order of correlation with physician preferences. For each evaluation metric, values underwent min–max normalization to allow comparison within a single plot. The star denotes the best model for each metric, and the circles denote the other models without a statistically significant difference (P > 0.05) from the best model (obtained from the comparative study’s publication).
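For reference, min–max normalization rescales each metric so its values span [0, 1] across the 12 models; the sketch below uses placeholder scores.

```python
# Sketch: min-max normalization of one metric's scores across models.
import numpy as np

scores = np.array([0.21, 0.34, 0.29, 0.41])  # placeholder: one metric, four models
normalized = (scores - scores.min()) / (scores.max() - scores.min())
print(normalized)  # best model maps to 1.0, worst to 0.0
```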

Claim: The fine-tuned large language model provides clinically useful, personalized impressions based on PET findings. To the best of the authors’ knowledge, this is the first attempt to automate impression generation for whole-body PET reports.

Key Points (from GitHub documentation comparing the 12 fine-tuned LLMs):

  • 📈 Among the 30 evaluation metrics, the domain-adapted BARTScore and PEGASUSScore exhibited the highest correlations with physician preferences (Spearman’s ρ = 0.568 and 0.563, respectively), yet they did not reach the level of inter-reader correlation (ρ = 0.654).
  • 🏆 Of all fine-tuned large language models, encoder-decoder models outperformed decoder-only models, with PEGASUS emerging as the top-performing model.
  • 🏅 In the reader study, three nuclear medicine physicians considered the overall utility of personalized PEGASUS-generated impressions to be comparable to clinical impressions dictated by other physicians.

PEGASUS-PET was fine-tuned based on Google’s PEGASUS LLM implementation.
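A minimal sketch of drafting an impression with a PEGASUS checkpoint through Hugging Face transformers is shown below; the public google/pegasus-large base model, the physician token, and the generation settings are placeholders, not the released PEGASUS-PET weights or the study’s exact inference code.

```python
# Sketch: generating a draft impression with a PEGASUS checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "google/pegasus-large"  # placeholder; a fine-tuned PET checkpoint would load the same way
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

report = (
    "PET/CT skull base to thigh.\n"
    "[PHYSICIAN_3]\n"
    "Findings: ...\n"
    "Indication: ..."
)
inputs = tokenizer(report, return_tensors="pt", truncation=True, max_length=1024)
summary_ids = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```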

Data Availability Statement: The radiology reports used in this study are not publicly available due to privacy concerns related to HIPAA. However, upon reasonable request and approval of a data use agreement, they can be made available for research purposes. The COG AHOD1331 clinical trial data are archived in the NCTN Data Archive.

GitHub Page

Hugging Face Page

Documentation: Usage, PET Human Experts Report Evaluation

Reference Publication
