PLoS One: A comparative analysis of large language models versus traditional information extraction methods for real-world evidence of patient symptomatology in acute and post-acute sequelae of SARS-CoV-2

    PLoS One. 2025 May 15;20(5):e0323535. doi: 10.1371/journal.pone.0323535. eCollection 2025.

    Vedansh Thakkar, Greg M Silverman, Abhinab Kc, Nicholas E Ingraham, Emma K Jones, Samantha King, Genevieve B Melton, Rui Zhang, Christopher J Tignanelli



    Abstract

    Background: Patient symptoms, crucial for disease progression and diagnosis, are often captured in unstructured clinical notes. Large language models (LLMs) offer potential advantages in extracting patient symptoms compared to traditional rule-based information extraction (IE) systems.
    Methods: This study compared fine-tuned LLMs (LLaMA2-13B and LLaMA3-8B) against BioMedICUS, a rule-based IE system, for extracting symptoms related to acute and post-acute sequelae of SARS-CoV-2 from clinical notes. The study utilized three corpora: UMN-COVID, UMN-PASC, and N3C-COVID. Prevalence, keyword, and fairness analyses were conducted to assess symptom distribution and model equity across demographics.
    Results: BioMedICUS outperformed the fine-tuned LLMs in most cases. On the UMN-PASC dataset, BioMedICUS achieved a macro-averaged F1-score of 0.70 for positive mention detection, compared to 0.66 for LLaMA2-13B and 0.62 for LLaMA3-8B. On the N3C-COVID dataset, BioMedICUS scored 0.75 for positive mention detection, while LLaMA2-13B and LLaMA3-8B scored 0.53 and 0.68, respectively. However, the LLMs performed better in specific instances, such as detecting positive mentions of change in sleep in the UMN-PASC dataset, where LLaMA2-13B (0.79) and LLaMA3-8B (0.65) outperformed BioMedICUS (0.60). In the fairness analysis, BioMedICUS generally showed stronger performance across patient demographics. Keyword analysis using ANOVA on symptom distributions across all three corpora showed that both corpus (df = 2, p < 0.001) and symptom (df = 79, p < 0.001) have a statistically significant effect on log-transformed term frequency-inverse document frequency (TF-IDF) values, with corpus accounting for 52% of the variance in log-TF-IDF values and symptom for 35%.
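The macro-averaged F1 metric reported above treats each label (e.g., positive vs. negated symptom mention) equally, regardless of how often it occurs. A minimal sketch of the computation, using only the standard library; the "pos"/"neg" labels and the toy gold/predicted sequences below are hypothetical illustrations, not data from the study:

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1: unweighted mean of per-label F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Hypothetical symptom-mention labels for six note snippets:
# "pos" = positive mention, "neg" = negated mention.
gold = ["pos", "pos", "neg", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "pos", "neg", "pos"]
print(round(macro_f1(gold, pred, ["pos", "neg"]), 3))  # -> 0.667
```

Because the mean is unweighted, a model that does well on the frequent class but poorly on a rare one is penalized, which is why macro-F1 is a common choice for imbalanced clinical-NLP label sets.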
    Conclusion: While BioMedICUS generally outperformed the LLMs, the LLMs showed promising results in specific areas; LLaMA3-8B in particular performed well at identifying negative symptom mentions. However, both LLaMA models faced challenges in demographic fairness and generalizability. These findings underscore the need for diverse, high-quality training datasets and robust annotation processes to enhance LLMs' performance and reliability in clinical applications.
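The variance-explained figures in the Results (corpus 52%, symptom 35%) correspond to eta-squared in an ANOVA: the between-group sum of squares for a factor divided by the total sum of squares. A minimal stdlib-only sketch of that computation on log-transformed TF-IDF values; the corpus/symptom rows and TF-IDF numbers below are made up for illustration and do not reproduce the paper's data or percentages:

```python
import math
from collections import defaultdict

# Hypothetical long-format records: (corpus, symptom, tfidf)
rows = [
    ("UMN-COVID", "cough", 0.12), ("UMN-COVID", "fatigue", 0.30),
    ("UMN-PASC",  "cough", 0.05), ("UMN-PASC",  "fatigue", 0.45),
    ("N3C-COVID", "cough", 0.20), ("N3C-COVID", "fatigue", 0.25),
]
y = [math.log(v) for _, _, v in rows]        # log-transformed TF-IDF
grand = sum(y) / len(y)
ss_total = sum((v - grand) ** 2 for v in y)  # total sum of squares

def eta_squared(key_idx):
    """Share of total variance explained by one factor (0=corpus, 1=symptom)."""
    groups = defaultdict(list)
    for row, v in zip(rows, y):
        groups[row[key_idx]].append(v)
    # Between-group sum of squares for this factor.
    ss = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values())
    return ss / ss_total

print(f"corpus eta^2:  {eta_squared(0):.2f}")
print(f"symptom eta^2: {eta_squared(1):.2f}")
```

In practice a two-way ANOVA (e.g., via statsmodels) would also report the df and p-values quoted in the abstract; this sketch isolates just the variance-partitioning step behind the 52%/35% interpretation.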

