Sci Rep
. 2025 Nov 28;15(1):42712. doi: 10.1038/s41598-025-26705-7.
Large language models versus classical machine learning performance in COVID-19 mortality prediction using high-dimensional tabular data
Mohammadreza Ghaffarzadeh-Esfahani 1 2 , Mahdi Ghaffarzadeh-Esfahani 2 , Aryan Salahi-Niri 1 , Hossein Toreyhi 1 , Zahra Atf 3 , Amirali Mohsenzadeh-Kermani 2 , Mahshad Sarikhani 4 , Zohreh Tajabadi 5 , Fatemeh Shojaeian 6 , Mohammad Hassan Bagheri 2 , Aydin Feyzi 7 , Mohamadamin Tarighat-Payma 4 , Narges Gazmeh 7 , Fateme Heydari 4 , Hossein Afshar 7 , Amirreza Allahgholipour 7 , Farid Alimardani 7 , Ameneh Salehi 4 , Naghmeh Asadimanesh 4 , Mohammad Amin Khalafi 4 , Hadis Shabanipour 7 , Ali Moradi 7 , Sajjad Hossein Zadeh 7 , Omid Yazdani 4 , Romina Esbati 4 , Moozhan Maleki 7 , Danial Samiei Nasr 4 , Amirali Soheili 4 , Hossein Majlesi 4 , Saba Shahsavan 4 , Alireza Soheilipour 4 , Nooshin Goudarzi 1 , Erfan Taherifard 8 , Hamidreza Hatamabadi 9 , Jamil S Samaan 10 , Thomas Savage 11 , Ankit Sakhuja 12 , Ali Soroush 12 , Girish Nadkarni 12 , Ilad Alavi Darazam 13 14 , Mohamad Amin Pourhoseingholi 15 16 , Seyed Amir Ahmad Safavi-Naini 17 18
Affiliations
This study compared the performance of classical feature-based machine learning models (CMLs) and large language models (LLMs) in predicting COVID-19 mortality using high-dimensional tabular data from 9,134 patients across four hospitals. Seven CML models, including XGBoost and random forest (RF), were evaluated alongside eight LLMs, such as GPT-4 and Mistral-7b, which performed zero-shot classification on text-converted structured data. Additionally, Mistral-7b was fine-tuned using the QLoRA approach. XGBoost and RF demonstrated superior performance among CMLs, achieving F1 scores of 0.87 and 0.83 for internal and external validation, respectively. GPT-4 led the LLM category with an F1 score of 0.43, while fine-tuning Mistral-7b markedly improved its recall from 0.01 to 0.79, yielding a stable F1 score of 0.74 during external validation. Although LLMs showed moderate performance in zero-shot classification, fine-tuning substantially enhanced their effectiveness, potentially bridging the gap with CML models. However, CMLs still outperformed LLMs in handling high-dimensional tabular data tasks. This study highlights the potential of both CMLs and fine-tuned LLMs in medical predictive modeling, while emphasizing the current superiority of CMLs for structured data analysis.
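The zero-shot setup described above requires converting each row of structured data into natural-language text before it can be sent to an LLM. A minimal sketch of such a serialization step is shown below; the feature names, values, and prompt wording are illustrative assumptions, not the study's actual feature set or prompt.

```python
# Hypothetical sketch: serialize one row of tabular patient data into a
# natural-language prompt for zero-shot LLM mortality classification.
# Feature names and values below are illustrative only.

def record_to_prompt(record: dict) -> str:
    """Turn a patient record (feature -> value) into a text prompt."""
    lines = [f"- {feature}: {value}" for feature, value in record.items()]
    features = "\n".join(lines)
    return (
        "A COVID-19 patient presents with the following findings:\n"
        f"{features}\n"
        "Question: Will this patient survive hospitalization? "
        "Answer with exactly one word: 'alive' or 'deceased'."
    )

patient = {
    "age": 67,
    "sex": "male",
    "oxygen saturation (%)": 88,
    "white blood cell count (10^9/L)": 11.2,
}
prompt = record_to_prompt(patient)
print(prompt)
```

In practice, the resulting string would be passed to the LLM's chat or completion API, and the one-word answer parsed back into a binary label for scoring against the held-out outcomes.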
Keywords: COVID-19 mortality; Fine-tuning; Large language models; Machine learning; Structured data; Zero-shot classification.
- PMID: 41315569
- PMCID: PMC12663554
- DOI: 10.1038/s41598-025-26705-7