Machine Learning Model to Predict COVID-19 Severity in Patients

To this day, the novel coronavirus disease continues to pose a critical threat to global health. The outbreak spread rapidly, from the first reported cases in December 2019 in China's Hubei province to over 170 million reported cases worldwide and nearly 4 million deaths by June 2021 (reference). As the vaccine rollout gathers pace and economies start to reopen, it is worth noting that many underdeveloped countries remain at risk as the pandemic continues to strain medical systems worldwide. Although the COVID-19 survival rate is relatively high compared to past pandemics (a mortality rate of about 2% of total closed cases), the number of severe cases resulting in death increases daily. If the patients most at risk could be identified early, this would significantly improve their chances of survival.

This is where the work of Sumayh S. Aljameel, Irfan Ullah Khan, Nida Aslam, Malak Aljabri, and Eman S. Alsulmi from the College of Computer Science and Information Technology in Saudi Arabia comes in. These data scientists created a machine learning model that identifies at-risk patients and acts as an early warning system. The model, built from King Fahad University Hospital's patient data, aims to help healthcare workers take preventive measures by identifying the patients who will need the most care before any autoimmune failure caused by COVID. I wanted to focus on this work, and the work of so many others, to demonstrate how artificial intelligence and ML techniques can play a crucial role in reducing the spread of this virus and of the ones to follow.

Summary of Work and Preprocessing Steps Taken

The data examined contained demographic and clinical data for COVID-19-positive patients at King Fahad University Hospital from 04/30/2020 to 07/24/2020. Of the 287 records, 243 patients survived and 44 did not. Sadly, these outcomes gave the researchers the labels for a binary classification problem.

The features included symptom-based ones (body temperature at admission and discharge, and whether shortness of breath, fever, or cough was present) as well as any history of chronic diseases. Not having much of a medical background, I found it interesting to see which variables were considered helpful in a demographic/biological problem.

One of the real challenges in combating this disease is the accidental spread caused by asymptomatic carriers. In the original dataset, only 5% of the patients were asymptomatic at the initial diagnosis. The researchers grouped these patients, along with other symptom features occurring with a frequency in the 2–49% range, into the sym_others attribute.

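SMOTE (Synthetic Minority Over-sampling Technique) was used to address the large class imbalance between deceased and surviving patients, so that the model could learn the pattern of the minority class instead of defaulting to the majority. SMOTE, available in Python through the imbalanced-learn package (a companion library to scikit-learn), balances the class distribution by synthesizing new minority examples rather than simply replicating existing ones: each new instance is placed at a random point between an existing minority instance and one of its nearest minority-class neighbors.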
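
To make the interpolation idea concrete, here is a minimal pure-Python sketch. It is a simplified illustration, not the imbalanced-learn implementation and not the authors' code; the function name smote_sketch, the toy points, and the choice of k are all my own assumptions.

```python
import math
import random

def smote_sketch(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not point),
            key=lambda p: math.dist(point, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(point, neighbour))
        )
    return synthetic

# Toy minority class with two made-up numeric features (not the study's data).
rng = random.Random(0)
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2, rng=rng)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, which is why SMOTE adds variety instead of exact copies. For scale: with 243 survivors and 44 deceased patients, a fully balanced training set would need 199 synthetic minority records, although I don't know the exact ratio the authors targeted.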

Predictive Models

After all preprocessing steps had been completed, three classification algorithms were used for model creation (logistic regression, Random Forest, and XGBoost). This blog will now go into detail about each algorithm and how well it performed in the study.

Logistic Regression

Logistic regression is an elementary classification algorithm used when the label is dichotomous (a binary problem). Instead of fitting a straight line to the data, logistic regression fits an S-shaped sigmoid curve to the observations. The curve's output lies between 0 and 1, so a threshold at y = 0.5 is used to classify observations. By computing the sigmoid of a weighted sum of the features X, we get the probability of an observation belonging to one of the two categories (whether a patient will survive or not).
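
As a small illustration of the sigmoid-and-threshold step, here is a sketch in plain Python. The weights, intercept, and feature values are hypothetical numbers I made up for the example; they are not the coefficients from the study.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Probability of the positive class for one observation."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

def classify(probability, threshold=0.5):
    """Assign class 1 when the probability reaches the threshold, else class 0."""
    return 1 if probability >= threshold else 0

# Hypothetical model with two features (illustration only).
weights = [0.8, -1.2]
intercept = -0.5
prob = predict_probability([2.0, 0.5], weights, intercept)
label = classify(prob)
print(prob, label)
```

A score of exactly zero maps to a probability of 0.5, which is why the 0.5 threshold corresponds to the point where the model is undecided between the two classes.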

Logistic Regression Parameters

With logistic regression, we can also apply a regularization penalty that reduces the chance of overfitting (for example L2, also known as ridge; the lasso penalty is L1). Through regularization, coefficient estimates shrink towards zero. In other words, this technique discourages learning a more complex or flexible model, avoiding the risk of overfitting (a model that learns too much of the noise in the training dataset and can't make accurate predictions when encountering new data). After conducting a grid search, the scientists found the above parameters were the optimal combination for this specific model.
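
To see the shrinking effect of an L2 penalty, here is a minimal sketch that fits a one-weight logistic model (no intercept, for brevity) by gradient descent, once without a penalty and once with one. The toy data, learning rate, and penalty strength are assumptions chosen for illustration, not values from the study.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(xs, ys, l2_strength, lr=0.1, steps=500):
    """Gradient descent on mean log-loss + (l2_strength / 2) * w**2 for one weight."""
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean log-loss with respect to w
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        grad += l2_strength * w  # gradient of the L2 penalty term
        w -= lr * grad
    return w

# Made-up, slightly noisy toy data: larger x tends to mean class 1.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
ys = [0, 0, 1, 0, 1, 1]

w_plain = fit_weight(xs, ys, l2_strength=0.0)
w_ridge = fit_weight(xs, ys, l2_strength=1.0)
print(w_plain, w_ridge)
```

The penalized weight ends up closer to zero than the unpenalized one, which is the sense in which regularization yields a less flexible model.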

Logistic Regression AUC score

Random Forest

Random Forest is an ensemble machine learning algorithm that can be used for either regression or classification tasks. An ensemble method combines a group of weak learners (ML algorithms whose accuracy is only slightly better than random guessing) to develop a more robust and accurate model. A Random Forest consists of many decision trees, each trained on a random sample of the training set (a random subset of its observations and features); because the trees are independent of one another, they can be built in parallel. The algorithm then collects the prediction of each decision tree and aggregates them via a voting system to determine the overall prediction.
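
The two key ingredients, bootstrap sampling and majority voting, can be sketched in a few lines of plain Python. This is a deliberately tiny illustration under my own assumptions: each "tree" is a one-split stump, there is a single made-up feature (so the random feature selection of a real Random Forest is left out), and the function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows):
    """Fit a one-split tree on (x, label) rows by picking the split with fewest errors."""
    best = None
    for threshold, _ in rows:
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= threshold else right) != y for x, y in rows)
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return lambda x: left if x <= threshold else right

def fit_forest(rows, n_trees, rng):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]
        forest.append(fit_stump(sample))
    return forest

def predict(forest, x):
    """Aggregate the individual tree predictions by majority vote."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up toy data: (feature value, class label).
rows = [(36.5, 0), (36.8, 0), (37.0, 0), (37.2, 0),
        (38.5, 1), (38.9, 1), (39.2, 1), (39.6, 1)]
forest = fit_forest(rows, n_trees=25, rng=random.Random(42))
print(predict(forest, 36.0), predict(forest, 40.0))
```

An individual stump trained on an unlucky sample can be wrong, but the vote across many stumps smooths those mistakes out, which is the intuition behind bagging.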

Random Forest grid parameters

For this problem, the Saudi data scientists used 100 decision trees with a maximum depth of 15; this set of parameters was again obtained via a grid search and displayed outstanding classification metrics, as seen below.

Random Forest AUC score


XGBoost

XGBoost is an ensemble learning method that, like a random forest, combines simple decision trees, producing a fast and efficient machine learning model that is ubiquitous in the data science world. However, XGBoost differs from random forest in the meta-ensemble algorithm used to combine the basic decision trees: boosting rather than bagging. With boosting, the trees are built sequentially, with each new tree trained to correct the errors of the trees before it, until some stopping threshold has been reached.
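
The sequential nature of boosting is easiest to see in code. The sketch below is plain gradient boosting with squared error and one-split trees, written in pure Python. It is not XGBoost itself, which adds regularization and other refinements; the function names, toy data, and learning rate are my own assumptions.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares one-split tree: predicts the mean residual on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_rounds, lr):
    """Each round fits a stump to the errors left by the previous rounds."""
    base = sum(ys) / len(ys)
    trees = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_regression_stump(xs, residuals)
        trees.append(tree)
        preds = [p + lr * tree(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(tree(x) for tree in trees)

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Made-up toy data (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

model_0 = boost(xs, ys, n_rounds=0, lr=0.3)
model_1 = boost(xs, ys, n_rounds=1, lr=0.3)
model_20 = boost(xs, ys, n_rounds=20, lr=0.3)
print(mse(model_0, xs, ys), mse(model_1, xs, ys), mse(model_20, xs, ys))
```

Unlike the independent trees of a random forest, each stump here depends on everything that came before it, and the training error drops as more rounds are added.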

XGBoost Parameters

As a regularized form of the gradient boosting algorithm, XGBoost is a beloved predictive algorithm in machine learning. The above set of parameters was again the one that gave this XGBoost model the best performance.

XGBoost AUC Score


As a newcomer to data science, I find it amazing to see what tangible outcomes can be achieved using machine learning models. This pandemic has taken so many innocent lives; projects like the one examined in this blog are among the many tools now at the disposal of health care workers to counteract the spread and prevent further loss. The work of these data scientists has inspired me to seek ways of aiding those who suffered severely over the last year and a half via some data science solution. As to what that will be, subscribe to find out :-).