iCHD — Predicting Heart Disease

Predicting the 10-year risk of developing Coronary Heart Disease using Machine Learning algorithms.

Alex K
15 min read · Mar 15, 2021

Every 48 seconds, 1 American dies as a result of heart disease.

That’s insane.

That works out to around 657,000 deaths per year in just the United States.

For reference, just under 600,000 people die annually from cancer. That makes heart disease the leading cause of death.

However, unlike cancer, for which there is currently no cure, heart disease is considered highly preventable. Studies have found that a combination of eating a healthy diet, getting regular physical exercise, and abstaining from smoking can prevent 80% of premature heart disease.

An unhealthy diet is the leading contributor to the world's leading cause of death.

Be that as it may, many don’t implement these preventative measures, leaving themselves at risk of developing heart disease. After all, what incentive do these people have to follow these potentially life-saving practices if they don’t feel at risk of developing heart disease in the first place? With a population in excess of 330 million people, perhaps many Americans simply assume that heart disease couldn’t possibly happen to them.

But they’re wrong. It definitely could, if they’re not careful. We need an incentive to persuade individuals at risk to take the required precautions.

But here’s the big problem with that approach: due to the multifactorial nature of the CHD risk factors, we don’t have a robust way of figuring out exactly who is at risk. Our medical system currently has no reliable strategy for dissuading people from habits that put them at risk, beyond telling these individuals: “you’re adopting habits that put you at risk”. That warning isn’t alarming enough to turn the tide in our battle against CHD.

Here’s where machine learning enters the equation. Using the Framingham Heart Study as the data source, we can leverage various ML algorithms to develop a screening tool capable of predicting whether a given patient has a 10-year risk of developing coronary heart disease (CHD). If we’re able to develop an accurate enough screening tool, we can warn patients with a sufficiently worrying CHD risk profile, by telling them: “our [xx]% accurate model anticipates that unless you scale back on the junk food consumption and exercise reluctance, you will get heart disease.”

‘Holy shit, I need to stop doing these things, or I’m going to die young’, the patient begins thinking to himself.

Although no studies have confirmed this as far as I’m aware, it seems intuitively obvious that delivering such a message would persuade a significant proportion of these at-risk individuals to scale back on CHD-inducing activities. The exact degree to which this would save lives is unclear. Having said that, even if just 1% of deaths caused by heart disease were prevented, 6,570 American lives would be saved every year.

Special thanks to Amayo Mordecai II, whose Medium post and GitHub repository this project is a replication of. Let’s get building!

1. Importing the Libraries

In order to avoid reinventing the wheel while training the CHD screening tool (acronymized as ‘iCHD’, in the interest of concision), we need to import several libraries. Since we’ll be building the iCHD in a Jupyter Notebook with Python, these include:

  • numpy (multi-dimensional arrays and matrices)
  • pandas (data analysis and manipulation tools)
  • matplotlib (data visualization and graphical plotting)
  • seaborn (statistical graphs based on matplotlib)

The following lines of code import the libraries listed above:
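A minimal sketch of what those imports look like (the aliases are just the usual conventions):

    import numpy as np                # multi-dimensional arrays and matrices
    import pandas as pd               # data analysis and manipulation
    import matplotlib.pyplot as plt   # plotting
    import seaborn as sns             # statistical graphs built on matplotlib

    # render plots directly inside the notebook
    %matplotlib inline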

Note that %matplotlib inline can only be invoked in IPython/Jupyter notebooks.

2. Loading the Dataset

Next, we need to import the dataset that we’ll use to train and test the iCHD. The data comes from an ongoing cardiovascular study of residents of Framingham, MA, and can be found on Kaggle here. The dataset contains 15 attributes (risk factors that may contribute to a 10-year risk of developing CHD), which can be grouped into 5 categories:

Demographic Profile

  • male (nominal; 0 = female, 1 = male)
  • age (continuous; truncated to whole numbers)

Cigarette Usage

  • currentSmoker (nominal; 0 = non-smoker, 1 = smoker)
  • cigsPerDay (continuous; estimated number of cigarettes smoked per day)

Historical Medical Profile

  • BPMeds (nominal; 0 = not on blood pressure medication, 1 = on Blood Pressure medication)
  • prevalentStroke (nominal; 0 = has not previously had a stroke, 1 = has previously had a stroke)
  • prevalentHyp (nominal; 0 = does not have hypertension, 1 = has hypertension)
  • diabetes (nominal; 0 = does not have diabetes, 1 = has diabetes)

Current Medical Profile

  • totChol (continuous; total cholesterol level)
  • sysBP (continuous; systolic blood pressure)
  • diaBP (continuous; diastolic blood pressure)
  • BMI (continuous; body mass index)
  • heartRate (continuous; heart rate)
  • glucose (continuous; blood sugar level)

Target Variable

  • TenYearCHD (nominal; 0 = there is not a 10-year risk of developing CHD, 1 = there is a 10-year risk of developing CHD)

The 16th attribute, education, was found to have no correlation with heart disease, and is accordingly dropped from the dataset. We do this, and load in the rest of the dataset, with the following 2 lines of code:
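A sketch of those 2 lines, assuming the Kaggle CSV has been saved locally as framingham.csv (the filename is an assumption):

    # load the Framingham dataset and drop the uncorrelated 'education' column
    df = pd.read_csv('framingham.csv')
    df = df.drop(columns=['education'])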

3. Exploratory Data Analysis

Missing Data

While skimming the dataset, it becomes clear that a small but non-negligible number of entries have missing values; a given patient, for instance, might have values for only 13 or 14 of the aforementioned attributes (or perhaps fewer), which increases the likelihood of the iCHD’s Accuracy suffering as a result.

The first step in addressing this potentially problematic observation is investigating the nature of these missing entries. First, we find the total percentage of missing data as follows:
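One way to arrive at the figure below, measuring missing cells relative to the number of rows (a sketch, not necessarily the original code):

    # total percentage of missing data, relative to the number of patients
    missing_total = df.isnull().sum().sum() / len(df) * 100
    print(f'The total percentage of missing data is {missing_total:.2f}%')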

The total percentage of missing data is 12.74%

We can also get a percentage breakdown by attribute (in table form) for the missing data with the following 5 lines:
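A sketch of that breakdown (the column labels in the summary table are assumptions):

    # per-attribute count and percentage of missing values, sorted descending
    missing_count = df.isnull().sum().sort_values(ascending=False)
    missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
    missing_df = pd.concat([missing_count, missing_pct], axis=1,
                           keys=['Count', 'Percentage'])
    print(missing_df)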

We can also represent this data graphically as follows:
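For instance, as a bar chart of the per-attribute missing percentages computed above (the chart type is an assumption):

    # bar chart of the percentage of missing values per attribute
    plt.figure(figsize=(10, 5))
    sns.barplot(x=missing_pct.index, y=missing_pct.values)
    plt.xticks(rotation=45)
    plt.ylabel('% of values missing')
    plt.show()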

There are 2 key findings from all of this:

  1. We’re not missing that much data: only 12.74% of the entries are missing
  2. A huge proportion (71.8%) of these entries are for the glucose attribute

It would therefore be sensible to remove all the patients with at least 1 missing data point, rather than risk the possibility that these missing entries adversely affect the model’s performance. We achieve this with the lines of code shown below, leaving our dataset with 3751 complete rows:
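A sketch of that step:

    # drop every patient with at least one missing value, then check the shape
    df = df.dropna()
    print(df.shape)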

(3751, 15)

Data Distribution

The next thing on the agenda is ascertaining exactly how many of these 3751 patients did/didn’t end up developing heart disease 10 years down the road:
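Something along these lines (a sketch):

    # count patients with and without a 10-year CHD risk
    counts = df['TenYearCHD'].value_counts()
    print(f'There are {counts[0]} patients without heart disease '
          f'and {counts[1]} patients with the disease')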

There are 3179 patients without heart disease and 572 patients with the disease

We can clearly see that the dataset is imbalanced. This imbalance is also reflected in histograms for each of the 14 non-target attributes, as seen here:
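One way to produce those histograms (a sketch):

    # histogram for each of the 14 non-target attributes
    df.drop(columns=['TenYearCHD']).hist(figsize=(14, 12))
    plt.tight_layout()
    plt.show()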

It’s easy to see which 8 of our 14 non-target attributes are continuous.

As seen above, not a single one of our 3751 patients had previously had a stroke, which is somewhat intriguing. Furthermore, fewer of our patients than expected were diabetic or on blood pressure medication.

CHD Risk against Age

We can now begin to investigate the correlations between individual attributes and 10-year CHD risk, starting with the age attribute:
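A sketch of the age plot, assuming a seaborn count plot (the chart type is an assumption):

    # counts of patients with/without a 10-year CHD risk, per age
    plt.figure(figsize=(14, 5))
    sns.countplot(x='age', hue='TenYearCHD', data=df)
    plt.show()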

As reflected by the blue bars, the most common ages for patients with CHD were between 51 and 63. However, this approach to data representation warrants some critique: the y-axis gauges only the raw number (not the percentage) of people with or without CHD. Very few people older than 68, for example, took part in the study, so it’s natural for the raw number of people aged 68 or older with CHD to be low. In fact, the graph shows that a greater percentage of 68-year-olds had CHD than any other age group, indicating that CHD risk keeps climbing with age.

Data Analysis

A smarter approach, by contrast, consists of representing the data in such a way that the proportions of those with and without CHD are easily visible; bar charts enable us to achieve this, as shown below for 5 of the remaining attributes:
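A sketch of how such charts might be produced; with a binary target, seaborn's barplot shows the proportion of at-risk patients in each category:

    # proportion of patients with a 10-year CHD risk for 5 binary attributes
    features = ['male', 'currentSmoker', 'BPMeds', 'prevalentHyp', 'diabetes']
    fig, axes = plt.subplots(1, 5, figsize=(18, 4))
    for ax, col in zip(axes, features):
        # bar height = mean of TenYearCHD, i.e. the share of at-risk patients
        sns.barplot(x=col, y='TenYearCHD', data=df, ax=ax)
    plt.tight_layout()
    plt.show()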

We can draw several takeaways from these bar charts, such as the fact that being male or hypertensive, having diabetes, or taking blood pressure meds are all correlated with an increased 10-year likelihood of developing CHD. Interestingly, smoking seems to have little to no effect on CHD risk, a result which is quite intriguing, since it is well known that chemicals in cigarette smoke increase the risk of blood clotting in arteries and veins.

Correlation Heat Map

Finally, we can create a heat map to visualize the correlations between the 15 attributes (which includes the target variable), with the following process:
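A sketch of that heat map:

    # correlation heat map across all 15 attributes
    plt.figure(figsize=(12, 10))
    sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap='coolwarm')
    plt.show()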

As we can see, not a single one of our 14 non-target attributes has a correlation of greater than 0.23 with the 10-year risk of developing CHD. By contrast, a few of our non-target attributes are strongly correlated with one another (such as currentSmoker and cigsPerDay, diabetes and glucose, etc).

4. Top Features

Selection

It is worthwhile to reduce the number of input variables used to develop the iCHD; this lets us train the model faster and can improve our anticipated Accuracy. To achieve this, we’ll be using the Boruta algorithm, a statistically grounded feature selection method. In a nutshell, Boruta adds shuffled ‘shadow’ copies of each feature, trains a classifier (which, in our case, is a random forest) on the extended dataset, and compares each real feature’s importance against the best shadow feature; this process repeats for up to 100 iterations, as seen here:
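A sketch of that selection step using the boruta package (the random forest settings and random_state are assumptions; the 100 iterations come from the description above):

    from sklearn.ensemble import RandomForestClassifier
    from boruta import BorutaPy

    X_all = df.drop(columns=['TenYearCHD']).values
    y_all = df['TenYearCHD'].values
    feature_names = df.drop(columns=['TenYearCHD']).columns

    # random forest as the underlying estimator, up to 100 Boruta iterations
    rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
    boruta = BorutaPy(rf, n_estimators='auto', max_iter=100, random_state=1)
    boruta.fit(X_all, y_all)

    # features Boruta confirmed as important
    print(list(feature_names[boruta.support_]))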

['age', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']

Statistics

The algorithm has identified the 7 most important features in our dataset, halving the original 14 non-target attributes. We can get a concise summary of the strength of the association between each of these 7 features and our target variable, TenYearCHD, using the Odds Ratio. The code used and the resulting table, with lower and upper confidence bounds (labelled CI 5% and CI 95%), are shown below:
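A sketch of that calculation using statsmodels (the column labels simply follow the table below; statsmodels' default conf_int() returns the bounds of a 95% interval):

    import statsmodels.api as sm

    top_features = ['age', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', 'glucose']
    logit_model = sm.Logit(df['TenYearCHD'], df[top_features]).fit()

    # exponentiate the coefficients and their confidence bounds to get odds ratios
    conf = logit_model.conf_int()
    conf['Odds Ratio'] = logit_model.params
    conf.columns = ['CI 5%', 'CI 95%', 'Odds Ratio']
    print(np.exp(conf))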

               CI 5%     CI 95%    Odds Ratio
    age        1.011381  1.033813    1.022536
    totChol    0.994963  0.999184    0.997071
    sysBP      1.018236  1.031493    1.024843
    diaBP      0.962258  0.984627    0.973378
    BMI        0.929304  0.973798    0.951291
    heartRate  0.963690  0.977730    0.970685
    glucose    1.001074  1.007518    1.004291

Finally, we can create 49 pair plots: a 7×7 scatterplot matrix that lets us see both the distribution of each of the 7 selected attributes (along the diagonal) and the pairwise relationships between them, with points coloured by our target variable, TenYearCHD:
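A sketch using seaborn's pairplot (colouring by TenYearCHD is an assumption):

    # 7x7 grid of pair plots for the selected features, coloured by CHD risk
    sns.pairplot(df[top_features + ['TenYearCHD']], hue='TenYearCHD')
    plt.show()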

5. Data Wrangling

SMOTE

Having previously identified our dataset as imbalanced, we need to balance it in order to avoid the iCHD suffering from poor sensitivity or specificity. We achieve this with SMOTE, an oversampling technique which uses a k-nearest-neighbour approach to generate synthetic samples of the minority class. In plain English, SMOTE increases the share of minority cases with synthetic (rather than duplicated) samples, which helps avoid the overfitting that plain random oversampling can cause. The following lines of code implement this SMOTE approach and output the results:
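A sketch using imblearn's SMOTE; a sampling_strategy of 0.8 roughly reproduces the class counts below, but the exact parameters are assumptions:

    from collections import Counter
    from imblearn.over_sampling import SMOTE

    X = df[top_features]
    y = df['TenYearCHD']

    # oversample the minority (at-risk) class with synthetic examples
    smote = SMOTE(sampling_strategy=0.8, random_state=1)
    X_resampled, y_resampled = smote.fit_resample(X, y)

    print(Counter(y), Counter(y_resampled))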

{0: 3179, 1: 572} {0: 3178, 1: 2543}

We can visualize this incredible transformation with a before-after graph:
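A sketch of one way to draw that comparison, as two side-by-side count plots (the original chart may have looked different):

    # class balance before and after SMOTE
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.countplot(x=y, ax=axes[0]).set_title('Before SMOTE')
    sns.countplot(x=y_resampled, ax=axes[1]).set_title('After SMOTE')
    plt.show()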

Positive cases have gone from occupying just 15.2% of our dataset pre-SMOTE to 44.4% post-balancing, as seen here:

Splitting the Data

Now that we’ve sufficiently balanced our dataset, we can split it into training data, which we use to build the model, and testing data, which gives us an unbiased evaluation of the now-built iCHD. We achieve this with sklearn’s model_selection module, as seen here:
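A sketch using train_test_split; the 20% test size is inferred from the 1145 test patients reported later, and the random_state is an assumption:

    from sklearn.model_selection import train_test_split

    # hold out 20% of the balanced data for testing
    X_train, X_test, y_train, y_test = train_test_split(
        X_resampled, y_resampled, test_size=0.2, random_state=1)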

Feature Scaling

The final stage of the data pre-processing component of our program is feature scaling, which involves normalizing the input data to ensure that all the features contribute equally to the model’s output, as shown here:
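One common way to do this (StandardScaler is an assumption; a MinMaxScaler would work similarly):

    from sklearn.preprocessing import StandardScaler

    # fit the scaler on the training data only, then apply it to both splits
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)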

6. Model Predictions

At last, we can move on to actually building our iCHD using our training data. Naturally, there are several different ML algorithms which can be leveraged to train our classifier. In order to maximize our odds of creating as successful a model as possible, we will train 4 different classifiers, each of which corresponds to a different ML algorithm.

Logistic Regression

Our first classifier will be trained using Logistic Regression; this efficient statistical method can be thought of in terms of 2 different aspects:

  • The ‘regression’ aspect, which involves weighting parameters so that the curve fits the data as closely as possible.
  • The ‘logistic’ aspect, which is just the shape of said curve; in this case, it will be the S-shaped curve known as the logistic function.

First, we search for the optimal parameters using grid search, as follows:
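A sketch of that grid search (the candidate values and number of folds are assumptions; solver='liblinear' is used because it supports the l1 penalty):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # candidate hyperparameters for the logistic regression classifier
    param_grid = {'C': [0.01, 0.1, 1, 10],
                  'penalty': ['l1', 'l2'],
                  'class_weight': [None, 'balanced']}

    grid = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5)
    grid.fit(X_train, y_train)
    print(grid.best_params_)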

{'C': 0.1, 'class_weight': None, 'penalty': 'l1'}

Having found the optimal parameters, we can train the classifier with just 2 lines:
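For example, a sketch using the best parameters reported above:

    # train the classifier with the best parameters found by the grid search
    logreg = LogisticRegression(C=0.1, penalty='l1', class_weight=None, solver='liblinear')
    logreg.fit(X_train, y_train)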

Next, we are able to evaluate the performance of our now-trained classifier on our testing data, and obtain the resulting Accuracy, as shown below:
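A sketch of that evaluation:

    from sklearn.metrics import accuracy_score

    y_pred = logreg.predict(X_test)
    acc = accuracy_score(y_test, y_pred) * 100
    print(f'Using logistic regression we get an accuracy of {acc:.1f}%')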

Using logistic regression we get an accuracy of 67.6%

With an Accuracy of just 67.6%, this classifier clearly didn’t do a great job at identifying which patients had a 10-year CHD risk. We can get a visual breakdown of what went wrong with a confusion matrix, as follows:
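A sketch of that confusion matrix (the colour map is an assumption):

    from sklearn.metrics import confusion_matrix

    # rows = actual class, columns = predicted class
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Greens')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()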

Interestingly, there seem to be about as many false positives as false negatives. However, since more of our testing data patients didn’t have a 10-year CHD risk in the first place, there’s a 10.2 percentage-point gap between the proportion of patients without a CHD risk that the model correctly identified (72.0%) and the proportion of patients with a CHD risk that it correctly pinpointed (61.8%).

Beyond the Logistic Regression classifier’s Accuracy, we can get a full report that includes other useful classification metrics:
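A sketch using sklearn's classification_report:

    from sklearn.metrics import classification_report

    print(classification_report(y_test, y_pred))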

              precision    recall  f1-score   support

           0       0.71      0.72      0.72       647
           1       0.63      0.62      0.62       498

    accuracy                           0.68      1145
   macro avg       0.67      0.67      0.67      1145
weighted avg       0.68      0.68      0.68      1145

We can also obtain the overall F-1 score (the harmonic mean of the classifier’s Precision and Recall, giving a balanced picture of performance):
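A sketch of that calculation:

    from sklearn.metrics import f1_score

    f1 = f1_score(y_test, y_pred) * 100
    print(f'The f1 score for logistic regression is {f1:.2f}%')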

The f1 score for logistic regression is 62.41%

Again, this score is fairly low — we’re ideally looking for an F-1 score of >85%.

The final way in which we evaluate the classifier’s performance is by finding the Area Under the ROC Curve (AUC). This visual measure is quite useful, as it aggregates performance metrics across all possible classification thresholds. Here’s our Logistic Regression AUC and the code used to obtain it:
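A sketch of how the ROC curve and AUC might be produced:

    from sklearn.metrics import roc_curve, roc_auc_score

    # probabilities for the positive class drive the ROC curve
    y_prob = logreg.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    auc = roc_auc_score(y_test, y_prob)

    plt.plot(fpr, tpr, label=f'Logistic Regression (AUC = {auc:.3f})')
    plt.plot([0, 1], [0, 1], linestyle='--')   # chance line
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.legend()
    plt.show()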

Overall, the Logistic Regression classifier wasn’t as successful as we might’ve hoped; fortunately, we still have 3 more shots at training a classifier with an Accuracy and F-1 Score of over 85%. A large majority of the code needed to train these remaining classifiers is structurally identical to the code used for the Logistic Regression method.

k-Nearest Neighbours

The algorithm at the heart of our second classifier is k-NN. Quite simple but exceptionally effective, k-NN classifies data points according to their similarity with other data points. That’s it. Very little training is involved at all. k-NN makes no generalizations or assumptions; the model simply makes its best, educated guess for each data point.

After ‘training’ our k-NN classifier, we validate its performance in the same ways as before, beginning with the model’s Accuracy and F-1 Score:
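A sketch of the k-NN classifier (the value of k is an assumption; the original may have tuned it):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score, f1_score

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    knn_pred = knn.predict(X_test)

    print(f'Using k-nearest neighbours we get an accuracy of '
          f'{accuracy_score(y_test, knn_pred) * 100:.2f}%')
    print(f'The f1 score for k-nearest neighbours is '
          f'{f1_score(y_test, knn_pred) * 100:.2f}%')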

Using k-nearest neighbours we get an accuracy of 82.53%
The f1 score for k-nearest neighbours is 82.27%

Already, we can see the k-NN classifier significantly outperforming the Logistic Regression-based model. Here’s the k-NN confusion matrix:

As seen above, an outstanding 93.2% of patients with a 10-year CHD risk were correctly identified as such. However, the classifier did end up producing quite a few false positives (the light green box) — this is further reflected in the full classification report and AUC graph shown below:

              precision    recall  f1-score   support

           0       0.93      0.74      0.83       647
           1       0.74      0.93      0.82       498

    accuracy                           0.83      1145
   macro avg       0.84      0.84      0.83      1145
weighted avg       0.85      0.83      0.83      1145

Overall, the k-NN classifier did a pretty good job. Can one of our remaining 2 classifiers trump its performance?

Decision Trees

Our third classifier takes a graph-theoretic approach to predictive modelling: in this tree-structured classifier, internal nodes correspond to attributes of the dataset, branches correspond to decision rules, and leaf nodes denote outcomes (i.e. class labels). A prediction for a given patient is produced by recursively answering simple yes/no questions at each level of the tree. Here are the classifier’s Accuracy and F-1 Score:
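A sketch of the Decision Tree classifier (default settings shown; the original may have tuned the tree):

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, f1_score

    tree = DecisionTreeClassifier(random_state=1)
    tree.fit(X_train, y_train)
    tree_pred = tree.predict(X_test)

    print(f'Using decision trees we get an accuracy of '
          f'{accuracy_score(y_test, tree_pred) * 100:.1f}%')
    print(f'The f1 score for decision trees is '
          f'{f1_score(y_test, tree_pred) * 100:.2f}%')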

Using decision trees we get an accuracy of 72.4%
The f1 score for decision trees is 67.62%

As seen above, the Decision Trees classifier performs slightly better than the Logistic Regression model, but sits in the same lacklustre ballpark. This similarity of outcomes is reflected in this classifier’s confusion matrix:

Fascinatingly, the ratio between the percentage of patients correctly identified as not having a CHD risk and the percentage correctly identified as having a CHD risk is 1.164:1 for the Decision Trees classifier. For comparison, the same metric for the Logistic Regression classifier is 1.165:1.

These salient similarities are further illustrated in the full report/AUC graphs:

                precision   recall   f1-score  support

0 0.93 0.74 0.83 647
1 0.74 0.93 0.82 498

accuracy 0.83 1145
macro avg 0.84 0.84 0.83 1145
weighted avg 0.85 0.83 0.83 1145

With the k-NN approach clearly still representing the strongest classifier out of the 3 we’ve considered so far, we have just 1 model left to consider.

Support Vector Machine

SVM is a supervised learning algorithm which separates classes with a decision boundary and classifies patients according to which side of the boundary their data points fall on. SVM is known for having several significant advantages, including the use of:

  • regularization parameters (instrumental in mitigating over-fitting)
  • the kernel trick (implicitly maps the features into a higher-dimensional space, where individual instances can be separated linearly)

It is therefore not especially surprising that the SVM’s Accuracy and F-1 Score are the highest of all 4 classifiers that we’ve considered, as seen here:
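A sketch of the SVM classifier (the RBF kernel is an assumption):

    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score, f1_score

    # probability=True lets us draw a ROC curve later
    svm = SVC(kernel='rbf', probability=True, random_state=1)
    svm.fit(X_train, y_train)
    svm_pred = svm.predict(X_test)

    print(f'Using SVM we get an accuracy of '
          f'{accuracy_score(y_test, svm_pred) * 100:.2f}%')
    print(f'The f1 score for SVM is {f1_score(y_test, svm_pred) * 100:.2f}%')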

Using SVM we get an accuracy of 86.46%
The f1 score for SVM is 85.31%

Indeed, as depicted in the confusion matrix below, the SVM classifier misfired in just 155 out of a total possible 1145 trials — an impressive feat:

The other classification metrics are summarized in the table below:

              precision    recall  f1-score   support

           0       0.92      0.83      0.87       647
           1       0.81      0.90      0.85       498

    accuracy                           0.86      1145
   macro avg       0.86      0.87      0.86      1145
weighted avg       0.87      0.86      0.87      1145

Finally, as depicted below, the SVM classifier’s AUC works out to 0.924, an outstanding result compared to the values of 0.725, 0.838, and 0.773 for the Logistic Regression, k-NN and Decision Tree classifiers respectively:

Comparison

Having obtained the Accuracy, F-1 Score and AUC (3 metrics) for each of our classifiers, we can concisely summarize these 12 data points in a tabular form:
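Classifier                Accuracy   F-1 Score     AUC
Logistic Regression         67.60%      62.41%   0.725
k-Nearest Neighbours        82.53%      82.27%   0.838
Decision Trees              72.40%      67.62%   0.773
Support Vector Machine      86.46%      85.31%   0.924

(All figures are collected from the results reported above.)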

The Support Vector Machine classifier scores highest for all 3 metrics. However, this isn’t to say that the other classifiers performed badly, per se — they didn’t (especially k-NN). Perhaps this is best grasped in a visual format:
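For instance, as a grouped bar chart built from the same figures (a sketch):

    # grouped bar chart comparing the three metrics across the four classifiers
    metrics = pd.DataFrame({
        'Accuracy': [0.676, 0.8253, 0.724, 0.8646],
        'F-1 Score': [0.6241, 0.8227, 0.6762, 0.8531],
        'AUC': [0.725, 0.838, 0.773, 0.924]},
        index=['Logistic Regression', 'k-NN', 'Decision Trees', 'SVM'])
    metrics.plot(kind='bar', figsize=(10, 5), rot=0)
    plt.ylabel('Score')
    plt.show()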

Accordingly, we can ditch the Logistic Regression, k-NN, and Decision Tree classifiers, and adopt the Support Vector Machine approach as our iCHD.

Closing Thoughts

With an Accuracy of approximately 86%, the core idea behind the iCHD appears to have been validated. To meaningfully improve on this result, we need to improve both the quantity and quality of our dataset:

  • Quantity: We only had a total of 3751 patients in our dataset, which is not particularly substantial. Indeed, with 15,000 patients, for example, assuming the dataset split remains unchanged, it wouldn’t be totally unreasonable to expect our Accuracy to creep into the low 90s.
  • Quality: In building the iCHD, most of the data on patients with a 10-year CHD risk was artificially synthesized using SMOTE. While this option is certainly preferable to working with an even more unbalanced dataset, nothing beats having truly representative data from real people. With these 2 ideas implemented, a 95%+ Accuracy isn’t out of the question.

1,800 Americans die every single day from a disease that’s highly preventable. It doesn’t have to be this way. With the appropriate improvements made, widespread distribution and usage, and effective communication of the results by doctors, the iCHD could put an end to this tragedy once and for all.

Special thanks once again to Amayo Mordecai II for his Medium post and accompanying dataset and code!


Alex K

17 y/o researcher in Machine Learning & Computational Biology.