Heart disease is one of the most life-threatening conditions and has received considerable attention in medical research. Diagnosing heart disease is a difficult task, and automating it can provide better predictions about a patient's heart condition so that further treatment can be planned effectively. A diagnosis is usually based on the patient's signs, symptoms, and physical examination. Attributes such as resting blood pressure, cholesterol, age, sex, type of chest pain, fasting blood sugar, ST depression, and exercise-induced angina can all help predict the likelihood of a heart attack. In this analysis, models such as decision trees, random forests, and a generalized linear model (GLM) are trained on the dataset to predict the target class: 0 = less chance of a heart attack, 1 = more chance of a heart attack.
install.packages("rpart.plot")
install.packages("rattle")
install.packages("randomForest")
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
also installing the dependencies ‘bitops’, ‘XML’
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)
library(rpart) #used for building classification and regression trees.
library(rpart.plot)
library(RColorBrewer) # helps choose sensible colour schemes for figures in R.
library(rattle) # provides a collection of utility functions for the data scientist.
library(randomForest) #has the function randomForest() which is used to create and analyse random forests.
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
randomForest 4.7-1
Type rfNews() to see new features/changes/bug fixes.

Attaching package: ‘randomForest’

The following object is masked from ‘package:rattle’:

    importance
data = read.csv("/content/heart_health.csv")
print("Minimum resting blood pressure")
min(data$trestbps)
print("Maximum resting blood pressure")
max(data$trestbps)
print("Summary of Dataset")
summary(data)
[1] "Minimum resting blood pressure"
[1] "Maximum resting blood pressure"
[1] "Summary of Dataset"
      age             sex               cp            trestbps    
 Min.   :29.00   Min.   :0.0000   Min.   :0.000   Min.   : 94.0  
 1st Qu.:47.50   1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:120.0  
 Median :55.00   Median :1.0000   Median :1.000   Median :130.0  
 Mean   :54.37   Mean   :0.6832   Mean   :0.967   Mean   :131.6  
 3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:140.0  
 Max.   :77.00   Max.   :1.0000   Max.   :3.000   Max.   :200.0  
      chol            fbs            restecg          thalach     
 Min.   :126.0   Min.   :0.0000   Min.   :0.0000   Min.   : 71.0  
 1st Qu.:211.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:133.5  
 Median :240.0   Median :0.0000   Median :1.0000   Median :153.0  
 Mean   :246.3   Mean   :0.1485   Mean   :0.5281   Mean   :149.6  
 3rd Qu.:274.5   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:166.0  
 Max.   :564.0   Max.   :1.0000   Max.   :2.0000   Max.   :202.0  
     exang           oldpeak         slope             ca        
 Min.   :0.0000   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000  
 Median :0.0000   Median :0.80   Median :1.000   Median :0.0000  
 Mean   :0.3267   Mean   :1.04   Mean   :1.399   Mean   :0.7294  
 3rd Qu.:1.0000   3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :6.20   Max.   :2.000   Max.   :4.0000  
      thal           target      
 Min.   :0.000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:0.0000  
 Median :2.000   Median :1.0000  
 Mean   :2.314   Mean   :0.5446  
 3rd Qu.:3.000   3rd Qu.:1.0000  
 Max.   :3.000   Max.   :1.0000  
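Before moving on to modelling, it is worth confirming that the dataset has no missing values and checking how the target classes are balanced. The two lines below are a small additional check (not part of the original workflow) using only base R:

# Count missing values in each column of the dataset
colSums(is.na(data))
# Class balance of the target variable (0 = less chance, 1 = more chance)
table(data$target)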
print("Range of resting blood pressure")
max(data$trestbps) - min(data$trestbps)
quantile(data$trestbps, c(0.25, 0.5, 0.75))
print("Column names of the Data")
names(data)
print("Attributes of the Data")
str(data)
print("Number of Rows and Columns:")
dim(data)
[1] "Range of resting blood pressure"
[1] "Column names of the Data"
[1] "Attributes of the Data" 'data.frame': 303 obs. of 14 variables: $ age : int 63 37 41 56 57 57 56 44 52 57 ... $ sex : int 1 1 0 1 0 1 0 1 1 1 ... $ cp : int 3 2 1 1 0 0 1 1 2 2 ... $ trestbps: int 145 130 130 120 120 140 140 120 172 150 ... $ chol : int 233 250 204 236 354 192 294 263 199 168 ... $ fbs : int 1 0 0 0 0 0 0 0 1 0 ... $ restecg : int 0 1 0 1 1 1 0 1 1 1 ... $ thalach : int 150 187 172 178 163 148 153 173 162 174 ... $ exang : int 0 0 0 0 1 0 0 0 0 0 ... $ oldpeak : num 2.3 3.5 1.4 0.8 0.6 0.4 1.3 0 0.5 1.6 ... $ slope : int 0 0 2 2 2 1 1 2 2 2 ... $ ca : int 0 0 0 0 0 0 0 0 0 0 ... $ thal : int 1 2 2 2 2 1 2 3 3 2 ... $ target : int 1 1 1 1 1 1 1 1 1 1 ... [1] "Number of Rows and Columns:"
print("Correlation between the resting blood pressure and the age")
cor(data$trestbps, data$age, method = "pearson")
cor.test(data$trestbps, data$age, method = "pearson")
[1] "Correlation between the resting blood pressure and the age"
Pearson's product-moment correlation data: data$trestbps and data$age t = 5.0475, df = 301, p-value = 7.762e-07 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.1720897 0.3800657 sample estimates: cor 0.2793509
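The test indicates a modest positive correlation of about 0.28 between age and resting blood pressure. As an optional illustration (an addition, not part of the original output), the relationship can be visualised with a base-R scatter plot and a simple linear fit:

# Scatter plot of resting blood pressure against age, with a reference regression line
plot(data$age, data$trestbps,
     xlab = "Age", ylab = "Resting blood pressure (trestbps)",
     main = "Resting blood pressure vs. age")
abline(lm(trestbps ~ age, data = data), col = "red")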
print("Constructing the Logistic regression Model")
# Fit a logistic regression of target on trestbps, restecg and fbs and print the fitted model
glm(target ~ trestbps + restecg + fbs, data = data, family = binomial())
# Fit a second logistic regression on trestbps, chol and thalach and draw its diagnostic plots
model <- glm(target ~ trestbps + chol + thalach, data = data, family = binomial())
plot(model)
[1] "Constructing the Logistic regression Model"
Call:  glm(formula = target ~ trestbps + restecg + fbs, family = binomial(), 
    data = data)

Coefficients:
(Intercept)     trestbps      restecg          fbs  
    1.98546     -0.01566      0.48160      0.03337  

Degrees of Freedom: 302 Total (i.e. Null);  299 Residual
Null Deviance:      417.6 
Residual Deviance: 406.6     AIC: 414.6
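The fitted object model can also be used to generate class predictions. The lines below are an illustrative sketch rather than part of the original analysis; the 0.5 probability cutoff is an assumed choice:

# Predicted probabilities of target = 1 from the logistic regression
probs <- predict(model, newdata = data, type = "response")
# Convert probabilities to classes using an assumed 0.5 cutoff
pred_class <- ifelse(probs > 0.5, 1, 0)
# Simple confusion table against the observed labels
table(Predicted = pred_class, Actual = data$target)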
# Make dependent variable as a factor (categorical)
data$target = as.factor(data$target)
# Splitting the dataset into test and train
print("Train Test Split") # 70/30 Split
dt = sort(sample(nrow(data), nrow(data)*.7))
train<-data[dt,]
val<-data[-dt,]
[1] "Train Test Split"
# No. of rows in Train and Val Dataset
nrow(train)
nrow(val)
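Because sample() draws a random subset, the exact rows in train and val will differ between runs. If a reproducible split is wanted, a seed can be fixed before sampling; this snippet is a minor optional addition (the seed value 42 is arbitrary):

# Reproducible 70/30 split
set.seed(42)
dt = sort(sample(nrow(data), nrow(data)*.7))
train <- data[dt, ]   # 212 rows
val <- data[-dt, ]    # 91 rows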
print("Construction of the Decision Tree Model")
mtree <- rpart(
target ~ trestbps + chol + thalach,
data = train,
method="class",
control = rpart.control(
minsplit = 20,
minbucket = 7,
maxdepth = 10,
usesurrogate = 2,
xval =10
)
)
mtree
[1] "Construction of the Decision Tree Model"
n= 212 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 212 91 1 (0.42924528 0.57075472)  
    2) thalach< 147.5 81 27 0 (0.66666667 0.33333333)  
      4) chol>=275.5 21 2 0 (0.90476190 0.09523810) *
      5) chol< 275.5 60 25 0 (0.58333333 0.41666667)  
       10) trestbps>=115 48 17 0 (0.64583333 0.35416667)  
         20) chol< 225.5 24 6 0 (0.75000000 0.25000000) *
         21) chol>=225.5 24 11 0 (0.54166667 0.45833333)  
           42) chol>=244.5 16 5 0 (0.68750000 0.31250000) *
           43) chol< 244.5 8 2 1 (0.25000000 0.75000000) *
       11) trestbps< 115 12 4 1 (0.33333333 0.66666667) *
    3) thalach>=147.5 131 37 1 (0.28244275 0.71755725)  
      6) chol>=222.5 91 31 1 (0.34065934 0.65934066)  
       12) chol< 232.5 10 4 0 (0.60000000 0.40000000) *
       13) chol>=232.5 81 25 1 (0.30864198 0.69135802)  
         26) thalach< 174.5 68 24 1 (0.35294118 0.64705882)  
           52) chol< 301 49 20 1 (0.40816327 0.59183673)  
            104) chol>=273.5 15 5 0 (0.66666667 0.33333333) *
            105) chol< 273.5 34 10 1 (0.29411765 0.70588235) *
           53) chol>=301 19 4 1 (0.21052632 0.78947368) *
         27) thalach>=174.5 13 1 1 (0.07692308 0.92307692) *
      7) chol< 222.5 40 6 1 (0.15000000 0.85000000) *
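The control settings above allow a fairly deep tree. As an optional step (not in the original code), the cross-validated error stored by rpart can be inspected and the tree pruned at the complexity parameter with the lowest xerror:

# Complexity parameter table from the 10-fold cross-validation (xval = 10)
printcp(mtree)
# Prune at the cp value with the smallest cross-validated error
best_cp <- mtree$cptable[which.min(mtree$cptable[, "xerror"]), "CP"]
pruned_tree <- prune(mtree, cp = best_cp)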
# Plotting the Decision Tree for the dataset
print("Plotting the Decision Tree")
plot(mtree)
text(mtree)
par(xpd = NA, mar = rep(0.7, 4))
plot(mtree, compress = TRUE)
text(mtree, cex = 0.7, use.n = TRUE, fancy = FALSE, all = TRUE)
prp(mtree, faclen = 0,box.palette = "Reds", cex = 0.8, extra = 1)
[1] "Plotting the Decision Tree"
# Fit a random forest classifier on the full dataset (target is now a factor)
rf <- randomForest(target ~ trestbps + oldpeak + cp, data = data)
# View the forest results
print("Random Forest Results:")
print(rf)
[1] "Random Forest Results:" Call: randomForest(formula = target ~ trestbps + oldpeak + cp, data = data) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 1 OOB estimate of error rate: 25.41% Confusion matrix: 0 1 class.error 0 101 37 0.2681159 1 40 125 0.2424242
# Importance of each predictor
print("Importance of each predictor:")
print(importance(rf,type = 2))
# Plot the Random Forest
plot(rf)
[1] "Importance of each predictor:" MeanDecreaseGini trestbps 21.85269 oldpeak 35.36782 cp 36.12227