In this blog, our objective is to predict whether a patient has diabetes based on diagnostic measurements.
Diabetes is a common chronic disease and a serious threat to human health.
Its defining characteristic is blood glucose above the normal level, caused by defective insulin secretion, impaired biological effects of insulin, or both.
Diabetes can lead to chronic damage and dysfunction of various tissues, especially the eyes, kidneys, heart, blood vessels, and nerves. It can be divided into two categories: type 1 diabetes (T1D) and type 2 diabetes (T2D).
Patients with type 1 diabetes are usually younger, mostly under 30 years old. Typical clinical symptoms are increased thirst, frequent urination, and high blood glucose levels. This type of diabetes cannot be managed effectively with oral medications alone, and patients require insulin therapy.
Type 2 diabetes occurs more commonly in middle-aged and elderly people and is often associated with obesity, hypertension, dyslipidemia, arteriosclerosis, and other conditions.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. In this database there are nine columns:
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome: Class variable (0 or 1)
This dataset is taken from the UCI Machine Learning Repository, where it is available for download.
import pandas as pd
import numpy as np
# Load the dataset and preview the first rows
df = pd.read_csv("input/pima-indians-diabetes.csv")
df.head()
df.shape                    # number of rows and columns
df.info()                   # column dtypes and non-null counts
df.columns
df.describe()               # summary statistics for each column
df.isnull().values.any()    # is there any NaN anywhere in the dataset?
df.isnull().sum()           # NaN count per column
df.isnull().head()
df.dropna(axis=0, inplace=True)   # drop any rows containing NaN values
df.shape
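One caveat: isnull() finds nothing to drop here, but in this dataset missing measurements are commonly encoded as zeros rather than NaN (a Glucose, BloodPressure, or BMI of 0 is not physiologically plausible). A quick optional check, not part of the original analysis:
# Count zero values in columns where zero is not a plausible measurement
suspect_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((df[suspect_cols] == 0).sum())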
import seaborn as sns
import matplotlib.pyplot as plt
df['Outcome'].value_counts()
As per our dataset, 500 patients have no diabetes and 268 patients have diabetes.
plt.figure(figsize =(8, 6))
f = sns.countplot(x = 'Outcome', data = df)
f.set_title("Diabetic Patient Distribution")
f.set_xticklabels(['No', 'Yes'])
plt.xlabel("");
Correlation denotes the association between two quantitative variables, and the degree of association is measured by a correlation coefficient. A correlation matrix is a simple table that summarizes the correlations between all pairs of variables, giving us a basic understanding of the relationships among the features of the dataset.
df.corr()
plt.figure(figsize =(10, 6))
sns.heatmap(df.corr(),annot=True)
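To see which features are most associated with the target, one optional step beyond the heatmap is to sort the correlations with Outcome:
# Correlation of each feature with the target, strongest first
print(df.corr()['Outcome'].sort_values(ascending=False))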
When the Glucose value is higher than 110, patients are more likely to be diabetic. Similarly, when the Insulin value is higher than 150, patients are more likely to be diabetic.
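As a quick sanity check of these claims (an optional step, not part of the original analysis), we can compare the share of diabetic patients on either side of each threshold:
# Mean of Outcome = share of diabetic patients in each group
print(df.groupby(df['Glucose'] > 110)['Outcome'].mean())
print(df.groupby(df['Insulin'] > 150)['Outcome'].mean())
Now let us plot the glucose distribution.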
# Bin Glucose at the 110 threshold; the upper edge (200) covers the maximum glucose value,
# which the original bins from np.arange(45, 200, 65) would have left out as NaN
df['glucose_category'] = pd.cut(df['Glucose'], bins=[45, 110, 200])
df['glucose_category']
df.head()
count_of_positive_diabetes_diagnosed = df[df['Outcome'] == 1].groupby('glucose_category')['Glucose'].count()
count_of_positive_diabetes_diagnosed
count_of_positive_diabetes_diagnosed.plot(kind='bar')
plt.title('Glucose Distribution of Diabetic Patients')
fig = plt.figure(figsize=(10, 6))
sns.histplot(df['Glucose'], kde=True)   # histplot replaces the deprecated distplot
plt.show()
# Decade-wide age bins; the upper edge (90) covers the maximum age,
# which np.arange(20, 80, 10) would have left out as NaN
df['age_category'] = pd.cut(df['Age'], bins=list(np.arange(20, 100, 10)))
df['age_category']
count_of_positive_diabetes_diagnosed_by_age = df[df['Outcome'] == 1].groupby('age_category')['Age'].count()
count_of_positive_diabetes_diagnosed_by_age
count_of_positive_diabetes_diagnosed_by_age.plot(kind='bar')
plt.title('Age Distribution of Diabetic Patients')
fig = plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], kde=True)   # histplot replaces the deprecated distplot
plt.show()
fig = plt.figure(figsize=(10, 6))
sns.histplot(df['BMI'], kde=True)
plt.show()
df[df['Outcome'] == 1].hist(figsize=(20, 20))
plt.suptitle('Diabetic Patients')       # suptitle labels the whole grid, not just the last subplot
df[df['Outcome'] == 0].hist(figsize=(20, 20))
plt.suptitle('Non-Diabetic Patients')
In the Age distribution of non-diabetic people, the peak around 25 is very sharp and falls off quickly.
x = df.iloc[:, 0:8]   # the eight original feature columns (excludes the category columns added above)
y = df.iloc[:, 8]     # the Outcome target
x.head()
y.head()
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=42)
x_train.shape
y_train.shape
x_test.shape
y_test.shape
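Because the classes are imbalanced (500 vs. 268), one optional refinement, not applied in this notebook, is a stratified split so that both sets keep the same class ratio. A hedged sketch, using hypothetical variable names so the plain split above is left untouched:
# Hypothetical variant: stratify=y preserves the class ratio in both sets
xs_train, xs_test, ys_train, ys_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)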
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train = ss.fit_transform(x_train)   # fit the scaler on the training data only
x_test = ss.transform(x_test)         # reuse the training statistics; refitting on the test set would leak information
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)
print("Predicted value: ", predictions)
print("Actual value: ", y_test)
from sklearn.metrics import accuracy_score
print('Accuracy: ', accuracy_score(y_test, predictions))
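Accuracy alone can be misleading on an imbalanced dataset. As an optional extra check, not part of the original analysis, the confusion matrix and classification report break the errors down per class:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))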
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfcpredictions = rfc.predict(x_test)
print("Predicted value: ", rfcpredictions)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, rfcpredictions))
from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train, y_train)
svcpredictions = svc.predict(x_test)
print("Predicted value: ", svcpredictions)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, svcpredictions))
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(x_train, y_train)
knprediction = kn.predict(x_test)
print("Predicted value: ", knprediction)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, knprediction))
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(x_train, y_train)
dtcprediction = dtc.predict(x_test)
print("Predicted value: ", dtcprediction)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, dtcprediction))
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
gbcprediction = gbc.predict(x_test)
print("Predicted value: ", gbcprediction)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, gbcprediction))
print('Logistic Regression: ', accuracy_score(y_test, predictions))
print('Random Forest Classifier: ', accuracy_score(y_test, rfcpredictions))
print('Support Vector Classifier: ', accuracy_score(y_test, svcpredictions))
print('KNeighbors Classifier: ', accuracy_score(y_test, knprediction))
print('Decision Tree Classifier: ', accuracy_score(y_test, dtcprediction))
print('Gradient Boosting Classifier: ', accuracy_score(y_test, gbcprediction))
We have tried six different models. The accuracy scores of each model are listed above. According to these predictions, the Logistic Regression model has the highest accuracy score, at about 78%.
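These scores come from a single train/test split, so they can shift with a different random_state. A more robust comparison, sketched below as an optional step, is k-fold cross-validation on the full dataset, with scaling done inside a pipeline to avoid leakage:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# 5-fold cross-validated accuracy for the best-performing model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, x, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))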