In this blog, our objective is to predict whether a patient has diabetes based on diagnostic measurements.
Diabetes is a common chronic disease and a serious threat to human health.
Its defining characteristic is blood glucose above the normal level, caused by defective insulin secretion, impaired biological effects of insulin, or both.
Diabetes can lead to chronic damage and dysfunction of various tissues, especially the eyes, kidneys, heart, blood vessels, and nerves. It can be divided into two categories: type 1 diabetes (T1D) and type 2 diabetes (T2D).
Patients with type 1 diabetes are usually younger, mostly under 30 years old. Typical clinical symptoms are increased thirst, frequent urination, and high blood glucose levels. This type of diabetes cannot be managed effectively with oral medications alone, and patients require insulin therapy.
Type 2 diabetes occurs more commonly in middle-aged and elderly people and is often associated with obesity, hypertension, dyslipidemia, arteriosclerosis, and other conditions.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage. In this database there are nine columns:
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)
9. Outcome: Class variable (0 or 1)
This dataset is taken from the UCI Machine Learning Repository, where it is available for download.
import pandas as pd
import numpy as np
# Load the dataset and preview the first rows
df = pd.read_csv("input/pima-indians-diabetes.csv")
df.head()
df.shape                    # number of rows and columns
df.info()                   # column dtypes and non-null counts
df.columns
df.describe()               # summary statistics for each column
df.isnull().values.any()    # is there any NaN anywhere in the dataset?
df.isnull().sum()           # NaN count per column
df.isnull().head()
df.dropna(axis=0, inplace=True)   # drop any rows containing NaN values
df.shape
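One caveat: isnull() finds nothing to drop here, but in this dataset missing measurements are commonly encoded as zeros rather than NaN (a Glucose, BloodPressure, or BMI of 0 is not physiologically plausible). A quick optional check, not part of the original analysis:
# Count zero values in columns where zero is not a plausible measurement
suspect_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
print((df[suspect_cols] == 0).sum())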
import seaborn as sns
import matplotlib.pyplot as plt
df['Outcome'].value_counts()
As per our dataset, 500 patients have no diabetes and 268 patients have diabetes.
plt.figure(figsize =(8, 6))
f = sns.countplot(x = 'Outcome', data = df)
f.set_title("Diabetic Patient Distribution")
f.set_xticklabels(['No', 'Yes'])
plt.xlabel("");
Correlation denotes the association between two quantitative variables, and the degree of association is measured by a correlation coefficient. A correlation matrix is a simple table that summarizes the correlations between all pairs of variables, giving us a basic understanding of the relationships among the features of the dataset.
df.corr()
plt.figure(figsize =(10, 6))
sns.heatmap(df.corr(),annot=True)
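To see which features are most associated with the target, one optional step beyond the heatmap is to sort the correlations with Outcome:
# Correlation of each feature with the target, strongest first
print(df.corr()['Outcome'].sort_values(ascending=False))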
When the Glucose value is higher than 110, patients are more likely to be diabetic. Similarly, when the Insulin value is higher than 150, patients are more likely to be diabetic.
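As a quick sanity check of these claims (an optional step, not part of the original analysis), we can compare the share of diabetic patients on either side of each threshold:
# Mean of Outcome = share of diabetic patients in each group
print(df.groupby(df['Glucose'] > 110)['Outcome'].mean())
print(df.groupby(df['Insulin'] > 150)['Outcome'].mean())
Now let us plot the glucose distribution.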
# Bin Glucose at the 110 threshold; the upper edge (200) covers the maximum glucose value,
# which the original bins from np.arange(45, 200, 65) would have left out as NaN
df['glucose_category'] = pd.cut(df['Glucose'], bins=[45, 110, 200])
df['glucose_category']
df.head()
count_of_positive_diabetes_diagnosed = df[df['Outcome'] == 1].groupby('glucose_category')['Glucose'].count()
count_of_positive_diabetes_diagnosed
count_of_positive_diabetes_diagnosed.plot(kind='bar')
plt.title('Glucose Distribution of Diabetic Patients')
fig = plt.figure(figsize=(10, 6))
sns.histplot(df['Glucose'], kde=True)   # histplot replaces the deprecated distplot
plt.show()
# Decade-wide age bins; the upper edge (90) covers the maximum age,
# which np.arange(20, 80, 10) would have left out as NaN
df['age_category'] = pd.cut(df['Age'], bins=list(np.arange(20, 100, 10)))
df['age_category']
count_of_positive_diabetes_diagnosed_by_age = df[df['Outcome'] == 1].groupby('age_category')['Age'].count()
count_of_positive_diabetes_diagnosed_by_age
count_of_positive_diabetes_diagnosed_by_age.plot(kind='bar')
plt.title('Age Distribution of Diabetic Patients')
fig = plt.figure(figsize=(10, 6))
sns.histplot(df['Age'], kde=True)   # histplot replaces the deprecated distplot
plt.show()
fig = plt.figure(figsize=(10, 6))
sns.histplot(df['BMI'], kde=True)
plt.show()
df[df['Outcome'] == 1].hist(figsize=(20, 20))
plt.suptitle('Diabetic Patients')       # suptitle labels the whole grid, not just the last subplot
df[df['Outcome'] == 0].hist(figsize=(20, 20))
plt.suptitle('Non-Diabetic Patients')
In the Age distribution of non-diabetic people, the peak around 25 is very sharp and falls off quickly.
x = df.iloc[:, 0:8]   # the eight original feature columns (excludes the category columns added above)
y = df.iloc[:, 8]     # the Outcome target
x.head()
y.head()
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.2, random_state=42)
x_train.shape
y_train.shape
x_test.shape
y_test.shape
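Because the classes are imbalanced (500 vs. 268), one optional refinement, not applied in this notebook, is a stratified split so that both sets keep the same class ratio. A hedged sketch, using hypothetical variable names so the plain split above is left untouched:
# Hypothetical variant: stratify=y preserves the class ratio in both sets
xs_train, xs_test, ys_train, ys_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)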
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
x_train = ss.fit_transform(x_train)   # fit the scaler on the training data only
x_test = ss.transform(x_test)         # reuse the training statistics; refitting on the test set would leak information
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train, y_train)
predictions = lr.predict(x_test)
print("Predicted value: ", predictions)
print("Actual value: ", y_test)
from sklearn.metrics import accuracy_score
print('Accuracy: ', accuracy_score(y_test, predictions))
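Accuracy alone can be misleading on an imbalanced dataset. As an optional extra check, not part of the original analysis, the confusion matrix and classification report break the errors down per class:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))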
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)
rfcpredictions = rfc.predict(x_test)
print("Predicted value: ", rfcpredictions)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, rfcpredictions))
from sklearn.svm import SVC
svc = SVC()
svc.fit(x_train, y_train)
svcpredictions = svc.predict(x_test)
print("Predicted value: ", svcpredictions)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, svcpredictions))
from sklearn.neighbors import KNeighborsClassifier
kn = KNeighborsClassifier()
kn.fit(x_train, y_train)
knprediction = kn.predict(x_test)
print("Predicted value: ", knprediction)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, knprediction))
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(x_train, y_train)
dtcprediction = dtc.predict(x_test)
print("Predicted value: ", dtcprediction)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, dtcprediction))
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(x_train, y_train)
gbcprediction = gbc.predict(x_test)
print("Predicted value: ", gbcprediction)
print("Actual value: ", y_test)
print('Accuracy: ', accuracy_score(y_test, gbcprediction))
print('Logistic Regression: ', accuracy_score(y_test, predictions))
print('Random Forest Classifier: ', accuracy_score(y_test, rfcpredictions))
print('Support Vector Classifier: ', accuracy_score(y_test, svcpredictions))
print('KNeighbors Classifier: ', accuracy_score(y_test, knprediction))
print('Decision Tree Classifier: ', accuracy_score(y_test, dtcprediction))
print('Gradient Boosting Classifier: ', accuracy_score(y_test, gbcprediction))
We have tried six different models. The accuracy scores of each model are listed above. According to these predictions, the Logistic Regression model has the highest accuracy score, at about 78%.
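These scores come from a single train/test split, so they can shift with a different random_state. A more robust comparison, sketched below as an optional step, is k-fold cross-validation on the full dataset, with scaling done inside a pipeline to avoid leakage:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# 5-fold cross-validated accuracy for the best-performing model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, x, y, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))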