In this blog, we will use data related to the marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether a client will subscribe to a term deposit (variable y). We will try different types of models and see which one gives the highest accuracy.
You can download the dataset from the given source.
The dataset is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010), very close to the data analyzed in [Moro et al., 2014]
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).
The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).
Attribute Information:
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
import pandas as pd
inputFile = "input/bank.csv"
df = pd.read_csv(inputFile, sep=';')
df.head()
We have successfully loaded data into memory.
df.shape
We have 4521 rows and 17 columns in our banking dataset.
df.dtypes
We can see that some columns have the object type, so we will have to convert them into numerical data before we split the data into features and target.
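To list the non-numeric columns programmatically rather than reading them off the df.dtypes output, something like the following could be used. This is only a sketch: the small `sample` frame and its values are made up for illustration and stand in for df.

```python
import pandas as pd

# Made-up stand-in for df, with a mix of numeric and text columns.
sample = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["services", "management", "retired"],
    "housing": ["yes", "no", "yes"],
    "balance": [1787, 4789, 0],
})

# Keep every column that pandas did not parse as a number.
text_cols = sample.select_dtypes(exclude="number").columns.tolist()

# Number of distinct values per text column, to see how many categories each has.
n_categories = sample[text_cols].nunique()
print(text_cols)
print(n_categories)
```

Running the same two lines on df would give the list of columns that still need encoding.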
df.describe()
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
The job column holds the type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown'). Let us see the count of each type of job.
job_count = df['job'].value_counts()
job_count
plt.figure(figsize = (8, 5))
job_count.plot(kind = "bar")
plt.title("Type of Job Distribution")
The default column says whether the client has credit in default or not. It has categorical values: 'no','yes','unknown'.
default_count = df['default'].value_counts()
default_count
plt.figure(figsize = (8, 5))
default_count.plot(kind='bar').set(title='Default Column Distribution')
marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
marital_count = df['marital'].value_counts()
marital_count
plt.figure(figsize = (8, 5))
marital_count.plot(kind = "bar").set(title = "Marital Distribution")
loan_count = df['loan'].value_counts()
loan_count
plt.figure(figsize = (8, 5))
loan_count.plot(kind = "bar").set(title = "Loan Distribution")
According to the data, some clients have taken a personal loan.
housing_count = df['housing'].value_counts()
housing_count
plt.figure(figsize = (8, 5))
housing_count.plot(kind = "bar").set(title = "Housing Loan Distribution")
Most of the clients have taken a housing loan.
education_count = df['education'].value_counts()
education_count
plt.figure(figsize = (8, 5))
education_count.plot(kind = "bar").set(title = "Education Column Distribution")
The contact column says whether clients were contacted by cellular or telephone.
contact_count = df['contact'].value_counts()
contact_count
plt.figure(figsize = (8, 5))
contact_count.plot(kind = "bar").set(title = "Contact Column Distribution")
month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
month_count = df['month'].value_counts()
month_count
plt.figure(figsize = (8, 5))
month_count.plot(kind = "bar").set(title = "Month Data Distribution")
plt.figure(figsize = (8, 5))
df['pdays'].hist(bins = 50)
plt.figure(figsize = (8, 5))
df[df['pdays'] > 0]['pdays'].hist(bins=50)
target_count = df['y'].value_counts()
target_count
plt.figure(figsize = (8, 5))
target_count.plot(kind = "bar").set(title = "Target Distribution")
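Since we will compare models by accuracy later, it can help to look at the target as proportions as well as counts: a model that always predicts the most common class already scores that class's share. The sketch below uses a made-up label series `y_sample`, not the real counts from bank.csv.

```python
import pandas as pd

# Made-up labels for illustration: 8 'no' and 2 'yes'.
y_sample = pd.Series(["no"] * 8 + ["yes"] * 2, name="y")

# normalize=True returns the share of each class instead of the raw count.
shares = y_sample.value_counts(normalize=True)

# Accuracy of a model that always predicts the most common class.
majority_share = shares.max()
print(shares)
print(f"Always predicting '{shares.idxmax()}' scores {majority_share:.0%}")
```

On the real data, df['y'].value_counts(normalize=True) gives the corresponding shares.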
df[df['y'] == 'yes'].hist(figsize = (20,20))
plt.suptitle('Client has subscribed to a term deposit')
df[df['y'] == 'no'].hist(figsize = (20,20))
plt.suptitle('Client has not subscribed to a term deposit')
df.head(10)
We can see there are some binary columns (default, housing, loan) of object type, which we need to convert into numeric values.
There are also categorical columns with a limited number of choices: job, marital, education, contact, month, and poutcome. These also need to be converted into a numerical format.
All feature columns must be numeric before we can feed them into the model.
For the default column, we can convert the yes values to 1 and the no values to 0. We will use a lambda function for this.
df['is_default'] = df['default'].apply(lambda row: 1 if row == 'yes' else 0)
df[['default','is_default']].tail(10)
We will do the same for the housing and loan columns, and for the target column y.
df['is_housing'] = df['housing'].apply(lambda row: 1 if row == 'yes' else 0)
df[['housing','is_housing']].tail(10)
df['is_loan'] = df['loan'].apply(lambda row: 1 if row == 'yes' else 0)
df[['loan', 'is_loan']].tail(10)
df['target'] = df['y'].apply(lambda row: 1 if row == 'yes' else 0)
df[['y', 'target']].tail(10)
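The same yes/no conversion can also be written without apply, by comparing the whole column at once and casting the boolean result to integers. This is an alternative sketch, not a change to the steps above; the `sample` frame is made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the three binary columns of df.
sample = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "housing": ["yes", "yes", "no"],
    "loan": ["no", "no", "yes"],
})

# (column == 'yes') gives True/False per row; astype(int) turns that into 1/0.
for col in ["default", "housing", "loan"]:
    sample[f"is_{col}"] = (sample[col] == "yes").astype(int)

print(sample)
```

As with the lambda version, any value other than 'yes' becomes 0.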
For the marital column, we have three values: married, single and divorced. We will use pandas' get_dummies function to convert the categorical variable into dummy/indicator variables.
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
Convert categorical variable into dummy/indicator variables.
Parameters
data: array-like, Series, or DataFrame
Data of which to get dummy indicators.
prefix: str, list of str, or dict of str, default None
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.
prefix_sep: str, default ‘_’
If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.
dummy_na: bool, default False
Add a column to indicate NaNs, if False NaNs are ignored.
columns: list-like, default None
Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.
sparse: bool, default False
Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).
drop_first: bool, default False
Whether to get k-1 dummies out of k categorical levels by removing the first level.
dtype: dtype, default bool (np.uint8 in pandas versions before 2.0)
Data type for new columns. Only a single dtype is allowed.
Returns
DataFrame
Dummy-coded data.
marital_dummies = pd.get_dummies(df['marital'], prefix = 'marital')
marital_dummies.tail()
pd.concat([df['marital'], marital_dummies], axis=1).head(n=10)
We can see that each row has exactly one value of 1, in the column corresponding to the value in the marital column.
Because there are three values, one dummy column is redundant: if two of the dummy columns are 0 for a particular row, the remaining column must be 1. It is important to eliminate such redundancy and correlation between features, because otherwise it becomes difficult to determine which feature matters most in minimizing the total error.
So let us remove one column, marital_divorced.
marital_dummies.drop('marital_divorced', axis=1, inplace=True)
marital_dummies.head()
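The drop_first parameter listed above can do the same removal in one step. It drops the first level in sorted order, which for these three marital values is 'divorced'. A minimal sketch on a made-up series (dtype=int is passed only so the output shows 0/1 regardless of pandas version):

```python
import pandas as pd

# Made-up marital values for illustration.
marital = pd.Series(["married", "single", "divorced", "married"], name="marital")

# drop_first=True keeps k-1 dummy columns; here 'marital_divorced' is dropped.
dummies = pd.get_dummies(marital, prefix="marital", drop_first=True, dtype=int)
print(dummies)
```

A divorced client then appears as a row where both remaining columns are 0. Note that drop_first always removes the first level, so it would not reproduce the later steps that drop the 'unknown' column by name.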
df = pd.concat([df, marital_dummies], axis=1)
df.head()
job_dummies = pd.get_dummies(df['job'], prefix = 'job')
job_dummies.tail()
job_dummies.drop('job_unknown', axis=1, inplace=True)
df = pd.concat([df, job_dummies], axis=1)
df.head()
education_dummies = pd.get_dummies(df['education'], prefix = 'education')
education_dummies.tail()
education_dummies.drop('education_unknown', axis=1, inplace=True)
education_dummies.tail()
df = pd.concat([df, education_dummies], axis=1)
df.head()
contact_dummies = pd.get_dummies(df['contact'], prefix = 'contact')
contact_dummies.tail()
contact_dummies.drop('contact_unknown', axis=1, inplace=True)
contact_dummies.tail()
df = pd.concat([df, contact_dummies], axis=1)
df.head()
poutcome_dummies = pd.get_dummies(df['poutcome'], prefix = 'poutcome')
poutcome_dummies.tail()
poutcome_dummies.drop('poutcome_unknown', axis=1, inplace=True)
poutcome_dummies.tail()
df = pd.concat([df, poutcome_dummies], axis=1)
df.head()
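The encode, drop and concat pattern repeated for job, education, contact and poutcome could also be collapsed using the columns parameter of get_dummies, which encodes several columns of a DataFrame in one call and replaces the originals. The sketch below is an alternative on a made-up `sample` frame, not what the steps above do.

```python
import pandas as pd

# Made-up stand-in for df with two categorical columns.
sample = pd.DataFrame({
    "age": [30, 45, 52],
    "job": ["services", "unknown", "retired"],
    "contact": ["cellular", "unknown", "telephone"],
})

# Encode both columns at once; the original 'job' and 'contact' columns are removed.
encoded = pd.get_dummies(sample, columns=["job", "contact"], dtype=int)

# Drop the 'unknown' indicator of each column, as in the steps above.
encoded = encoded.drop(columns=["job_unknown", "contact_unknown"])
print(encoded)
```

With this form the original text columns do not need to be dropped separately afterwards.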
months = {'jan':1, 'feb':2, 'mar':3, 'apr':4, 'may':5, 'jun':6, 'jul':7, 'aug':8, 'sep':9, 'oct':10, 'nov':11, 'dec': 12}
df['month'] = df['month'].map(months)
df['month'].head()
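One thing to watch with map and a dictionary: any value that is not a key in the dictionary silently becomes NaN. A quick check like the one below can catch that. The `raw` series, including the deliberately misspelled 'Sept', is made up for illustration.

```python
import pandas as pd

months = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
          'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# Made-up month values; 'Sept' is not a key in the dictionary.
raw = pd.Series(["may", "jun", "Sept"])
mapped = raw.map(months)

# Original values that did not match any key (they are NaN after mapping).
unmapped = raw[mapped.isna()].unique().tolist()
print(unmapped)
```

An empty list means every month was converted.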
The 'pdays' column indicates the number of days that passed after the client was last contacted in a previous campaign. In bank.csv, a value of -1 means the client was not previously contacted (the 999 code in the attribute list above applies to the bank-additional files). We will create a 'was_contacted' column that is 0 when 'pdays' is -1 and 1 otherwise.
df[df['pdays'] == -1]['pdays'].count()
df['was_contacted'] = df['pdays'].apply(lambda row: 0 if row == -1 else 1)
df[['pdays','was_contacted']].head()
df.drop(['job', 'education', 'marital', 'default', 'housing', 'loan', 'contact', 'pdays', 'poutcome', 'y'], axis=1, inplace=True)
df.dtypes
df.head(10)
# The axis=1 argument drops a column rather than a row
X = df.drop('target', axis=1)
y = df['target']
X.shape
y.shape
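The attribute notes above say that duration is not known before a call is made and should only be included for benchmark purposes. A realistic feature set would therefore leave it out. The sketch below shows both variants side by side on a made-up `sample` frame; the names `X_benchmark` and `X_realistic` are hypothetical.

```python
import pandas as pd

# Made-up stand-in for the prepared df.
sample = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "duration": [79, 220, 0, 185],
    "campaign": [1, 1, 4, 2],
    "target": [0, 1, 0, 0],
})

# Benchmark features: everything except the target (what the steps above use).
X_benchmark = sample.drop(columns=["target"])

# Realistic features: also drop 'duration', which is only known after the call.
X_realistic = sample.drop(columns=["target", "duration"])
y_sample = sample["target"]

print(X_benchmark.columns.tolist())
print(X_realistic.columns.tolist())
```

Accuracy measured with duration included should be read as a benchmark figure rather than what a model could achieve before the call.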
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 32)
X_train.shape
y_train.shape
X_test.shape
y_test.shape
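When one class is much rarer than the other, a purely random split can leave the train and test sets with different class shares. train_test_split accepts a stratify argument that keeps the shares the same in both parts. A minimal sketch on made-up data with 10% positives:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 10 of them positive.
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series([1] * 10 + [0] * 90)

# stratify=y_demo keeps the 10% positive share in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=32, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Applied to our data, this would mean adding stratify = y to the call above.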
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred[:10]
print("Predicted value: ", y_pred[:10])
print("Actual value: ", y_test[:10])
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_pred = y_pred, y_true = y_test)
print(f'Accuracy of the model Logistic Regression is {accuracy*100:.2f}%')
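With features on very different scales, LogisticRegression with default settings may stop before converging and emit a ConvergenceWarning. One common remedy is to standardize the features in a pipeline and raise max_iter. The sketch below uses synthetic data from make_classification, not the bank data, and the value max_iter=1000 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is blown up to mimic a column on a much larger scale.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
X_demo[:, 0] *= 1000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=32
)

# The scaler is fitted on the training data only, then applied before the model.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_tr, y_tr)

scaled_accuracy = scaled_model.score(X_te, y_te)
print(f"Accuracy with scaling: {scaled_accuracy * 100:.2f}%")
```

The pipeline can be used exactly like the plain model: fit on X_train and y_train, then predict on X_test.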
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfcpredictions = rfc.predict(X_test)
print("Predicted value: ", rfcpredictions[:10])
print("Actual value: ", y_test[:10])
accuracy = accuracy_score(y_pred = rfcpredictions, y_true = y_test)
print(f'Accuracy of the Random Forest Classifier model is {accuracy*100:.2f}%')
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
svcpredictions = svc.predict(X_test)
print("Predicted value: ", svcpredictions[:10])
print("Actual value: ", y_test[:10])
accuracy = accuracy_score(y_pred = svcpredictions, y_true = y_test)
print(f'Accuracy of the SVC model is {accuracy*100:.2f}%')
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(X_train, y_train)
dtcprediction = dtc.predict(X_test)
print("Predicted value: ", dtcprediction[:10])
print("Actual value: ", y_test[:10])
accuracy = accuracy_score(y_pred = dtcprediction, y_true = y_test)
print(f'Accuracy of the Decision Tree Classifier model is {accuracy*100:.2f}%')
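The four fit, predict and score blocks above follow the same pattern, so they could also be written as one loop over a dictionary of models. The sketch below runs on synthetic data from make_classification, so its numbers say nothing about the bank dataset. The fixed random_state values are an assumption added here for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=32
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVC": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_true=y_te, y_pred=clf.predict(X_te))
    print(f"Accuracy of {name} is {scores[name] * 100:.2f}%")

best_name = max(scores, key=scores.get)
print("Highest accuracy:", best_name)
```

Keeping all scores in one dictionary makes the final comparison a single line.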
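In this run, the Random Forest Classifier gives the highest accuracy, 91.49%. Because RandomForestClassifier() was created without a random_state, the exact figure can change slightly from run to run.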
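Accuracy alone can look good when one class dominates, so it may be worth checking the confusion matrix and recall alongside it. The toy labels below are made up to show the effect: a model that never predicts 1 still reaches 90% accuracy while finding none of the positive cases.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up labels: 9 negatives, 1 positive, and a model that always predicts 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)     # share of actual positives that were found
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class

print(f"Accuracy: {acc:.2f}, recall: {rec:.2f}")
print(cm)
```

The same three functions can be called with y_test and any of the prediction arrays from the models above.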