Bank Marketing Analysis





Introduction:

Nowadays, marketing spending in the banking industry is massive, which means it is essential for banks to optimize their marketing strategies and improve their effectiveness. Understanding customers' needs leads to more effective marketing plans, smarter product designs and greater customer satisfaction.

MAIN OBJECTIVE :

The main objective is to improve the effectiveness of the bank's telemarketing campaign. This project will enable the bank to develop a better understanding of its customer base, predict customers' responses to its telemarketing campaign and establish a target customer profile for future marketing plans.

ABSTRACT :

The data in this project is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs.
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).
The smaller datasets are provided to test the more computationally demanding machine learning algorithms (e.g., SVM).


ML FORMULATION:

This “Bank Marketing Dataset” comes from real telephone marketing data (phone calls). It has 20 input variables, such as age, job, marital status and education, and one output variable y, which indicates whether the customer subscribed to the term deposit or not (yes/no). Predicting y is our main goal.

DATASET COLUMN ANALYSIS:

Input variables:
# bank client data:

  1. Age: customer’s age (numeric).
  2. Job: customer’s occupation/type of job, such as entrepreneur, housemaid, retired, student, technician and so on.
  3. Marital: marital status, one of “single, married, divorced, unknown”.
  4. Education: categorized into “basic.4y, basic.6y, basic.9y, high.school, illiterate, professional.course, university.degree, unknown”.
  5. Default: whether the customer has credit in default. Default is the failure to repay a debt, including interest or principal, on a loan or security. Categorized as “yes, no, unknown”.
  6. Housing: does the customer have a housing loan? Categorized as “yes, no, unknown”.
  7. Loan: does the customer have a personal loan? Categorized as “yes, no, unknown”.

# Last contact of the current campaign

  1. Contact: type of communication used for the contact, either “cellular” or “telephone”.
  2. Month: last contact month of the year (“jan”, “feb”, …, “dec”).
  3. Day_of_week: last contact day of the week (“mon”, “tue”, “wed”, “thu”, “fri”).
  4. Duration: last contact duration in seconds. If duration = 0 then y = “no”.

# Other attributes

  1. Campaign: the number of contacts performed during this campaign for this client (numeric, including the last contact).
  2. Pdays: the number of days that passed after the client was last contacted in a previous campaign (numeric; 999 means the client was not previously contacted).
  3. Previous: the number of contacts performed before this campaign for this client (numeric).
  4. Poutcome: the outcome of the previous marketing campaign, categorized as “failure, nonexistent, success”.

# Social and economic context attributes

  1. Emp.var.rate: employment variation rate, a quarterly indicator of how the level of employment changes over time.
  2. Cons.price.idx: consumer price index (monthly indicator). This is a measure used to estimate price changes in a basket of products and services representative of consumption expenditure in an economy.
  3. Cons.conf.idx: consumer confidence index (monthly indicator). It is an economic indicator measuring how optimistic consumers are about the economy, as expressed through their spending and saving behaviour.
  4. Euribor3m: euribor 3-month rate (daily indicator). Euribor is short for Euro Interbank Offered Rate; the Euribor rates are based on the interest rates at which a panel of European banks lend funds to one another.
  5. Nr.employed: number of employees (quarterly indicator).

# Output variable:-

  1. y:-It shows whether the client subscribed to the term deposit or not. It is categorized into (Yes/No).

TYPE OF ML PROBLEM:

It is a binary classification problem with two classes, yes and no: yes means the customer subscribed to the term deposit, and no means the customer did not subscribe.

=> First we load the data: using the pandas library we read the data file we want.

=> Then we can see the shape of the data in terms of rows and columns. Using the columns attribute we can see which columns are present in the table. The describe() method calculates statistics such as the percentiles, mean and standard deviation of the numerical columns.

=> The info() method shows information about a dataframe, including the index dtype and column dtypes, non-null counts and memory usage.

=> We do not have null values, but the features ‘duration’, ‘contact’, ‘month’, ‘day_of_week’, ‘default’ and ‘pdays’ are not very useful, so it is better to drop them.
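A minimal sketch of these loading, inspection and dropping steps with pandas, assuming the file bank-additional-full.csv is in the working directory (the UCI files are semicolon-separated):

```python
import pandas as pd

# Load the full dataset (the UCI files use ';' as the separator)
df = pd.read_csv("bank-additional-full.csv", sep=";")

print(df.shape)       # (number of rows, number of columns)
print(df.columns)     # names of all columns
print(df.describe())  # percentiles, mean, std of the numeric columns
df.info()             # dtypes, non-null counts, memory usage

# Drop the features judged not useful above
df = df.drop(columns=["duration", "contact", "month",
                      "day_of_week", "default", "pdays"])
print(df.shape)
```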

PERFORMANCE METRIC:

=> We use AUC as the performance metric because this is an imbalanced dataset.

NOTE: if the dataset is balanced we can use accuracy; otherwise we should use AUC.
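For reference, a small sketch of how AUC can be computed with scikit-learn's roc_auc_score; the labels and scores below are toy values, and in the project the scores would come from a fitted model's predict_proba output:

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities for the positive class;
# in the project, y_score would be model.predict_proba(X_test)[:, 1]
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
print("AUC:", roc_auc_score(y_true, y_score))
```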

UNIVARIATE ANALYSIS OF CATEGORICAL VARIABLES:

=> Job, marital, housing, poutcome, education, loan, contact, month and default are the categorical variables.

=> Plots such as PDFs, CDFs, box plots and violin plots are used for univariate analysis.

UNIVARIATE ANALYSIS OF NUMERICAL VARIABLES:

=> Age, campaign, previous, emp.var.rate, euribor3m, cons.price.idx, nr.employed, pdays and duration are the numerical variables.
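A possible way to produce such univariate plots with seaborn, assuming df is the dataframe loaded earlier and still contains the columns job, age and y:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Categorical variable: counts of each category, split by the target y
sns.countplot(data=df, x="job", hue="y")
plt.xticks(rotation=45)
plt.show()

# Numerical variable: distribution and box plot, split by the target y
sns.histplot(data=df, x="age", hue="y", kde=True)
plt.show()
sns.boxplot(data=df, x="y", y="age")
plt.show()
```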

CORRELATION MATRIX:

We know that if you have a dataset with many columns, a good way to quickly check correlations among the columns is to visualize the correlation matrix as a heatmap.
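A short sketch of such a heatmap with seaborn, assuming df is the dataframe from the earlier steps:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, visualized as a heatmap
corr = df.select_dtypes(include="number").corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of numeric features")
plt.show()
```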

FEATURE IMPORTANCE:

=>Feature Importance refers to techniques that assign a score to input features based on how useful they are at predicting a target variable.
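One possible way to obtain such scores is the feature_importances_ attribute of a tree-based model; this sketch assumes X_train and y_train are the encoded features and 0/1 target produced in the later preprocessing and splitting steps:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Fit a tree ensemble and read off its importance scores
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)  # X_train, y_train assumed from the encoding/split steps

importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```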

DATA PREPROCESSING:

Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and most crucial step when creating a machine learning model.

When creating a machine learning project, we do not always come across clean and well-formatted data. Before doing any operation with the data, it is necessary to clean it and put it into a formatted shape, and that is what the data preprocessing step is for.

ONE HOT ENCODING:-

One-hot encoding allows the representation of categorical data to be more expressive. Many machine learning algorithms cannot work with categorical data directly; the categories must first be converted into numbers. This is required for both input and output variables that are categorical. After one-hot encoding, the data is ready for any model.
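A minimal sketch of one-hot encoding with pandas get_dummies; the column list here is illustrative and assumes df is the dataframe from the earlier steps:

```python
import pandas as pd

# Illustrative list of categorical columns still present after dropping features
categorical_cols = ["job", "marital", "education", "housing", "loan", "poutcome"]

# One 0/1 indicator column per category value
df_encoded = pd.get_dummies(df, columns=categorical_cols)

# Map the target to 0/1 as well
df_encoded["y"] = df_encoded["y"].map({"no": 0, "yes": 1})
print(df_encoded.shape)
```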

MACHINE LEARNING ALGORITHMS:-

1) KNN Brute Force Algorithm:

Exact K-Nearest Neighbour (K-NN) search can be done using a brute-force approach. Brute-force search is a very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem’s statement:
1. Compute all the distances between the query point and all points in the dataset.
2. Sort the computed distances.
3. Select the k neighbouring points with the smallest distances.
4. Use majority voting to classify the query point.
5. Repeat this for all query points.
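A sketch of brute-force KNN with scikit-learn (algorithm="brute" enumerates all distances exactly); X_train, y_train, X_test and y_test are assumed to come from the preprocessing and splitting steps, and k=15 is an illustrative choice that should be tuned by cross-validation:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# algorithm="brute" computes every query-to-point distance exactly
knn = KNeighborsClassifier(n_neighbors=15, algorithm="brute")
knn.fit(X_train, y_train)

y_score = knn.predict_proba(X_test)[:, 1]
print("KNN (brute force) AUC:", roc_auc_score(y_test, y_score))
```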

2) KNN kd-Tree:

KD-trees are a specific data structure for efficiently representing our data. In particular, a KD-tree helps organize and partition the data points based on specific conditions.

Let’s say we have a data set with 2 input features. We can represent our data as:

Now, we’re going to be making some axis aligned cuts, and maintaining lists of points that fall into each one of these different bins.

And what this structure allows us to do as we’re going to show, is efficiently prune our search space so that we don’t have to visit every single data point.

Now the question arises of how to draw these cuts?

1. One option is to split at the median value of the observations that are contained in the box.

2. You could also split at the centre point of the box, ignoring the spread of data within the box

Then a question is when do you stop?

There are a couple of choices that we have.

1. One is you can stop if there are fewer than a given number of points in the box. Let’s say m data points left.

2. Or if a minimum width to the box has been achieved.

So again, the second criterion ignores the actual data in the box, whereas the first one uses facts about the data to drive the stopping criterion. We can use the same distance metrics (“Euclidean distance” and “Hamming distance”) that we used while implementing KNN.
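The same KNN classifier from the previous section can be asked to use a KD-tree instead of brute force; this is a sketch under the same assumptions as the brute-force example, with leaf_size playing the role of the "fewer than m points in the box" stopping rule:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# algorithm="kd_tree" builds the axis-aligned partitioning described above;
# leaf_size is the "fewer than m points in the box" stopping rule
knn_kd = KNeighborsClassifier(n_neighbors=15, algorithm="kd_tree", leaf_size=30)
knn_kd.fit(X_train, y_train)

y_score = knn_kd.predict_proba(X_test)[:, 1]
print("KNN (kd-tree) AUC:", roc_auc_score(y_test, y_score))
```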

Intuitive Explanation of KD Trees:-

Suppose we have a data set with only two features.

Data point0: X = 0.54, Y = 0.93

Data point1: X = 0.96, Y = 0.86

Data point2: X = 0.42, Y = 0.67

Data point3: X = 0.11, Y = 0.53

Data point4: X = 0.64, Y = 0.29

Data point5: X = 0.27, Y = 0.75

Data point6: X = 0.81, Y = 0.63

Let’s split data into two groups.

We do it by comparing x with mean of max and min value.

Value = (Max + Min)/2

= (0.96 + 0.11)/2

= 0.53

At each node we will save 3 things.

· Dimension we split on

· Value we split on

· Tightest bounding box which contains all the points within that node.

Tight bounds: Node1, Node2

Node1: 0.11 <= X <= 0.42, 0.53 <= Y <= 0.75

Node2:0.54 <= X <= 0.96, 0.29 <= Y <= 0.93

We keep dividing the structure into more parts, splitting on alternate dimensions, until at most 2 data points are left in a node.

So now we plotted the points and divided them into various groups.

Let’s say now we have a query point ‘Q’ to which we have to find the nearest neighbor.

Using the tree we made earlier, we traverse through it to find the correct Node.

Using Node 3, we find a candidate nearest neighbour.

But we can easily see that it is in fact not the nearest neighbour to the query point.

We now traverse one level up, to Node 1. We do this because the nearest neighbor may not necessarily fall into the same node as the query point.

Do we need to inspect all remaining data points in Node 1 ?

We can check whether the tightest bounding box containing all the points of Node 4 is closer to the query than the current nearest point.

This time, the bounding box for Node 4 lies within the circle, indicating that Node 4 may contain a point that’s closer to the query.

We now traverse one level up, to Root.

Do we need to inspect all remaining data points in Node 2 ?

We can check whether the tightest bounding box containing all the points of Node 2 is closer to the query than the current nearest point.

We can see that this tightest box is far from the current nearest point. Hence, we can prune that part of the tree.

Since we have traversed the whole tree, we are done: the marked data point is indeed the true nearest neighbour of the query.
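This worked example can be reproduced with scikit-learn's KDTree; the query point Q below is hypothetical, since its coordinates are not given above:

```python
import numpy as np
from sklearn.neighbors import KDTree

# The seven 2-D data points from the example above
points = np.array([
    [0.54, 0.93], [0.96, 0.86], [0.42, 0.67], [0.11, 0.53],
    [0.64, 0.29], [0.27, 0.75], [0.81, 0.63],
])

# Stop splitting once a node holds at most 2 points, as in the example
tree = KDTree(points, leaf_size=2)

# Hypothetical query point Q (its coordinates are not given in the text)
Q = np.array([[0.72, 0.55]])
dist, idx = tree.query(Q, k=1)
print("Nearest neighbour: data point", idx[0][0], "at distance", round(dist[0][0], 3))
```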

3) Logistic Regression with L1 regularization And L2 regularization:-

Logistic regression is a classification technique. The assumption logistic regression makes is that the classes are almost or perfectly linearly separable. The task is to find the hyperplane that best separates the classes (positive class and negative class). Although it is fundamentally a binary classifier, it can also be extended to problems with more than two classes (e.g., using one-vs-rest).

Most machine learning engineers and data scientists use logistic regression as a baseline model.

Here class label 0 represents the negative class, class label 1 represents the positive class, and the line separating the points is the best hyperplane, with normal w.

w* = argmax_w ∑_{i=1}^{n} y_i * w^T * x_i

w* is the best or optimal hyperplane, i.e. the one which maximizes the sum of y_i * w^T * x_i.

w^T means w transpose; w is the normal to the hyperplane we are dealing with, and it is represented as a row vector.

The optimization problem is:

w* = argmin_w ∑_{i=1}^{n} log(1 + exp(-z_i))   → equation (1)

where z_i = y_i * w^T * x_i is also known as the signed distance.

If we pick w such that all the training points are correctly classified and all the z_i tend to +infinity, then we get the optimal w*.

If all the training points are correctly classified we have an overfitting problem (the model does a perfect job on the training set but performs very badly on the test set, i.e. the training errors are almost zero but the test errors are very high), and each z_i tending to infinity causes the same problem. To overcome this we use regularization techniques.

Regularization :

Regularization is a technique used to prevent the overfitting problem. It adds a regularization term to equation (1) (i.e., the optimization problem) in order to prevent the model from overfitting.

The regression model which uses L1 regularization is called Lasso Regression and model which uses L2 is known as Ridge Regression.

Ridge Regression (L2 norm):

L2-norm loss function is also known as least squares error (LSE).

w* = argmin_w ∑_{i=1}^{n} log(1 + exp(-z_i)) + λ * ∑_j (w_j)²

∑_j (w_j)² is the regularization term and ∑_i log(1 + exp(-z_i)) is the loss term; λ is a hyperparameter.

We add the regularization term (i.e., the squared magnitude of the weights) to the loss term to make sure that the model does not overfit.

Here we minimize both the loss term and the regularization term. If the hyperparameter λ is 0 there is no regularization term, so the model will overfit; if λ is very large, the weights are penalized too heavily, which leads to underfitting.

We can find the best hyper parameter by using cross validation.

Lasso Regression (L1 norm):

L1-norm loss function is also known as least absolute deviations (LAD), least absolute errors (LAE).

In L1 regularization we use the L1 norm instead of the L2 norm:

w* = argmin_w ∑_{i=1}^{n} log(1 + exp(-z_i)) + λ * ||w||₁

Here the L1 norm term will also avoid the model to undergo overfit problem. The advantage of using L1 regularization is Sparsity.

Sparsity:

A vector (w in this case) is said to be sparse when most of its cells (the w_i's in this case) are zero.

w* is said to be sparse when most of the w_i's are zero.

If we use L1 regularization in logistic regression, the weights of all the less important features become zero. If we use L2 regularization, the w_i values become small but not necessarily zero.

Below is the code to check how sparsity increases as the hyperparameter value increases.

Here, we check how sparsity increases as we increase λ (or, equivalently, decrease C, since C = 1/λ) when the L1 regularizer is used.

In the code, the hyperparameter C is the inverse of the regularization strength; it must be a positive float.
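A possible version of that check, assuming X_train and y_train are the one-hot-encoded features and the 0/1 target from the earlier steps; the liblinear solver supports the L1 penalty:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sparsity check: count zero weights as lambda grows (i.e. as C = 1/lambda shrinks)
for C in [10, 1, 0.1, 0.01, 0.001]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(X_train, y_train)  # X_train, y_train assumed from earlier steps
    n_zero = int(np.sum(clf.coef_ == 0))
    print(f"C = {C:<6} -> zero weights: {n_zero} / {clf.coef_.size}")
```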

4) Linear SVM:-

A linear SVM finds the maximum-margin hyperplane that separates the classes. Fast solvers (for example, cutting-plane methods) make it one of the quickest machine learning algorithms for classification on very large datasets.

Linear SVM training is a linearly scalable routine, meaning that it builds an SVM model in CPU time that scales roughly linearly with the size of the training dataset, while still giving good accuracy when the classes are (approximately) linearly separable.
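A sketch of a linear SVM on this data with scikit-learn's LinearSVC, wrapped in CalibratedClassifierCV so we can get probabilities for AUC; the usual train/test variables are assumed from the earlier split, and C=1.0 is illustrative:

```python
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score

# LinearSVC has no predict_proba, so wrap it to obtain probabilities for AUC
svm = CalibratedClassifierCV(LinearSVC(C=1.0, max_iter=5000))
svm.fit(X_train, y_train)

y_score = svm.predict_proba(X_test)[:, 1]
print("Linear SVM AUC:", roc_auc_score(y_test, y_score))
```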

5) RBF SVM:-

Gaussian RBF (Radial Basis Function) is another popular kernel used in SVM models. The RBF kernel is a function whose value depends on the distance between two points (or the distance from some reference point). The Gaussian kernel has the following form:

K(X1, X2) = exp(-γ * ||X1 — X2||²), where ||X1 — X2|| is the Euclidean distance between X1 and X2.

Using the distance in the original space, we calculate the dot product (similarity) of X1 and X2 in the implicit feature space.

Note: the kernel value acts as a similarity measure that decreases as the distance between the two points grows.

Parameters:

1. C: the inverse of the strength of regularization.

Behaviour: as the value of C increases, the model tends to overfit.

As the value of C decreases, the model tends to underfit.

2. γ: gamma (used only with the RBF kernel).

Behaviour: as the value of γ increases, the model tends to overfit.

As the value of γ decreases, the model tends to underfit.
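A sketch of an RBF-kernel SVM with those two hyperparameters; the values of C and gamma here are illustrative and would be tuned with cross-validation, and the usual train/test variables are assumed:

```python
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# RBF kernel: K(x1, x2) = exp(-gamma * ||x1 - x2||^2); tune C and gamma by CV
rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.01, probability=True)
rbf_svm.fit(X_train, y_train)

y_score = rbf_svm.predict_proba(X_test)[:, 1]
print("RBF SVM AUC:", roc_auc_score(y_test, y_score))
```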

6) Decision Tree:-

A decision tree is nothing but a nested if…else classifier, i.e. an if…else inside another if…else.

This nested structure of if…else conditions is the structure of a decision tree: at the leaf nodes we decide which class label to assign. Decision trees are highly interpretable because each prediction follows a sequence of logical conditions.

Building a Decision Tree:-

Entropy:-

Entropy H(y) = -∑ P(y) log(P(y))

For example, suppose we have 2 classes with 9 +ve and 5 -ve points.

P(y=+ve) = 9/14 and P(y=-ve) = 5/14
H(y) = -((9/14) log2(9/14) + (5/14) log2(5/14))
H(y) ≈ 0.94

If both class probabilities are equal, entropy is at its maximum value of 1. If one class fully dominates the other, entropy is at its minimum value of 0.
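A small sketch that reproduces this entropy calculation in Python:

```python
import math

def entropy(counts):
    """H(y) = -sum p * log2(p) over the class proportions."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 2))  # 0.94, matching the example above
print(entropy([7, 7]))            # 1.0: maximum entropy for balanced classes
```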

7) Random Forest:-

Random Forest is one of the most popular bagging techniques. The “Random” in the name comes from random bootstrap sampling (and random column sampling), and “Forest” comes from the ensemble of decision trees.

Random Forest = Decision Tree + Bagging + Column Sampling

Bagging:-

Sampling with replacement randomly picks a point and puts it in the sampled dataset; since the picked point is not deleted from the original data, the same point may be picked more than once. This sampling is called bootstrap sampling.

Training a model on each bootstrapped sample and then combining all the models is called aggregation. For classification we use majority voting, and for regression we use the mean or median.

When the training data changes slightly, for example by removing some points, the impact on the aggregated result is very small because of bootstrapped sampling. Bagging takes a bunch of low-bias, high-variance models and combines them; the result has low bias and reduced variance.

Column sampling is also called feature bagging.

Random Forest = Decision Tree base learners of reasonable depth + row sampling with replacement + column sampling + aggregation

=> Random Forest is a simple extension of the decision tree. It does not work very well with high-dimensional data or with categorical features that have many categories. In a decision tree, the bias-variance trade-off depends on the depth, but in a Random Forest it depends on the number of base learners. In a Random Forest we train fairly deep trees and control the variance by choosing the right number of base learners. The time complexity is high, but the performance is very good.
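A sketch of a Random Forest on this data; the default bootstrap=True gives the row sampling with replacement, max_features="sqrt" gives the column sampling, and n_estimators controls the number of base learners (values are illustrative, usual train/test variables assumed):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Deep trees (low bias); variance is controlled by the number of base learners.
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                            n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

y_score = rf.predict_proba(X_test)[:, 1]
print("Random Forest AUC:", roc_auc_score(y_test, y_score))
```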

8) XGBOOST:-

XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are considered best-in-class right now.

The algorithm differentiates itself in the following ways:

1. A wide range of applications: Can be used to solve regression, classification, ranking, and user-defined prediction problems.

2. Portability: Runs smoothly on Windows, Linux, and OS X.

3. Languages: Supports all major programming languages including C++, Python, R, Java, Scala, and Julia.

4. Cloud Integration: Supports AWS, Azure, and Yarn clusters and works well with Flink, Spark, and other ecosystems.

How to build an intuition for XGBoost?

Decision trees, in their simplest form, are easy-to-visualize and fairly interpretable algorithms but building intuition for the next-generation of tree-based algorithms can be a bit tricky. See below for a simple analogy to better understand the evolution of tree-based algorithms.

Imagine that you are a hiring manager interviewing several candidates with excellent qualifications. Each step of the evolution of tree-based algorithms can be viewed as a version of the interview process.

1. Decision Tree: Every hiring manager has a set of criteria such as education level, number of years of experience, interview performance. A decision tree is analogous to a hiring manager interviewing candidates based on his or her own criteria.

2. Bagging: Now imagine instead of a single interviewer, now there is an interview panel where each interviewer has a vote. Bagging or bootstrap aggregating involves combining inputs from all interviewers for the final decision through a democratic voting process.

3. Random Forest: It is a bagging-based algorithm with a key difference wherein only a subset of features is selected at random. In other words, every interviewer will only test the interviewee on certain randomly selected qualifications (e.g. a technical interview for testing programming skills and a behavioral interview for evaluating non-technical skills).

4. Boosting: This is an alternative approach where each interviewer alters the evaluation criteria based on feedback from the previous interviewer. This ‘boosts’ the efficiency of the interview process by deploying a more dynamic evaluation process.

5. Gradient Boosting: A special case of boosting where errors are minimized by a gradient descent algorithm, e.g. the strategy consulting firms leverage by using case interviews to weed out less qualified candidates.

6. XGBoost: Think of XGBoost as gradient boosting on ‘steroids’ (well, it is called ‘Extreme Gradient Boosting’ for a reason!). It is a combination of software and hardware optimization techniques to yield superior results using less computing resources in the shortest amount of time.
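A sketch of XGBoost on this data with the xgboost package; the hyperparameter values are illustrative and the usual train/test variables are assumed:

```python
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

# Gradient-boosted trees; these hyperparameter values are illustrative
xgb = XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1,
                    eval_metric="auc")
xgb.fit(X_train, y_train)

y_score = xgb.predict_proba(X_test)[:, 1]
print("XGBoost AUC:", roc_auc_score(y_test, y_score))
```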

RESPONSE ENCODING:-

It is a technique to represent categorical data while solving a machine learning classification problem. As part of this technique, we represent the probability of a data point belonging to a particular class given a category. So for a K-class classification problem, we get K new features which embed the probability of the data point belonging to each class based on the value of the categorical variable. Mathematically, we calculate it as

P(class=X | category=A) = P(category=A ∩ class=X) / P(category=A)

Consider the following example of a categorical dataset which contains values for variable ‘state’ and corresponding binary class label.

As part of response coding, first we compute a response table to represent the number of data points belonging to each output class for a given category.

Once we have the response table, we encode this information by adding the same number of features in the dataset as the cardinality of the class labels to represent the probability of the data point with given category, belonging to a particular class.
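A small illustration of response coding with pandas; the "state" values and labels below are made up for the example:

```python
import pandas as pd

# Made-up example: a categorical variable "state" with a binary class label
df_demo = pd.DataFrame({
    "state": ["CA", "CA", "NY", "NY", "NY", "TX"],
    "label": [1, 0, 1, 1, 0, 0],
})

# Response table: counts of each class for every category
response_table = pd.crosstab(df_demo["state"], df_demo["label"])

# Convert counts into P(class | category); one new feature per class
response_coding = response_table.div(response_table.sum(axis=1), axis=0)
print(response_coding)

# Attach the probabilities to each data point as new features
df_demo[["p_class_0", "p_class_1"]] = response_coding.loc[df_demo["state"]].values
print(df_demo)
```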

OneHotEncoder Performance Table:-

Response-Coding Performance Table:-

FINAL PIPELINE MODEL:-

The workflow of any machine learning project includes all the steps required to build it. A proper ML project basically consists of four main parts, given as follows:

1. Gathering data:
The process of gathering data depends on the project; it can be real-time data or data collected from various sources such as a file, a database, a survey or other sources.

2. Data pre-processing:
Usually the collected data contains a lot of missing values, extremely large values, unorganized text or noisy data, and thus cannot be used directly in the model; therefore, the data requires some pre-processing before entering the model.

3. Training and testing the model: Once the data is ready, it can be put into the machine learning model. Before that, it is important to have an idea of which model to use, as this may give a good performance output. The dataset is divided into 3 basic sections: the training set, the validation set and the test set. The main aim is to train the model on the training set, tune the hyperparameters using the validation set and then test the performance on the test set, as shown in the sketch after this list.

4. Evaluation:
Evaluation is part of the model development process. It helps to find the model that best represents the data and shows how well the chosen model will work in the future. It is done after the models have been trained with the different algorithms; the goal is to compare their evaluation results and choose the best model accordingly.
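A minimal sketch of the train/validation/test split described in step 3, assuming df_encoded is the one-hot-encoded dataframe from earlier; the split ratios are illustrative:

```python
from sklearn.model_selection import train_test_split

# df_encoded is assumed to be the one-hot-encoded dataframe from earlier
X = df_encoded.drop(columns=["y"])
y = df_encoded["y"]

# 80/20 train+validation vs test, then 80/20 train vs validation
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, stratify=y_train_full, random_state=42)

print(X_train.shape, X_val.shape, X_test.shape)
```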

References:-

  1. Applied AI Course, appliedaicourse.com
  2. If you want to go through the code, please visit https://github.com/Mahi98uppalapati/Bank-Marketing-Analysis/blob/main/BANK_MARKETING_Dataset_1.ipynb

Contact Details:-

Name: Uppalapati Mahitha

Mail: mahi98uppalapati@gmail.com

LinkedIn profile: https://www.linkedin.com/in/mahitha-uppalapati-555492188/
